
Lesson 3
========================================================

***

### What to Do First?
Notes:

Read in Pseudo Facebook data.


### Pseudo-Facebook User Data
Notes:

```{r Pseudo-Facebook User Data}
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
```

***

### Histogram of Users' Birthdays
Notes:

```{r Histogram of Users\' Birthdays}
# install.packages('ggplot2')
library(ggplot2)

qplot(x=dob_day, data=pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31)
```

***

#### What are some things that you notice about this histogram?
Response:

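Day 1 and day 31 stand out: day 1 has a spike, and day 31 is low, which is expected since only seven months have 31 days.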

### Moira's Investigation
Notes:

Moira is looking at how people estimate their audience size.

### Estimating Your Audience Size
Notes:


#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:

#### How many of your friends do you think saw that post?
Response:

#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:

about 10%

### Perceived Audience Size
Notes:

***
### Faceting
Notes:

```{r Faceting}
qplot(x=dob_day, data=pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month, ncol=3)
```

#### Let’s take another look at our plot. What stands out to you here?
Response:

Most of the birthdays on the 1st are in January, probably because of Facebook's default value or users choosing the first option in the list.

### Be Skeptical - Outliers and Anomalies
Notes:

Some outliers are genuine extreme cases; others indicate bad data or errors in collection.

### Moira's Outlier
Notes:
#### Which case do you think applies to Moira’s outlier?
Response:

Bad data about an extreme case.

### Friend Count
Notes:

#### What code would you enter to create a histogram of friend counts?

```{r Friend Count}
qplot(x=friend_count, data=pf, binwidth = 1)
```

#### How is this plot similar to Moira's first plot?
Response:

Both have a massive spike at low values, and the default axis scale is not very helpful.

### Limiting the Axes
Notes:

```{r Limiting the Axes}
qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000))
```

### Exploring with Bin Width
Notes:

A smaller binwidth shows more detail, but the plot can become cluttered.
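
A minimal sketch of this trade-off, using base R's `hist()` with `plot = FALSE` so it runs without the dataset or ggplot2. The `friend_counts` vector is made up for illustration and is not taken from `pf`.

```r
# Hypothetical friend counts (illustrative values, not from pf)
friend_counts <- c(0, 1, 1, 2, 3, 5, 8, 12, 14, 19, 21, 25, 33, 38, 40)

# Wide bins: 2 bins of width 20 covering 0 to 40
wide <- hist(friend_counts, breaks = seq(0, 40, by = 20), plot = FALSE)

# Narrow bins: 8 bins of width 5 covering 0 to 40
narrow <- hist(friend_counts, breaks = seq(0, 40, by = 5), plot = FALSE)

wide$counts    # few bars, coarse picture
narrow$counts  # more bars, finer detail, some of them empty
```

Both calls count the same 15 values; only the grouping changes.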

### Adjusting the Bin Width
Notes:

### Faceting Friend Count
```{r Faceting Friend Count}
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```

***

### Omitting NA Values
Notes:

```{r Omitting NA Values}
qplot(x = friend_count, data = pf[!is.na(pf$gender),], binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```

`na.omit()` also works, but be careful: it drops every row that has an NA in any column, not just in `gender`.
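
A small sketch of the difference, on a toy data frame. The column names mirror `pf`, but the values are hypothetical.

```r
# Toy data frame (hypothetical values): one NA in gender, one NA in tenure
toy <- data.frame(
  gender = c("male", "female", NA, "female"),
  friend_count = c(10, 250, 40, 3),
  tenure = c(100, NA, 365, 20)
)

# Targeted filter: drops only the row where gender is NA
by_gender <- subset(toy, !is.na(gender))

# na.omit(): drops every row with an NA in any column
complete_rows <- na.omit(toy)

nrow(by_gender)      # 3
nrow(complete_rows)  # 2
```

The targeted filter keeps the row whose `tenure` is missing, which is still usable for a plot that only needs `friend_count` and `gender`.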

### Statistics 'by' Gender
Notes:

```{r Statistics \'by\' Gender}
table(pf$gender)
by(pf$friend_count, pf$gender, summary)
```

#### Who on average has more friends: men or women?
Response:

Women

#### What's the difference between the median friend count for women and men?
Response:

22

#### Why would the median be a better measure than the mean?
Response:

Because it is the middle number in the dataset and is not as influenced by the extreme outliers.
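
A quick illustration with made-up numbers (not taken from `pf`): one extreme value pulls the mean far more than the median.

```r
# Hypothetical friend counts with one extreme value
friend_counts <- c(50, 60, 70, 80, 90, 4900)

mean(friend_counts)    # 875, pulled up by the single large value
median(friend_counts)  # 75, the midpoint of 70 and 80

# Without the extreme value, the mean moves a lot and the median only a little
mean(friend_counts[-6])    # 70
median(friend_counts[-6])  # 70
```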

### Tenure
Notes:

```{r Tenure}
qplot(x = tenure, data = pf, binwidth=30,
      color = I('black'), fill= I('#099DD9'))
```

***

#### How would you create a histogram of tenure by year?

```{r Tenure Histogram by Year}
qplot(x = tenure / 365, data = pf, binwidth=0.25,
      color = I('black'), fill= I('#099DD9')) +
  scale_x_continuous(breaks = c(0:8))
```

***

### Labeling Plots
Notes:

```{r Labeling Plots}
qplot(x = tenure / 365, data = pf, binwidth=0.25,
      color = I('black'), fill= I('#099DD9')) +
  scale_x_continuous(breaks = 0:8, limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')
```

***

### User Ages
Notes:

```{r User Ages}
qplot(x = age, data = pf, binwidth=1,
      color = I('black'), fill= I('#099DD9')) +
  scale_x_continuous(breaks = seq(0, 115, 5), limits = c(12, 115)) +
  xlab('Ages of Facebook Users') +
  ylab('Number of users in sample')
```

#### What do you notice?
Response:

There is an abnormally large number of users over 100 years old, which is more likely bad data than real ages.

### The Spread of Memes
Notes:

Get the minimum and maximum of the data with `summary(pf$age)`.
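
A short sketch on a hypothetical `ages` vector (illustrative values, not from `pf`). `range()` is shown as an alternative when only the two extremes are needed, for example to choose axis limits.

```r
# Hypothetical ages (illustrative values only)
ages <- c(13, 18, 25, 32, 47, 68, 113)

# Min., quartiles, mean, and Max. in one call
age_summary <- summary(ages)
age_summary

# Just the two extremes, as c(min, max)
age_range <- range(ages)
age_range
```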

### Lada's Money Bag Meme
Notes:

Memes tend to recur.
A log scale instead of a linear one makes low counts visible.

### Transforming Data
Notes:

Engagement variables are often long-tailed (overdispersed).
`log10(variable)` returns -Inf wherever the variable is 0, so add 1 before taking the log.
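
A minimal sketch of the zero problem and the `+ 1` workaround, using made-up counts rather than `pf$friend_count`.

```r
# Hypothetical friend counts including a zero
friend_counts <- c(0, 9, 99, 999)

# log10(0) is -Inf, which cannot be placed on a plot axis
raw_log <- log10(friend_counts)
raw_log[1]

# Adding 1 first keeps every value finite
shifted_log <- log10(friend_counts + 1)
shifted_log  # 0 1 2 3

is.finite(raw_log)      # FALSE for the zero entry
is.finite(shifted_log)  # all TRUE
```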

### Add a Scaling Layer
Notes:

```{r Add a Scaling Layer}
qplot(x = log10(friend_count + 1), data = pf)
```

***


### Frequency Polygons

```{r Frequency Polygons}

```
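
A frequency polygon uses the same bin counts as a histogram but joins them with a line, which makes it easier to overlay groups on one plot. The sketch below shows the idea in base R rather than ggplot2; the two groups and their values are hypothetical, and converting counts to proportions is one possible choice for comparing groups of different sizes.

```r
# Hypothetical friend counts for two groups (illustrative only)
group_a <- c(2, 4, 7, 11, 13, 18, 22, 27)
group_b <- c(1, 3, 5, 6, 9, 12, 16, 29)
breaks <- seq(0, 30, by = 10)

# Same binning as a histogram, keeping bin midpoints and counts
h_a <- hist(group_a, breaks = breaks, plot = FALSE)
h_b <- hist(group_b, breaks = breaks, plot = FALSE)

# Proportions within each group instead of raw counts
prop_a <- h_a$counts / sum(h_a$counts)
prop_b <- h_b$counts / sum(h_b$counts)

# Join the bin midpoints with lines: one polygon per group
plot(h_a$mids, prop_a, type = "l", ylim = c(0, 1),
     xlab = "Friend count", ylab = "Proportion of group")
lines(h_b$mids, prop_b, lty = 2)
```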

***

### Likes on the Web
Notes:

```{r Likes on the Web}

```


***

### Box Plots
Notes:

```{r Box Plots}

```

#### Adjust the code to focus on users who have friend counts between 0 and 1000.

```{r}

```
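
One thing to watch when restricting friend counts to 0 through 1000: removing rows (or setting scale limits in ggplot2, which by default drops out-of-range values before statistics are computed) changes the quartiles the box is drawn from, whereas `coord_cartesian()` only zooms the view. A base R sketch of the effect on the quartiles, with hypothetical values:

```r
# Hypothetical friend counts with two values above 1000
friend_counts <- c(10, 50, 100, 200, 400, 800, 2000, 4000)

# Quartiles of all users
quantile(friend_counts)

# Quartiles after dropping users above 1000: the box itself changes
kept <- friend_counts[friend_counts <= 1000]
quantile(kept)

median(friend_counts)  # 300
median(kept)           # 150
```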

***

### Box Plots, Quartiles, and Friendships
Notes:

```{r Box Plots, Quartiles, and Friendships}

```

#### On average, who initiated more friendships in our sample: men or women?
Response:
#### Write about some ways that you can verify your answer.
Response:
```{r Friend Requests by Gender}

```

Response:

***

### Getting Logical
Notes:

```{r Getting Logical}

```
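
A minimal sketch of turning a logical test into a new variable. The names `mobile_likes` and `mobile_check_in` are assumptions for illustration, and the values are made up rather than read from `pf`.

```r
# Hypothetical mobile like counts (illustrative only)
mobile_likes <- c(0, 3, 0, 12, 1)

# Logical test turned into a 0/1 indicator, then a factor for counting
mobile_check_in <- factor(ifelse(mobile_likes > 0, 1, 0))

table(mobile_check_in)

# Share of users with at least one mobile like:
# the mean of a logical vector is the proportion of TRUE values
share_mobile <- mean(mobile_likes > 0)
share_mobile  # 0.6
```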

Response:

***

### Analyzing One Variable
Reflection:

***

Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!