udacity_eda/lesson3/lesson3_student.rmd
2018-04-17 23:17:42 -08:00

Lesson 3
========================================================
***
### What to Do First?
Notes:
Read in Pseudo Facebook data.
### Pseudo-Facebook User Data
Notes:
```{r Pseudo-Facebook User Data}
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
```
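A minimal sketch of what the `read.csv(..., sep='\t')` call does, reading an inline two-row table instead of the real file. The sample rows and values are made up for illustration. `read.delim()` uses tab as its default separator, so it should give the same result.

```r
# Hypothetical two-row sample standing in for pseudo_facebook.tsv
tsv_text <- "userid\tage\tgender\n1\t14\tmale\n2\t21\tfemale\n"

# Same call shape as above, but reading from a string via `text =`
sample_csv <- read.csv(text = tsv_text, sep = "\t")

# read.delim() defaults to sep = "\t"
sample_delim <- read.delim(text = tsv_text)

names(sample_csv)
identical(sample_csv, sample_delim)
```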
***
### Histogram of Users' Birthdays
Notes:
```{r Histogram of Users\' Birthdays}
# install.packages('ggplot2')
library(ggplot2)
qplot(x = dob_day, data = pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31)
```
***
#### What are some things that you notice about this histogram?
Response:
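Day 1 and day 31 stand out: day 1 has a large spike, and day 31 has fewer users than the other days, since not every month has 31 days.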
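If day 31 shows fewer users than other days, part of that is expected from the calendar alone. The sketch below counts how many months contain each day of the month. It is plain calendar arithmetic and does not use the `pf` data.

```r
# Days in each month of a non-leap year
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# For each day of the month 1..31, count the months that contain it
months_with_day <- sapply(1:31, function(d) sum(days_in_month >= d))

months_with_day[c(1, 29, 30, 31)]  # 12 11 11 7
```

Only 7 of 12 months have a 31st, so a lower bar there is not an anomaly by itself. The calendar gives no such reason for a spike on day 1.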
### Moira's Investigation
Notes:
Moira is looking at how people estimate their audience size.
### Estimating Your Audience Size
Notes:
#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:
#### How many of your friends do you think saw that post?
Response:
#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:
about 10%
### Perceived Audience Size
Notes:
***
### Faceting
Notes:
```{r Faceting}
qplot(x = dob_day, data = pf, binwidth = 1) +
  scale_x_continuous(breaks = 1:31) +
  facet_wrap(~dob_month, ncol = 3)
```
#### Let's take another look at our plot. What stands out to you here?
Response:
Most of the birthdays on the 1st are in January, probably from Facebook's default value or from users choosing the first option in the list.
### Be Skeptical - Outliers and Anomalies
Notes:
Some outliers are genuine extreme cases; others indicate bad data or errors in collection.
### Moira's Outlier
Notes:
#### Which case do you think applies to Moira's outlier?
Response:
Bad data about an extreme case.
### Friend Count
Notes:
#### What code would you enter to create a histogram of friend counts?
```{r Friend Count}
qplot(x=friend_count, data=pf, binwidth = 1)
```
#### How is this plot similar to Moira's first plot?
Response:
Both have a massive spike at low values, and the scale on the axes makes the rest of the distribution hard to read.
### Limiting the Axes
Notes:
```{r Limiting the Axes}
qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000))
```
### Exploring with Bin Width
Notes:
A smaller bin width shows finer detail, but the plot can become cluttered.
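A rough illustration of the bin-width trade-off using base R's `hist()` rather than `qplot()`, so it runs without `pf`. The `friend_count` vector here is simulated and only loosely imitates a long-tailed count variable.

```r
set.seed(1)
# Hypothetical long-tailed friend counts, capped at 1000 (not the real pf data)
friend_count <- pmin(round(rexp(500, rate = 1 / 150)), 1000)

# Same data, two bin widths; plot = FALSE returns the bin counts only
narrow <- hist(friend_count, breaks = seq(0, 1000, by = 10), plot = FALSE)
wide   <- hist(friend_count, breaks = seq(0, 1000, by = 100), plot = FALSE)

length(narrow$counts)  # 100 bins
length(wide$counts)    # 10 bins
```

Both histograms hold the same 500 observations. The narrow one spreads them over ten times as many bars, so each bar is noisier.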
### Adjusting the Bin Width
Notes:
### Faceting Friend Count
```{r Faceting Friend Count}
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```
***
### Omitting NA Values
Notes:
```{r Omitting NA Values}
qplot(x = friend_count, data = pf[!is.na(pf$gender), ], binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```
`na.omit(pf)` also works, but be careful: it drops every row that has an NA in any column, not just in gender.
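The difference between the two approaches can be seen on a tiny made-up data frame. The column names mirror `pf`, but the values are hypothetical.

```r
# Hypothetical mini data frame; values are invented for illustration
toy <- data.frame(
  gender       = c("male", NA, "female", "female"),
  friend_count = c(10, 20, 30, 40),
  tenure       = c(100, 200, NA, 400)
)

by_gender_only <- toy[!is.na(toy$gender), ]  # drops only the NA-gender row
all_complete   <- na.omit(toy)               # drops any row containing an NA

nrow(by_gender_only)  # 3
nrow(all_complete)    # 2
```

The third row has a valid gender but a missing tenure, so `na.omit()` removes it even though it would be fine for a gender facet.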
### Statistics 'by' Gender
Notes:
```{r Statistics \'by\' Gender}
table(pf$gender)
by(pf$friend_count, pf$gender, summary)
```
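`by()` with `summary` prints a full summary per group. When only one statistic is needed for arithmetic, `tapply()` returns it as a named vector. A small sketch on made-up numbers (not the real `pf` values):

```r
# Hypothetical friend counts for two groups
toy <- data.frame(
  gender       = c("female", "female", "female", "male", "male", "male"),
  friend_count = c(50, 100, 900, 40, 80, 300)
)

# Per-group summaries, same call shape as above
by(toy$friend_count, toy$gender, summary)

# Per-group medians as a named vector, then their difference
medians <- tapply(toy$friend_count, toy$gender, median)
median_diff <- unname(medians["female"] - medians["male"])
median_diff
```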
#### Who on average has more friends: men or women?
Response:
Women
#### What's the difference between the median friend count for women and men?
Response:
22
#### Why would the median be a better measure than the mean?
Response:
Because it is the middle number in the dataset and is not as influenced by the extreme outliers.
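A quick check of that claim on invented numbers: adding one extreme value moves the mean a lot and the median only slightly.

```r
# Hypothetical friend counts, then the same values plus one extreme user
friend_count <- c(10, 20, 30, 40, 50)
with_outlier <- c(friend_count, 5000)

mean(friend_count)     # 30
median(friend_count)   # 30

mean(with_outlier)     # about 858.3
median(with_outlier)   # 35
```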
### Tenure
Notes:
```{r Tenure}
qplot(x = tenure, data = pf, binwidth = 30,
      color = I('black'), fill = I('#099DD9'))
```
***
#### How would you create a histogram of tenure by year?
```{r Tenure Histogram by Year}
qplot(x = tenure / 365, data = pf, binwidth = 0.25,
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = 0:8)
```
***
### Labeling Plots
Notes:
```{r Labeling Plots}
qplot(x = tenure / 365, data = pf, binwidth = 0.25,
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = 0:7, limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')
```
***
### User Ages
Notes:
```{r User Ages}
qplot(x = age, data = pf, binwidth = 1,
      color = I('black'), fill = I('#099DD9')) +
  scale_x_continuous(breaks = seq(0, 115, 5), limits = c(12, 115)) +
  xlab('Ages of Facebook Users') +
  ylab('Number of users in sample')
```
#### What do you notice?
Response:
There is an abnormally large number of users over 100 years old.
### The Spread of Memes
Notes:
Get the minimum and maximum age from the data with `summary(pf$age)`.
### Lada's Money Bag Meme
Notes:
Memes tend to recur.
A log scale, instead of a linear one, can reveal low counts that are hard to see on a linear axis.
### Transforming Data
Notes:
Engagement variables are often long-tailed (overdispersed).
`log10(variable)` will return `-Inf` for values of 0, where the logarithm is undefined; adding 1 before taking the log avoids this.
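The `-Inf` behavior is easy to confirm on a few hand-picked values (hypothetical, chosen so the shifted logs come out as whole numbers):

```r
friend_count <- c(0, 9, 99, 999)  # hypothetical values, including a zero

raw_log     <- log10(friend_count)      # first element is -Inf
shifted_log <- log10(friend_count + 1)  # 0 1 2 3

raw_log
shifted_log
```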
### Add a Scaling Layer
Notes:
```{r Add a Scaling Layer}
qplot(x = log10(friend_count + 1), data = pf)
```
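The chunk above transforms the variable itself. A scaling layer instead transforms the axis, so the tick labels stay in the original units. The sketch below is one way to write that, assuming a recent ggplot2: it uses `ggplot()` with `scale_x_log10()` rather than `qplot()` and runs on simulated counts, not the real `pf` data. The `+ 1` shift is kept so that zeros are not dropped.

```r
library(ggplot2)

set.seed(1)
# Hypothetical long-tailed counts standing in for pf$friend_count
toy <- data.frame(friend_count = round(rexp(500, rate = 1 / 150)))

# Transform the axis, not the variable: tick labels stay in friend counts
p <- ggplot(toy, aes(x = friend_count + 1)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
p
```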
***
### Frequency Polygons
```{r Frequency Polygons}
```
***
### Likes on the Web
Notes:
```{r Likes on the Web}
```
***
### Box Plots
Notes:
```{r Box Plots}
```
#### Adjust the code to focus on users who have friend counts between 0 and 1000.
```{r}
```
***
### Box Plots, Quartiles, and Friendships
Notes:
```{r Box Plots, Quartiles, and Friendships}
```
#### On average, who initiated more friendships in our sample: men or women?
Response:
#### Write about some ways that you can verify your answer.
Response:
```{r Friend Requests by Gender}
```
Response:
***
### Getting Logical
Notes:
```{r Getting Logical}
```
Response:
***
### Analyzing One Variable
Reflection:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!