diff --git a/lesson3/lesson3_student.html b/lesson3/lesson3_student.html new file mode 100644 index 0000000..644955e --- /dev/null +++ b/lesson3/lesson3_student.html @@ -0,0 +1,531 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + +
+

Lesson 3

+
+
+

What to Do First?

+

Notes:

+

Read in Pseudo Facebook data.

+
+
+

Pseudo-Facebook User Data

+

Notes:

+
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
+names(pf)
+
##  [1] "userid"                "age"                  
+##  [3] "dob_day"               "dob_year"             
+##  [5] "dob_month"             "gender"               
+##  [7] "tenure"                "friend_count"         
+##  [9] "friendships_initiated" "likes"                
+## [11] "likes_received"        "mobile_likes"         
+## [13] "mobile_likes_received" "www_likes"            
+## [15] "www_likes_received"
+
+
+
+

Histogram of Users’ Birthdays

+

Notes:

+
# install.packages('ggplot2')
+library(ggplot2)
+
+qplot(x=dob_day, data=pf, binwidth = 1) +
+  scale_x_continuous(breaks = 1:31)
+

+
+
+

What are some things that you notice about this histogram?

+

Response:

+

Day 1 and day 31

+
+
+
+

Moira’s Investigation

+

Notes:

+

Moira is looking at how people estimate their audience size.

+
+
+

Estimating Your Audience Size

+

Notes:

+
+

Think about a time when you posted a specific message or shared a photo on Facebook. What was it?

+

Response:

+
+
+

How many of your friends do you think saw that post?

+

Response:

+
+
+

Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?

+

Response:

+

about 10%

+
+
+
+

Perceived Audience Size

+

Notes:

+
+
+
+

Faceting

+

Notes:

+
qplot(x=dob_day, data=pf, binwidth = 1) +
+  scale_x_continuous(breaks = 1:31) +
+  facet_wrap(~dob_month, ncol=3)
+

+
+

Let’s take another look at our plot. What stands out to you here?

+

Response:

+

Most of the birthdays on the 1st are in January, probably because of Facebook’s default setting or because users chose the first option in the list.

+
+
+
+

Be Skeptical - Outliers and Anomalies

+

Notes:

+

Some outliers are genuine extreme cases; others reflect bad data or errors in collection.

+
+
+

Moira’s Outlier

+

Notes:

Which case do you think applies to Moira’s outlier?

Response:

+

Bad data about an extreme case.

+
+
+

Friend Count

+

Notes:

+
+

What code would you enter to create a histogram of friend counts?

+
qplot(x=friend_count, data=pf, binwidth = 1)
+

+
+
+

How is this plot similar to Moira’s first plot?

+

Response:

+

Massive spike at low values and the scale on the axes is not very helpful.

+
+
+
+

Limiting the Axes

+

Notes:

+
qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000))
+
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
+

+
+
+

Exploring with Bin Width

+

Notes:

+

Lower binwidth gives more precise info but can become cluttered.
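
The trade-off can be checked without plotting. This is a minimal sketch on simulated data: the `friend_count` vector is made up (not the lesson's dataset) and `bin_counts` is a hypothetical helper name. It counts how many bins each width produces.

```r
# Simulated long-tailed counts standing in for pf$friend_count (assumption).
set.seed(42)
friend_count <- rnbinom(1000, size = 0.8, mu = 200)

# Hypothetical helper: bin counts for a given bin width, without drawing a plot.
bin_counts <- function(x, width) {
  breaks <- seq(0, max(x) + width, by = width)
  hist(x, breaks = breaks, plot = FALSE, right = FALSE)$counts
}

narrow <- bin_counts(friend_count, 1)   # one bin per friend count
wide   <- bin_counts(friend_count, 25)  # one bin per 25 friend counts

length(narrow)  # many bins, most holding only a few users
length(wide)    # far fewer bins, each summarising a wider range
```

Both vectors account for the same 1000 users; only the resolution differs.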

+
+
+

Adjusting the Bin Width

+

Notes:

+
+
+

Faceting Friend Count

+
# What code would you add to facet the histogram by gender?
+# Add it to the code below.
+qplot(x = friend_count, data = pf, binwidth = 10) +
+  scale_x_continuous(limits = c(0, 1000),
+                     breaks = seq(0, 1000, 50)) +
+  facet_wrap(~gender)
+
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
+

+
+
+
+

Omitting NA Values

+

Notes:

+
qplot(x = friend_count, data = pf[!is.na(pf$gender),], binwidth = 10) +
+  scale_x_continuous(limits = c(0, 1000),
+                     breaks = seq(0, 1000, 50)) +
+  facet_wrap(~gender)
+
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
+

+

You can use na.omit instead, but be careful: it omits every row that has an NA in any column, not just in gender.
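
A small sketch of the difference, using a toy data frame invented for illustration (the column names mirror the lesson's, the values do not):

```r
# Toy data frame (made up): NA appears in two different columns.
users <- data.frame(
  gender       = c("female", NA, "male", "male"),
  tenure       = c(100, 250, NA, 30),
  friend_count = c(12, 40, 7, 300)
)

# Keeps every row where gender is known, even if tenure is NA.
by_gender <- users[!is.na(users$gender), ]

# Drops any row that has an NA in any column.
complete <- na.omit(users)

nrow(by_gender)  # 3
nrow(complete)   # 2
```

Filtering on the one column of interest loses fewer rows than na.omit does.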

+
+
+

Statistics ‘by’ Gender

+

Notes:

+
table(pf$gender)
+
## 
+## female   male 
+##  40254  58574
+
by(pf$friend_count, pf$gender, summary)
+
## pf$gender: female
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##       0      37      96     242     244    4923 
+## -------------------------------------------------------- 
+## pf$gender: male
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##       0      27      74     165     182    4917
+
+

Who on average has more friends: men or women?

+

Response:

+

Women

+
+
+

What’s the difference between the median friend count for women and men?

+

Response:

+

22

+
+
+

Why would the median be a better measure than the mean?

+

Response:

+

Because it is the middle number in the dataset and is not as influenced by the extreme outliers.
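
To illustrate with made-up numbers (not taken from the dataset), adding a single extreme value moves the mean a long way but barely moves the median:

```r
# Made-up friend counts (illustration only).
typical      <- c(20, 35, 50, 74, 96, 120, 150)
with_outlier <- c(typical, 4900)  # add one extreme user

mean(typical)         # about 77.9
median(typical)       # 74

mean(with_outlier)    # about 680.6, pulled up by the single outlier
median(with_outlier)  # 85, barely changed
```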

+
+
+
+

Tenure

+

Notes:

+
qplot(x = tenure, data = pf, binwidth=30,
+      color = I('black'), fill= I('#099DD9'))
+
## Warning: Removed 2 rows containing non-finite values (stat_bin).
+

+
+
+

How would you create a histogram of tenure by year?

+
qplot(x = tenure / 365, data = pf, binwidth=0.25,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = c(0:8))
+
## Warning: Removed 2 rows containing non-finite values (stat_bin).
+

+
+
+
+
+

Labeling Plots

+

Notes:

+
qplot(x = tenure / 365, data = pf, binwidth=0.25,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = c(0:8), lim = c(0,7)) +
+  xlab('Number of years using Facebook') +
+  ylab('Number of users in sample')
+
## Warning: Removed 26 rows containing non-finite values (stat_bin).
+

+
+
+
+

User Ages

+

Notes:

+
qplot(x = age, data = pf, binwidth=1,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = seq(0,115,5), lim = c(12,115)) +
+  xlab('Ages of Facebook Users') +
+  ylab('Number of users in sample')
+

+
+

What do you notice?

+

Response:

+

There is an abnormally large number of users over 100 years old…

+
+
+
+

The Spread of Memes

+

Notes:

+

Get the minimum and maximum age from the data with summary(pf$age).

+
+
+

Lada’s Money Bag Meme

+

Notes:

+

Memes tend to recur. A log scale, instead of a linear one, makes the low counts visible.

+
+
+

Transforming Data

+

Notes:

+

Engagement variables are often long-tailed (overdispersed). log10(variable) will show -Inf where the logarithm is undefined, such as for values of 0.
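
A short sketch of the zero problem on a made-up vector. Adding 1 before taking the log is one common workaround; it shifts every value slightly, so treat it as a convenience rather than an exact transform.

```r
# Made-up counts that include a zero (illustration only).
x <- c(0, 1, 9, 99, 999)

log10(x)      # first element is -Inf because log10(0) is undefined
log10(x + 1)  # finite everywhere: 0 for the zero, then increasing values
```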

+
+
+

Add a Scaling Layer

+

Notes:

+
library(gridExtra)
+g1 <- ggplot(aes(x = friend_count), data = pf) +
+  geom_histogram(binwidth = 1) +
+  scale_x_sqrt(breaks = seq(0, 1500, 50), limits = c(0, 1500))
+g2 <- ggplot(aes(x = friend_count), data = pf) +
+  geom_histogram(binwidth = 0.1) +
+  scale_x_log10(breaks = seq(0, 1500, 50), limits = c(1, 1500))
+g3 <- ggplot(aes(x = friend_count), data = pf) +
+  geom_histogram(binwidth = 1) +
+  scale_x_continuous(breaks = seq(0, 1500, 50), limits= c(1, 1500))
+grid.arrange(g3, g2, g1)
+
## Warning: Removed 3485 rows containing non-finite values (stat_bin).
+
## Warning: Transformation introduced infinite values in continuous x-axis
+
## Warning: Removed 3485 rows containing non-finite values (stat_bin).
+
## Warning: Removed 1 rows containing missing values (geom_bar).
+
## Warning: Removed 1523 rows containing non-finite values (stat_bin).
+
## Warning: Removed 1 rows containing missing values (geom_bar).
+

+
+
+
+

Frequency Polygons

+
ggplot(aes(x = friend_count, y = ..count../sum(..count..),  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(x = "Friend Count",
+       y = "Proportion of Users with that friend count") +
+  geom_freqpoly(binwidth = 50) +
+  scale_x_continuous(lim = c(1000, 5000), breaks = seq(0, 1000, 50))
+
## Warning: Removed 95873 rows containing non-finite values (stat_bin).
+
## Warning: Removed 4 rows containing missing values (geom_path).
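
The y aesthetic above divides each bin's count by a total. Which total is used matters: dividing by all users and dividing by the users of each gender give different proportions. The base R sketch below, on made-up data, computes both so the two can be compared. It is not a reproduction of what ggplot2 does internally, so it is worth checking which normalisation your ggplot2 version applies.

```r
# Made-up sample (not the lesson's data).
gender       <- c("female", "female", "male", "male", "male", "male")
friend_count <- c(10, 120, 30, 40, 160, 20)

# Bin friend counts into intervals of width 50.
bins   <- cut(friend_count, breaks = seq(0, 200, by = 50), right = FALSE)
counts <- table(gender, bins)

overall <- counts / sum(counts)            # share of all users
within  <- prop.table(counts, margin = 1)  # share of users of that gender

overall["male", "[0,50)"]  # 0.5: three of six users overall
within["male", "[0,50)"]   # 0.75: three of four male users
```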
+

+
+
+
+

Likes on the Web

+

Notes:

+
ggplot(aes(x = www_likes, y = ..count../sum(..count..),  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(x = "WWW Likes",
+       y = "Proportion of users with that number of likes") +
+  geom_freqpoly(binwidth = 0.1) +
+  scale_x_continuous() +#lim = c(0, 8000), breaks = seq(3000, 8000, 50)) +
+  scale_x_log10()
+
## Scale for 'x' is already present. Adding another scale for 'x', which
+## will replace the existing scale.
+
## Warning: Transformation introduced infinite values in continuous x-axis
+
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
+

+
by(pf$www_likes, pf$gender, sum)
+
## pf$gender: female
+## [1] 3507665
+## -------------------------------------------------------- 
+## pf$gender: male
+## [1] 1430175
+
+
+
+

Box Plots

+

Notes:

+
ggplot(aes(x = gender, y = friend_count,  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(y = "Friend Count") +
+  geom_boxplot()
+

+
+

Adjust the code to focus on users who have friend counts between 0 and 1000.

+
ggplot(aes(x = gender, y = friend_count,  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(y = "Friend Count") +
+  geom_boxplot() +
+  coord_cartesian(ylim = c(0, 1000))
+

+

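coord_cartesian is better here because setting limits with scale_y_continuous removes the data points outside the range before the box plot statistics are computed. coord_cartesian only zooms the view and leaves the data untouched.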

+

The horizontal line inside each box is the median.
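
Why this matters can be sketched in base R with made-up numbers: dropping values above a limit before summarising changes the median, while keeping all values and only changing the view does not. The vector below is invented for illustration.

```r
# Made-up friend counts with a long tail (illustration only).
friend_count <- c(5, 40, 90, 200, 600, 1500, 3000)

# Removing values outside the limits first changes the statistic.
median(friend_count[friend_count <= 1000])  # 90

# Keeping every value and only zooming leaves the statistic unchanged.
median(friend_count)                        # 200
```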

+
+
+
+

Box Plots, Quartiles, and Friendships

+

Notes:

+
ggplot(aes(x = gender, y = friend_count,  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(y = "Friend Count") +
+  geom_boxplot() +
+  coord_cartesian(ylim = c(0, 250))
+

+
by(pf$friend_count, pf$gender, summary)
+
## pf$gender: female
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##       0      37      96     242     244    4923 
+## -------------------------------------------------------- 
+## pf$gender: male
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##       0      27      74     165     182    4917
+
+

On average, who initiated more friendships in our sample: men or women?

+

Response:

Write about some ways that you can verify your answer.

Response:

+
ggplot(aes(x = gender, y = friendships_initiated,  color = gender),
+       data = subset(pf, !is.na(gender))) +
+  labs(y = "Friendships Initiated") +
+  geom_boxplot() +
+  coord_cartesian(ylim = c(0, 150))
+

+
by(pf$friendships_initiated, pf$gender, summary)
+
## pf$gender: female
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##     0.0    19.0    49.0   113.9   124.8  3654.0 
+## -------------------------------------------------------- 
+## pf$gender: male
+##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##     0.0    15.0    44.0   103.1   111.0  4144.0
+

Response:

+
+
+
+
+

Getting Logical

+

Notes:

+
summary(pf$mobile_likes)
+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##     0.0     0.0     4.0   106.1    46.0 25111.0
+
summary(pf$mobile_likes > 0)
+
##    Mode   FALSE    TRUE 
+## logical   35056   63947
+
pf$mobile_check_in <- NA
+pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
+#pf$mobile_check_in <- factor(pf$mobile_check_in)
+summary(pf$mobile_check_in)
+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##  0.0000  0.0000  1.0000  0.6459  1.0000  1.0000
+
sum(pf$mobile_check_in)/length(pf$mobile_check_in)
+
## [1] 0.6459097
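
The same proportion can be read directly from a logical vector, since TRUE counts as 1 and FALSE as 0. A minimal sketch on a made-up vector (not the lesson's data):

```r
# Made-up mobile like counts (illustration only).
mobile_likes <- c(0, 3, 0, 12, 7)

checked_in <- mobile_likes > 0  # logical: TRUE where the user has mobile likes

mean(checked_in)                      # 0.6
sum(checked_in) / length(checked_in)  # same value, written out
```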
+

Response:

+
+
+
+

Analyzing One Variable

+

Reflection:

+

I learned that you often need to transform a dataset to show meaningful information. For data with long tails, the median is usually a better summary than the mean. I also learned several new ways of visualizing data and how to modify graphs to take a closer look at specific parts of the data.

+

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!

+
+
+ + + + +
+ + + + + + + + diff --git a/lesson3/lesson3_student.rmd b/lesson3/lesson3_student.rmd index fe4800e..cb351fd 100644 --- a/lesson3/lesson3_student.rmd +++ b/lesson3/lesson3_student.rmd @@ -233,7 +233,17 @@ log10(variable) with show -Inf for undefined variables such as 0 Notes: ```{r Add a Scaling Layer} -qplot(x = log10(friend_count + 1), data = pf) +library(gridExtra) +g1 <- ggplot(aes(x = friend_count), data = pf) + + geom_histogram(binwidth = 1) + + scale_x_sqrt(breaks = seq(0, 1500, 50), limits = c(0, 1500)) +g2 <- ggplot(aes(x = friend_count), data = pf) + + geom_histogram(binwidth = 0.1) + + scale_x_log10(breaks = seq(0, 1500, 50), limits = c(1, 1500)) +g3 <- ggplot(aes(x = friend_count), data = pf) + + geom_histogram(binwidth = 1) + + scale_x_continuous(breaks = seq(0, 1500, 50), limits= c(1, 1500)) +grid.arrange(g3, g2, g1) ``` *** @@ -242,7 +252,12 @@ qplot(x = log10(friend_count + 1), data = pf) ### Frequency Polygons ```{r Frequency Polygons} - +ggplot(aes(x = friend_count, y = ..count../sum(..count..), color = gender), + data = subset(pf, !is.na(gender))) + + labs(x = "Friend Count", + y = "Proportion of Users with that friend count") + + geom_freqpoly(binwidth = 50) + + scale_x_continuous(lim = c(1000, 5000), breaks = seq(0, 1000, 50)) ``` *** @@ -251,7 +266,15 @@ qplot(x = log10(friend_count + 1), data = pf) Notes: ```{r Likes on the Web} +ggplot(aes(x = www_likes, y = ..count../sum(..count..), color = gender), + data = subset(pf, !is.na(gender))) + + labs(x = "Friend Count", + y = "Proportion of Users with that friend count") + + geom_freqpoly(binwidth = 0.1) + + scale_x_continuous() +#lim = c(0, 8000), breaks = seq(3000, 8000, 50)) + + scale_x_log10() +by(pf$www_likes, pf$gender, sum) ``` @@ -261,22 +284,38 @@ Notes: Notes: ```{r Box Plots} - +ggplot(aes(x = gender, y = friend_count, color = gender), + data = subset(pf, !is.na(gender))) + + labs(y = "Friend Count") + + geom_boxplot() ``` + #### Adjust the code to focus on users who have friend 
counts between 0 and 1000. ```{r} - +ggplot(aes(x = gender, y = friend_count, color = gender), + data = subset(pf, !is.na(gender))) + + labs(y = "Friend Count") + + geom_boxplot() + + coord_cartesian(ylim = c(0, 1000)) ``` -*** +coord_cartesian is better because scale_y_continuous removes datapoints. coord_cartesian just changes the coordinate system. + +Black line is Median ### Box Plots, Quartiles, and Friendships Notes: ```{r Box Plots, Quartiles, and Friendships} +ggplot(aes(x = gender, y = friend_count, color = gender), + data = subset(pf, !is.na(gender))) + + labs(y = "Friend Count") + + geom_boxplot() + + coord_cartesian(ylim = c(0, 250)) +by(pf$friend_count, pf$gender, summary) ``` #### On average, who initiated more friendships in our sample: men or women? @@ -284,7 +323,13 @@ Response: #### Write about some ways that you can verify your answer. Response: ```{r Friend Requests by Gender} +ggplot(aes(x = gender, y = friendships_initiated, color = gender), + data = subset(pf, !is.na(gender))) + + labs(y = "Friend Count") + + geom_boxplot() + + coord_cartesian(ylim = c(0, 150)) +by(pf$friendships_initiated, pf$gender, summary) ``` Response: @@ -295,6 +340,15 @@ Response: Notes: ```{r Getting Logical} +summary(pf$mobile_likes) + +summary(pf$mobile_likes > 0) + +pf$mobile_check_in <- NA +pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0) +#pf$mobile_check_in <- factor(pf$mobile_check_in) +summary(pf$mobile_check_in) +sum(pf$mobile_check_in)/length(pf$mobile_check_in) ``` @@ -305,7 +359,7 @@ Response: ### Analyzing One Variable Reflection: -*** +I learned that often you need to transform the dataset to show meaningful information. Also with data that has long tails it is usually better to use the Median instead of the Mean. Also learned several new ways of visualizing the data and how to modify the graphs to take a closer look at certain parts of the data. 
Click **KnitHTML** to see all of your hard work and to have an html page of this lesson, your answers, and your notes! \ No newline at end of file