Notes:
Read in Pseudo Facebook data.
Notes:
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
## [1] "userid" "age"
## [3] "dob_day" "dob_year"
## [5] "dob_month" "gender"
## [7] "tenure" "friend_count"
## [9] "friendships_initiated" "likes"
## [11] "likes_received" "mobile_likes"
## [13] "mobile_likes_received" "www_likes"
## [15] "www_likes_received"
Notes:
# install.packages('ggplot2')
library(ggplot2)
qplot(x=dob_day, data=pf, binwidth = 1) +
scale_x_continuous(breaks = 1:31)
Response:
Day 1 and day 31
Notes:
Moira is looking at how people estimate their audience size.
Response:
about 10%
Notes:
qplot(x=dob_day, data=pf, binwidth = 1) +
scale_x_continuous(breaks = 1:31) +
facet_wrap(~dob_month, ncol=3)
Response:
Most of the birthdays on the 1st fall in January, probably because of Facebook's default value or because users picked the first option in the list.
Notes:
Some outliers are genuine extreme cases; others point to bad data or errors in collection.
Notes:
#### Which case do you think applies to Moira’s outlier?
Response:
Bad data about an extreme case.
Notes:
qplot(x=friend_count, data=pf, binwidth = 1)
Response:
There is a massive spike at low friend counts, and the default axis scale makes the rest of the distribution hard to read.
Notes:
qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000))
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes:
A smaller binwidth shows finer detail, but the plot can become cluttered.
Notes:
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2951 rows containing non-finite values (stat_bin).
Notes:
qplot(x = friend_count, data = pf[!is.na(pf$gender),], binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50)) +
facet_wrap(~gender)
## Warning: Removed 2949 rows containing non-finite values (stat_bin).
na.omit() also works, but be careful: it drops every row that has an NA in any column, not just the rows where gender is missing.
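A minimal sketch of that difference, using a small made-up data frame (`toy` and its values are hypothetical, not rows from pseudo_facebook.tsv). Filtering on `gender` alone keeps a row that is only missing `tenure`, while `na.omit` drops it as well.

```r
# Hypothetical toy data: one row is missing gender, another is missing tenure.
toy <- data.frame(
  gender = c("female", "male", NA, "male"),
  tenure = c(100, NA, 250, 400),
  friend_count = c(10, 20, 30, 40)
)

# Drop only the rows where gender is missing.
by_gender <- toy[!is.na(toy$gender), ]

# Drop every row that has an NA in any column.
complete_only <- na.omit(toy)

nrow(by_gender)      # rows with a known gender
nrow(complete_only)  # rows with no missing values at all
```

The row with a known gender but a missing tenure survives the first filter and is lost in the second, which matters if a later plot only needs `gender` and `friend_count`.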
Notes:
table(pf$gender)
##
## female male
## 40254 58574
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response:
Women
Response:
22
Response:
Because it is the middle number in the dataset and is not as influenced by the extreme outliers.
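A small illustration of why the median is the more robust choice here. The vector `counts` is hypothetical and only mimics the shape of the friend count data: mostly small values plus one extreme outlier.

```r
# Hypothetical friend counts: mostly small values plus one extreme outlier.
counts <- c(5, 20, 40, 60, 80, 4900)

mean(counts)    # pulled upward by the single large value
median(counts)  # middle of the sorted values, barely affected

# Removing the outlier changes the mean far more than the median.
trimmed <- counts[counts < 1000]
mean_shift <- mean(counts) - mean(trimmed)
median_shift <- median(counts) - median(trimmed)
mean_shift
median_shift
```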
Notes:
qplot(x = tenure, data = pf, binwidth=30,
color = I('black'), fill= I('#099DD9'))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
qplot(x = tenure / 365, data = pf, binwidth=0.25,
color = I('black'), fill= I('#099DD9')) +
scale_x_continuous(breaks = c(0:8))
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Notes:
qplot(x = tenure / 365, data = pf, binwidth=0.25,
color = I('black'), fill= I('#099DD9')) +
scale_x_continuous(breaks = c(0:8), lim = c(0,7)) +
xlab('Number of years using Facebook') +
ylab('Number of users in sample')
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Notes:
qplot(x = age, data = pf, binwidth=1,
color = I('black'), fill= I('#099DD9')) +
scale_x_continuous(breaks = seq(0,115,5), lim = c(12,115)) +
xlab('Ages of Facebook Users') +
ylab('Number of users in sample')
Response:
There is an abnormally large number of users over 100 years old…
Notes:
Get the minimum and maximum from the data with summary(pf$age).
Notes:
Memes tend to recur. A log scale instead of a linear scale makes the low counts visible.
Notes:
Engagement variables are often long-tailed (overdispersed). log10(variable) returns -Inf wherever the variable is 0, because the log of 0 is undefined.
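A quick sketch of the -Inf behaviour, using a hypothetical vector `likes`. Adding 1 before taking the log is a common workaround and is shown here as an assumption, not something these notes prescribe.

```r
# Hypothetical engagement counts, including a user with zero activity.
likes <- c(0, 1, 9, 99, 999)

log10(likes)      # the zero becomes -Inf
log10(likes + 1)  # adding 1 first keeps every value finite

# Count how many values are non-finite after the plain transform.
n_infinite <- sum(!is.finite(log10(likes)))
n_infinite
```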
Notes:
library(gridExtra)
g1 <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_sqrt(breaks = seq(0, 1500, 50), limits = c(0, 1500))
g2 <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 0.1) +
scale_x_log10(breaks = seq(0, 1500, 50), limits = c(1, 1500))
g3 <- ggplot(aes(x = friend_count), data = pf) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = seq(0, 1500, 50), limits= c(1, 1500))
grid.arrange(g3, g2, g1)
## Warning: Removed 3485 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 3485 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1523 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
ggplot(aes(x = friend_count, y = ..count../sum(..count..), color = gender),
data = subset(pf, !is.na(gender))) +
labs(x = "Friend Count",
y = "Proportion of Users with that friend count") +
geom_freqpoly(binwidth = 50) +
scale_x_continuous(lim = c(1000, 5000), breaks = seq(1000, 5000, 500))
## Warning: Removed 95873 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_path).
Notes:
ggplot(aes(x = www_likes, y = ..count../sum(..count..), color = gender),
data = subset(pf, !is.na(gender))) +
labs(x = "WWW Likes",
y = "Proportion of Users with that www_likes count") +
geom_freqpoly(binwidth = 0.1) +
scale_x_continuous() +#lim = c(0, 8000), breaks = seq(3000, 8000, 50)) +
scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
by(pf$www_likes, pf$gender, sum)
## pf$gender: female
## [1] 3507665
## --------------------------------------------------------
## pf$gender: male
## [1] 1430175
Notes:
ggplot(aes(x = gender, y = friend_count, color = gender),
data = subset(pf, !is.na(gender))) +
labs(y = "Friend Count") +
geom_boxplot()
ggplot(aes(x = gender, y = friend_count, color = gender),
data = subset(pf, !is.na(gender))) +
labs(y = "Friend Count") +
geom_boxplot() +
coord_cartesian(ylim = c(0, 1000))
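coord_cartesian is the better choice here: limits set through scale_y_continuous remove the data points outside the range before the boxplot statistics are computed, so the quartiles change. coord_cartesian only zooms the view and leaves the data untouched.
The horizontal line inside each box is the median.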
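A minimal sketch of the difference between the two ways of limiting the y-axis, assuming ggplot2 is installed. The data frame `toy` is hypothetical; `layer_data()` is used to read back the statistics that ggplot2 computed for the boxplot layer.

```r
library(ggplot2)

# Hypothetical toy data: nine small values and one extreme outlier.
toy <- data.frame(group = "a", y = c(1:9, 5000))

base_plot <- ggplot(toy, aes(x = group, y = y)) + geom_boxplot()

# Scale limits turn out-of-range values into NA before the stats are computed.
clipped <- base_plot + scale_y_continuous(limits = c(0, 1000))

# coord_cartesian only zooms the view; the stats still use every row.
zoomed <- base_plot + coord_cartesian(ylim = c(0, 1000))

# `middle` in the layer data is the median drawn inside the box.
clipped_median <- suppressWarnings(layer_data(clipped)$middle)
zoomed_median <- layer_data(zoomed)$middle
clipped_median
zoomed_median
```

The two medians differ because the clipped plot never sees the outlier, while the zoomed plot does.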
Notes:
ggplot(aes(x = gender, y = friend_count, color = gender),
data = subset(pf, !is.na(gender))) +
labs(y = "Friend Count") +
geom_boxplot() +
coord_cartesian(ylim = c(0, 250))
by(pf$friend_count, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 37 96 242 244 4923
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 27 74 165 182 4917
Response:
#### Write about some ways that you can verify your answer.
Response:
ggplot(aes(x = gender, y = friendships_initiated, color = gender),
data = subset(pf, !is.na(gender))) +
labs(y = "Friendships Initiated") +
geom_boxplot() +
coord_cartesian(ylim = c(0, 150))
by(pf$friendships_initiated, pf$gender, summary)
## pf$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 49.0 113.9 124.8 3654.0
## --------------------------------------------------------
## pf$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 15.0 44.0 103.1 111.0 4144.0
Response:
Notes:
summary(pf$mobile_likes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 4.0 106.1 46.0 25111.0
summary(pf$mobile_likes > 0)
## Mode FALSE TRUE
## logical 35056 63947
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
#pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.6459 1.0000 1.0000
sum(pf$mobile_check_in)/length(pf$mobile_check_in)
## [1] 0.6459097
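The proportion computed above can also be obtained as the mean of a logical vector, since R treats TRUE as 1 and FALSE as 0. A short sketch with a hypothetical vector `mobile_likes`:

```r
# Hypothetical mobile like counts for five users.
mobile_likes <- c(0, 3, 0, 12, 7)

# Comparing to 0 gives TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0.
checked_in <- mobile_likes > 0
prop_logical <- mean(checked_in)

# Same result as summing a 0/1 indicator and dividing by its length.
indicator <- ifelse(mobile_likes > 0, 1, 0)
prop_indicator <- sum(indicator) / length(indicator)

prop_logical
prop_indicator
```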
Response:
Reflection:
I learned that you often need to transform a dataset before it shows meaningful information. For long-tailed data, the median is usually a better summary than the mean. I also learned several new ways of visualizing data and how to adjust plots to take a closer look at specific parts of the data.