Lesson 3 ======================================================== *** ### What to Do First? Notes: Read in Pseudo Facebook data. ### Pseudo-Facebook User Data Notes: ```{r Pseudo-Facebook User Data} pf <- read.csv('pseudo_facebook.tsv', sep='\t') names(pf) ``` *** ### Histogram of Users' Birthdays Notes: ```{r Histogram of Users\' Birthdays} # install.packages('ggplot2') library(ggplot2) qplot(x=dob_day, data=pf, binwidth = 1) + scale_x_continuous(breaks = 1:31) ``` *** #### What are some things that you notice about this histogram? Response: Day 1 and day 31 ### Moira's Investigation Notes: Moira is looking at how people estimate their audience size. ### Estimating Your Audience Size Notes: #### Think about a time when you posted a specific message or shared a photo on Facebook. What was it? Response: #### How many of your friends do you think saw that post? Response: #### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is? Response: about 10% ### Perceived Audience Size Notes: *** ### Faceting Notes: ```{r Faceting} qplot(x=dob_day, data=pf, binwidth = 1) + scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month, ncol=3) ``` #### Let’s take another look at our plot. What stands out to you here? Response: Most of the 1st birthdays are in Janually, probably from FBs default or choosing the first option on the list. ### Be Skeptical - Outliers and Anomalies Notes: Some outliers are extreme examples, but other times they show bad data or errors in collection ### Moira's Outlier Notes: #### Which case do you think applies to Moira’s outlier? Response: Bad data about an extreme case. ### Friend Count Notes: #### What code would you enter to create a histogram of friend counts? ```{r Friend Count} qplot(x=friend_count, data=pf, binwidth = 1) ``` #### How is this plot similar to Moira's first plot? Response: Massive spike at low values and the scale on the axes is not very helpful. ### Limiting the Axes Notes: ```{r Limiting the Axes} qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000)) ``` ### Exploring with Bin Width Notes: Lower binwidth gives more precise info but can become cluttered. ### Adjusting the Bin Width Notes: ### Faceting Friend Count ```{r Faceting Friend Count} # What code would you add to create a facet the histogram by gender? # Add it to the code below. qplot(x = friend_count, data = pf, binwidth = 10) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + facet_wrap(~gender) ``` *** ### Omitting NA Values Notes: ```{r Omitting NA Values} qplot(x = friend_count, data = pf[!is.na(pf$gender),], binwidth = 10) + scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) + facet_wrap(~gender) ``` Can use na.omit but be careful because that will omit rows that have na in other values too. ### Statistics 'by' Gender Notes: ```{r Statistics \'by\' Gender} table(pf$gender) by(pf$friend_count, pf$gender, summary) ``` #### Who on average has more friends: men or women? Response: Women #### What's the difference between the median friend count for women and men? Response: 22 #### Why would the median be a better measure than the mean? Response: Because it is the middle number in the dataset and is not as influenced by the extreme outliers. ### Tenure Notes: ```{r Tenure} qplot(x = tenure, data = pf, binwidth=30, color = I('black'), fill= I('#099DD9')) ``` *** #### How would you create a histogram of tenure by year? ```{r Tenure Histogram by Year} qplot(x = tenure / 365, data = pf, binwidth=0.25, color = I('black'), fill= I('#099DD9')) + scale_x_continuous(breaks = c(0:8)) ``` *** ### Labeling Plots Notes: ```{r Labeling Plots} qplot(x = tenure / 365, data = pf, binwidth=0.25, color = I('black'), fill= I('#099DD9')) + scale_x_continuous(breaks = c(0:8), lim = c(0,7)) + xlab('Number of yeas using Facebook') + ylab('Number of users in sample') ``` *** ### User Ages Notes: ```{r User Ages} qplot(x = age, data = pf, binwidth=1, color = I('black'), fill= I('#099DD9')) + scale_x_continuous(breaks = seq(0,115,5), lim = c(12,115)) + xlab('Ages of Facebook Users') + ylab('Number of users in sample') ``` #### What do you notice? Response: There are an abnormal amount of users over 100 years old... ### The Spread of Memes Notes: Get the min max from the data with summary(pf$age) ### Lada's Money Bag Meme Notes: Memes tend to reaccure. Log scale instead of linear can show low numbers ### Transforming Data Notes: Engagement variables are often long tailed (over dispersed) log10(variable) with show -Inf for undefined variables such as 0 ### Add a Scaling Layer Notes: ```{r Add a Scaling Layer} qplot(x = log10(friend_count + 1), data = pf) ``` *** ### Frequency Polygons ```{r Frequency Polygons} ``` *** ### Likes on the Web Notes: ```{r Likes on the Web} ``` *** ### Box Plots Notes: ```{r Box Plots} ``` #### Adjust the code to focus on users who have friend counts between 0 and 1000. ```{r} ``` *** ### Box Plots, Quartiles, and Friendships Notes: ```{r Box Plots, Quartiles, and Friendships} ``` #### On average, who initiated more friendships in our sample: men or women? Response: #### Write about some ways that you can verify your answer. Response: ```{r Friend Requests by Gender} ``` Response: *** ### Getting Logical Notes: ```{r Getting Logical} ``` Response: *** ### Analyzing One Variable Reflection: *** Click **KnitHTML** to see all of your hard work and to have an html page of this lesson, your answers, and your notes!