Lesson 3 part 1
parent 8c07674904
commit 4e8faf1624
0  lesson2/Rplots.pdf  Normal file
13  lesson3/lesson3.Rproj  Normal file
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
@@ -6,13 +6,15 @@ Lesson 3
 ### What to Do First?
 Notes:
 
-***
+Read in Pseudo Facebook data.
 
 
 ### Pseudo-Facebook User Data
 Notes:
 
 ```{r Pseudo-Facebook User Data}
+pf <- read.csv('pseudo_facebook.tsv', sep='\t')
+names(pf)
 ```
 
 ***
@@ -21,8 +23,11 @@ Notes:
 Notes:
 
 ```{r Histogram of Users\' Birthdays}
-install.packages('ggplot2')
+# install.packages('ggplot2')
 library(ggplot2)
+
+qplot(x=dob_day, data=pf, binwidth = 1) +
+  scale_x_continuous(breaks = 1:31)
 ```
 
 ***
@@ -30,17 +35,16 @@ library(ggplot2)
 #### What are some things that you notice about this histogram?
 Response:
 
-***
+Day 1 and day 31
 
 ### Moira's Investigation
 Notes:
 
-***
+Moira is looking at how people estimate their audience size.
 
 ### Estimating Your Audience Size
 Notes:
 
-***
 
 #### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
 Response:
@@ -51,7 +55,7 @@ Response:
 #### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
 Response:
 
-***
+about 10%
 
 ### Perceived Audience Size
 Notes:
@@ -61,25 +65,27 @@ Notes:
 Notes:
 
 ```{r Faceting}
-
+qplot(x=dob_day, data=pf, binwidth = 1) +
+  scale_x_continuous(breaks = 1:31) +
+  facet_wrap(~dob_month, ncol=3)
 ```
 
 #### Let’s take another look at our plot. What stands out to you here?
 Response:
 
-***
+Most of the birthdays on the 1st are in January, probably from Facebook's default or from users choosing the first option in the list.
 
 ### Be Skeptical - Outliers and Anomalies
 Notes:
 
-***
+Some outliers are genuine extreme cases, but others point to bad data or errors in collection.
 
 ### Moira's Outlier
 Notes:
 #### Which case do you think applies to Moira’s outlier?
 Response:
 
-***
+Bad data about an extreme case.
 
 ### Friend Count
 Notes:
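The January observation in the hunk above can also be checked numerically instead of by eye. The following is a minimal sketch in base R; the `toy` data frame and its values are invented stand-ins for `pf`, and only the column names `dob_day` and `dob_month` come from the notes.

```r
# Invented stand-in for pf: eight users with a birth day and birth month.
toy <- data.frame(
  dob_day   = c(1, 1, 1, 1, 15, 1, 20, 1),
  dob_month = c(1, 1, 1, 2, 1, 3, 7, 1)
)

# Count birthdays that fall on the 1st, split by month.
first_of_month <- table(toy$dob_month[toy$dob_day == 1])
first_of_month
```

With the real data, the same `table()` call on `pf` would show whether January dominates the birthdays on the 1st, which is what the faceted histogram suggests.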
@@ -87,25 +93,25 @@ Notes:
 #### What code would you enter to create a histogram of friend counts?
 
 ```{r Friend Count}
-
+qplot(x=friend_count, data=pf, binwidth = 1)
 ```
 
 #### How is this plot similar to Moira's first plot?
 Response:
 
-***
+Massive spike at low values and the scale on the axes is not very helpful.
 
 ### Limiting the Axes
 Notes:
 
 ```{r Limiting the Axes}
-
+qplot(x=friend_count, data=pf, binwidth = 1, xlim=c(0,1000))
 ```
 
 ### Exploring with Bin Width
 Notes:
 
-***
+Lower binwidth gives more precise info but can become cluttered.
 
 ### Adjusting the Bin Width
 Notes:
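The bin width trade-off noted in the hunk above can be seen without plotting. This is a rough sketch in base R that bins with `cut()` and `table()`; the `friend_count` values and the `bin_counts` helper are hypothetical, and ggplot2 may align its bins differently.

```r
# Invented friend counts between 0 and 100.
friend_count <- c(3, 7, 12, 18, 25, 25, 31, 44, 58, 97)

# Hypothetical helper: count values per bin for a given bin width.
bin_counts <- function(x, binwidth) {
  table(cut(x, breaks = seq(0, 100, by = binwidth), include.lowest = TRUE))
}

narrow <- bin_counts(friend_count, 1)   # one bin per unit
wide   <- bin_counts(friend_count, 10)  # one bin per ten units

length(narrow)     # 100 bins
sum(narrow == 0)   # 91 of them are empty
length(wide)       # 10 bins
```

The narrow bins locate each value exactly but leave most bins empty, which is the clutter the note refers to; the wide bins give a readable shape at the cost of detail.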
@@ -116,7 +122,8 @@ Notes:
 # Add it to the code below.
 qplot(x = friend_count, data = pf, binwidth = 10) +
   scale_x_continuous(limits = c(0, 1000),
-                     breaks = seq(0, 1000, 50))
+                     breaks = seq(0, 1000, 50)) +
+  facet_wrap(~gender)
 ```
 
 ***
@@ -125,34 +132,43 @@ qplot(x = friend_count, data = pf, binwidth = 10) +
 Notes:
 
 ```{r Omitting NA Values}
-
+qplot(x = friend_count, data = pf[!is.na(pf$gender),], binwidth = 10) +
+  scale_x_continuous(limits = c(0, 1000),
+                     breaks = seq(0, 1000, 50)) +
+  facet_wrap(~gender)
 ```
 
-***
+Can use na.omit, but be careful: it omits rows that have NA in any column, not only gender.
 
 ### Statistics 'by' Gender
 Notes:
 
 ```{r Statistics \'by\' Gender}
-
+table(pf$gender)
+by(pf$friend_count, pf$gender, summary)
 ```
 
 #### Who on average has more friends: men or women?
 Response:
 
+Women
+
 #### What's the difference between the median friend count for women and men?
 Response:
 
+22
+
 #### Why would the median be a better measure than the mean?
 Response:
 
-***
+Because it is the middle value in the dataset and is less influenced by extreme outliers.
 
 ### Tenure
 Notes:
 
 ```{r Tenure}
-
+qplot(x = tenure, data = pf, binwidth=30,
+      color = I('black'), fill= I('#099DD9'))
 ```
 
 ***
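Two points from the hunk above are easy to confirm on a small example: `na.omit()` drops more rows than a filter on `gender` alone, and a single extreme value moves the mean much more than the median. The sketch below uses base R only; the `toy` data frame and its values are invented for illustration and are not taken from `pseudo_facebook.tsv`.

```r
# Invented stand-in for pf, with one NA in gender and one NA in tenure.
toy <- data.frame(
  gender       = c("female", "male", NA, "male", "female"),
  friend_count = c(120, 80, 40, 4000, 95),
  tenure       = c(300, NA, 150, 900, 420)
)

# Filtering on one column drops only the row where gender is NA.
by_gender <- toy[!is.na(toy$gender), ]
nrow(by_gender)   # 4

# na.omit() drops every row with an NA in any column, so the row with
# the missing tenure is lost as well.
complete <- na.omit(toy)
nrow(complete)    # 3

# The friend count of 4000 pulls the mean far above the median.
mean(by_gender$friend_count)    # 1073.75
median(by_gender$friend_count)  # 107.5
```

This is why the chunk in the diff subsets with `!is.na(pf$gender)` rather than calling `na.omit(pf)`, and why the median is the safer summary for a long-tailed variable such as `friend_count`.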
@@ -160,7 +176,9 @@ Notes:
 #### How would you create a histogram of tenure by year?
 
 ```{r Tenure Histogram by Year}
-
+qplot(x = tenure / 365, data = pf, binwidth=0.25,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = c(0:8))
 ```
 
 ***
@@ -169,7 +187,11 @@ Notes:
 Notes:
 
 ```{r Labeling Plots}
-
+qplot(x = tenure / 365, data = pf, binwidth=0.25,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = c(0:8), limits = c(0,7)) +
+  xlab('Number of years using Facebook') +
+  ylab('Number of users in sample')
 ```
 
 ***
@@ -178,34 +200,40 @@ Notes:
 Notes:
 
 ```{r User Ages}
-
+qplot(x = age, data = pf, binwidth=1,
+      color = I('black'), fill= I('#099DD9')) +
+  scale_x_continuous(breaks = seq(0,115,5), limits = c(12,115)) +
+  xlab('Ages of Facebook Users') +
+  ylab('Number of users in sample')
 ```
 
 #### What do you notice?
 Response:
 
-***
+There is an abnormal number of users over 100 years old...
 
 ### The Spread of Memes
 Notes:
 
-***
+Get the min and max from the data with summary(pf$age)
 
 ### Lada's Money Bag Meme
 Notes:
 
-***
+Memes tend to recur.
+A log scale instead of a linear one can reveal low counts.
 
 ### Transforming Data
 Notes:
 
-***
+Engagement variables are often long-tailed (overdispersed).
+log10(variable) will return -Inf for values such as 0, where the logarithm is undefined.
 
 ### Add a Scaling Layer
 Notes:
 
 ```{r Add a Scaling Layer}
-
+qplot(x = log10(friend_count + 1), data = pf)
 ```
 
 ***
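The `+ 1` inside `log10(friend_count + 1)` in the hunk above follows from the note about `-Inf`. A minimal base R sketch, using an invented `counts` vector rather than the real `pf$friend_count`:

```r
# Invented friend counts, including a user with zero friends.
counts <- c(0, 9, 99, 999)

# log10(0) is -Inf, which a histogram cannot place on a finite axis.
raw <- log10(counts)
is.infinite(raw[1])   # TRUE

# Shifting by 1 maps a count of 0 to 0 and keeps every value finite.
shifted <- log10(counts + 1)
all(is.finite(shifted))   # TRUE
```

The shift changes small counts noticeably but barely affects large ones, which is usually acceptable for a long-tailed variable when the goal is to see the overall shape of the distribution.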