udacity_eda/lesson4/lesson4_student.rmd
2018-05-02 22:51:21 -08:00

351 lines
8.1 KiB
Plaintext

---
output:
pdf_document: default
html_document: default
---
Lesson 4
========================================================
***
### Scatterplots and Perceived Audience Size
Notes:
***
### Scatterplots
Notes:
```{r Scatterplots}
library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point()
```
***
#### What are some things that you notice right away?
Response:
All of the data points are grouped into vertical lines and that the younger the age the more likely they are to have more friends.
### ggplot Syntax
Notes:
```{r ggplot Syntax}
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point() +
xlim(13, 90)
summary(pf$age)
```
Build one layer at a time to find errors easier
### Overplotting
Notes:
```{r Overplotting}
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_jitter(alpha = 1/20) +
xlim(13, 90)
```
#### What do you notice in the plot?
Response:
The bar for 69 is still clearly visible and it is more obvious that the number generally decreases as the age increases.
### Coord_trans()
Notes:
```{r Coord_trans()}
```
#### Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
```{r}
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20) +
xlim(13, 90) +
coord_trans(y = "sqrt")
```
#### What do you notice?
First off coord_trans does not work with geom_jitter, second the datapoints near the bottom are more spread out vertically to present them as more of a focus.
To use jitter you need more advanced syntax to only jitter the ages, also to prevent possible negatives if 0 is jittered.
To do this in `geom_point()` pass `position = position_jitter(h = 0)`
```{r coord_trans_advanced}
ggplot(aes(x = age, y = friend_count), data = pf) +
geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
```
### Alpha and Jitter
Notes:
```{r Alpha and Jitter}
ggplot(aes(x = age, y = friendships_initiated, color = gender), data = pf) +
geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
xlim(13, 90) +
coord_trans(y = "sqrt")
```
***
### Overplotting and Domain Knowledge
Notes:
plotting as a percentage of the whole
### Conditional Means
Notes:
```{r Conditional Means}
library(dplyr)
age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
geom_line() +
xlim(13,90)
```
### Overlaying Summaries with Raw Data
Notes:
```{r Overlaying Summaries with Raw Data}
ggplot(aes(x = age, y = friendships_initiated), data = pf) +
geom_point(alpha = 1/10, position = position_jitter(h = 0), color = 'orange') +
xlim(13, 90) +
coord_trans(y = "sqrt") +
geom_line(stat = 'summary', fun.y = mean) +
geom_line(stat = 'summary', fun.y = median, color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) +
coord_cartesian(xlim = c(13,70), ylim = c(0,1000))
```
#### What are some of your observations of the plot?
Response:
I notice that the median is always lower than the mean and that the median is closer to the center of the main body of datapoints. It appears that the data is long tailed towards the high friend counts which pulls the mean upwards.
### Moira: Histogram Summary and Scatterplot
See the Instructor Notes of this video to download Moira's paper on perceived audience size and to see the final plot.
Notes:
***
### Correlation
Notes:
```{r Correlation}
cor.test(pf$age, pf$friend_count)
```
Look up the documentation for the cor.test function.
What's the correlation between age and friend count? Round to three decimal places.
Response:
-0.027
### Correlation on Subsets
Notes:
```{r Correlation on Subsets}
with(pf[pf$age <= 70,], cor.test(age, friend_count))
```
***
### Correlation Methods
Notes:
http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
## Create Scatterplots
Notes:
```{r}
library(ggplot2)
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point()#alpha = 1/20, position = position_jitter(h = 0)) +
#xlim(13, 90) +
#coord_trans(y = "sqrt")
```
***
### Strong Correlations
Notes:
```{r Strong Correlations}
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
geom_point() +
xlim(0, quantile(pf$www_likes_received, 0.95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method = 'lm', color = 'red')
```
What's the correlation betwen the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
```{r Correlation Calcuation}
with(pf, cor.test(www_likes_received, likes_received))
```
Response:
0.948
Variable is a superset of another
### Moira on Correlation
Notes:
Highly corelated can mean that variables are dependent on the same thing or are similar.
### More Caution with Correlation
Notes:
```{r More Caution With Correlation}
#install.packages('alr3')
library(alr3)
library(ggplot2)
data(Mitchell)
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
geom_point()
```
Create your plot!
```{r Temp vs Month}
```
***
### Noisy Scatterplots
a. Take a guess for the correlation coefficient for the scatterplot.
0.9
b. What is the actual correlation of the two variables?
(Round to the thousandths place)
```{r Noisy Scatterplots}
with(Mitchell, cor.test(Month, Temp))
```
***
### Making Sense of Data
Notes:
```{r Making Sense of Data}
ggplot(aes(Month, Temp), data = Mitchell) +
geom_point() +
scale_x_continuous(breaks = seq(0, 204, 12))
```
***
### A New Perspective
What do you notice?
Response:
There is a cyclical pattern to the data going from low to high and back to low every 12 months. This is why I originally said there seems to be a 0.9 correlation coefficient to the data because I saw this pattern the first time I looked at the plot.
Watch the solution video and check out the Instructor Notes!
Notes:
```{r}
ggplot(aes(x = (Month%%12), y = Temp), data = Mitchell) +
geom_point()
```
### Understanding Noise: Age to Age Months
Notes:
```{r Understanding Noise: Age to Age Months}
pf$age_with_months <- (pf$age) + (1 - (pf$dob_month/12))
head(pf)
```
***
### Age with Months Means
```{r Age with Months Means}
library(dplyr)
age_with_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarize(
age_with_months,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n()
)
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)
head(pf.fc_by_age_months)
```
### Noise in Conditional Means
```{r Noise in Conditional Means}
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) +
geom_line()
```
***
### Smoothing Conditional Means
Notes:
```{r Smoothing Conditional Means}
library(gridExtra)
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
geom_line() +
geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
geom_line() +
geom_smooth()
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
geom_line(stat = 'summary', fun.y = 'mean')
grid.arrange(p1, p2, p3)
```
***
### Which Plot to Choose?
Notes:
Make multiple plots during the exploritory phase and then refine them down into the best plots for distribution.
### Analyzing Two Variables
Reflection:
Making multiple plots can show different features of the data. Also while summaries and correlations are good for a lot of things they are not always the best at portraying the data.
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!