Lesson 4

Scatterplots and Perceived Audience Size

Notes:

Scatterplots

Notes:

library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()

What are some things that you notice right away?

Response:

The data points fall into vertical lines because age is recorded in whole years, and younger users tend to have more friends.

ggplot Syntax

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point() +
  xlim(13, 90)

## Warning: Removed 4906 rows containing missing values (geom_point).

summary(pf$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00
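The rows dropped in the warning above are presumably the users whose age falls outside the xlim(13, 90) range; the summary shows a maximum age of 113. A minimal sketch of the same check, using a made-up ages vector rather than the real pf data:

```r
# Hypothetical ages; the real values live in pf$age.
ages <- c(13, 20, 28, 50, 90, 95, 113)

# xlim(13, 90) treats anything outside [13, 90] as missing,
# so these are the rows ggplot2 would warn about.
outside <- ages < 13 | ages > 90
n_removed <- sum(outside)
print(n_removed)  # 2
```

On the real data, sum(pf$age > 90) should give a count to compare against the warning.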

Build the plot one layer at a time so that errors are easier to find.

Overplotting

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_jitter(alpha = 1/20) +
  xlim(13, 90)

## Warning: Removed 5183 rows containing missing values (geom_point).

What do you notice in the plot?

Response:

The spike at age 69 is still clearly visible, and it is more obvious that friend count generally decreases as age increases.

coord_trans()

Notes:

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 4906 rows containing missing values (geom_point).

What do you notice?

First, coord_trans(y = "sqrt") does not work with plain geom_jitter(), because vertical jitter can push a friend count of 0 below zero, where the square root is undefined. Second, the data points near the bottom are spread out vertically, which brings them into focus.

To use jitter here, jitter only the ages (horizontally) so that a friend count of 0 can never become negative. To do this, pass position = position_jitter(h = 0) to geom_point().

ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20, position = position_jitter(h = 0)) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 5197 rows containing missing values (geom_point).
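Why jittering a count of 0 is a problem under a square-root transform can be seen without ggplot2. This is a sketch with made-up zeros and a jitter width of 0.4 chosen for illustration, not the width ggplot2 actually uses:

```r
set.seed(1)  # make the random jitter reproducible

# Ten users with a friend count of 0.
friend_count <- rep(0, 10)

# Vertical jitter adds uniform noise, so some zeros land below zero.
jittered <- friend_count + runif(10, min = -0.4, max = 0.4)
any_negative <- any(jittered < 0)

# sqrt() of a negative number is NaN, which cannot be placed on the axis.
transformed <- suppressWarnings(sqrt(jittered))
n_nan <- sum(is.nan(transformed))

# With no vertical jitter (h = 0), every value stays at 0 and sqrt() is fine.
safe <- sqrt(friend_count)
```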

Alpha and Jitter

Notes:

ggplot(aes(x = age, y = friendships_initiated, color = gender), data = pf) +
  geom_point(alpha = 1/10, position = position_jitter(h = 0)) +
  xlim(13, 90) +
  coord_trans(y = "sqrt")

## Warning: Removed 5189 rows containing missing values (geom_point).

Overplotting and Domain Knowledge

Notes:

Plotting values as a percentage of the whole.

Conditional Means

Notes:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
pf.fc_by_age <- arrange(pf.fc_by_age, age)

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line() +
  xlim(13,90)

## Warning: Removed 23 rows containing missing values (geom_path).
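The group_by() and summarise() step can be checked by hand on a tiny data frame. The sketch below uses base R tapply() and made-up values (toy is not part of the lesson data) to show what a conditional mean is: the mean of friend_count computed separately for each age.

```r
# Made-up data: two 13-year-olds and three 14-year-olds.
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 0, 5, 100)
)

# Mean, median, and group size of friend_count within each age.
fc_mean   <- tapply(toy$friend_count, toy$age, mean)
fc_median <- tapply(toy$friend_count, toy$age, median)
fc_n      <- tapply(toy$friend_count, toy$age, length)

print(fc_mean)    # 13: 20, 14: 35
print(fc_median)  # 13: 20, 14: 5
```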

Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point(alpha = 1/10, position = position_jitter(h = 0), color = 'orange') +
  xlim(13, 90) +
  # Note: the coord_cartesian() call at the end replaces this coord_trans(),
  # so the plot is drawn on an untransformed y axis.
  coord_trans(y = "sqrt") +
  # Mean (black), median (blue), and 10th/90th percentiles (dashed red) per age.
  # In ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.
  geom_line(stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = median, color = 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) +
  coord_cartesian(xlim = c(13,70), ylim = c(0,1000))

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 5182 rows containing missing values (geom_point).

What are some of your observations of the plot?

Response:

I notice that the median is always lower than the mean and that the median is closer to the center of the main body of data points. The data appear to be long-tailed toward high counts, which pulls the mean upward.
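The gap between the mean and the median is what a long right tail produces. A small sketch with made-up counts (not taken from pf): one very large value drags the mean far above the median, and the 10% and 90% quantiles bracket the bulk of the data in the same way the dashed lines do.

```r
# Made-up counts with one heavy user in the tail.
counts <- c(0, 5, 10, 20, 50, 2000)

m   <- mean(counts)    # pulled upward by the 2000
med <- median(counts)  # stays with the bulk of the data

# The same quantiles used for the dashed red lines.
q10 <- quantile(counts, probs = 0.1)
q90 <- quantile(counts, probs = 0.9)

print(c(mean = m, median = med))  # mean 347.5, median 15
```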

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:

Correlation

Notes:

cor.test(pf$age, pf$friend_count)

## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response:

-0.027

Correlation on Subsets

Notes:

with(pf[pf$age <= 70,], cor.test(age, friend_count))

## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes:

http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
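cor() and cor.test() take a method argument ("pearson", "kendall", or "spearman"), and the choice matters when the relationship is monotonic but not linear. A sketch with made-up x and y:

```r
# y rises with x, but along a curve rather than a straight line.
x <- 1:10
y <- x^3

# Pearson measures linear association, so it falls short of 1 here.
r_pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship gives 1.
r_spearman <- cor(x, y, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```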

Create Scatterplots

Notes:

library(ggplot2)
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point()

Strong Correlations

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color = 'red')

## Warning: Removed 6075 rows containing non-finite values (stat_smooth).

## Warning: Removed 6075 rows containing missing values (geom_point).
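The xlim() and ylim() calls use quantile() to cut each axis at the 95th percentile, which is why the top 5% of values drop out of the plot. A sketch on a made-up vector:

```r
# Made-up values standing in for something like pf$likes_received.
x <- 1:100

# The 95th percentile: about 95% of the values sit at or below it.
cutoff <- quantile(x, 0.95)

kept    <- sum(x <= cutoff)  # points that stay inside the axis limits
dropped <- sum(x > cutoff)   # points removed, with a warning, by xlim()/ylim()

print(c(kept = kept, dropped = dropped))  # kept 95, dropped 5
```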

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(pf, cor.test(www_likes_received, likes_received))

## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response:

0.948. The correlation is strong because one variable (likes_received) is a superset of the other (www_likes_received).
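A correlation this strong is expected when one variable is built from the other. A sketch with made-up numbers, assuming (as the column names suggest) that total likes received are web likes plus mobile likes:

```r
# Made-up like counts for six users.
www_likes    <- c(0, 1, 5, 20, 100, 300)
mobile_likes <- c(2, 0, 3, 10, 40, 90)

# The total contains the web likes by construction.
total_likes <- www_likes + mobile_likes

r <- cor(www_likes, total_likes)
print(r)  # very close to 1
```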

Moira on Correlation

Notes:

High correlation can mean that two variables depend on the same underlying thing or measure similar things.

More Caution with Correlation

Notes:

#install.packages('alr3')
library(alr3)

## Loading required package: car

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(ggplot2)
data(Mitchell)
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point()

Create your plot!

Noisy Scatterplots

Take a guess for the correlation coefficient for the scatterplot.

0.9

What is the actual correlation of the two variables? (Round to the thousandths place)

with(Mitchell, cor.test(Month, Temp))

## 
##  Pearson's product-moment correlation
## 
## data:  Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes:

ggplot(aes(Month, Temp), data = Mitchell) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 204, 12))

A New Perspective

What do you notice? Response:

The data follow a cyclical pattern, going from low to high and back to low every 12 months. I saw this pattern the first time I looked at the plot, which is why I originally guessed a correlation coefficient of 0.9.

Watch the solution video and check out the Instructor Notes! Notes:

ggplot(aes(x = (Month%%12), y = Temp), data = Mitchell) +
  geom_point()
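A purely cyclical series shows why the linear correlation with Month is close to zero even though the pattern is strong. The sketch below uses a made-up sine wave with a 12-month period in place of the Mitchell temperatures:

```r
# 204 months (17 full years) of a made-up seasonal signal.
month <- 0:203
temp  <- sin(2 * pi * month / 12)

# Over whole cycles the ups and downs cancel, so Pearson's r is near 0.
r <- cor(month, temp)

# Folding months onto one year with %% recovers the seasonal pattern.
by_month <- tapply(temp, month %% 12, mean)
spread   <- diff(range(by_month))

print(round(r, 2))
print(round(by_month, 2))
```

The by_month means trace out one full cycle, which is the structure the Month %% 12 plot makes visible.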

Understanding Noise: Age to Age Months

Notes:

pf$age_with_months <- (pf$age) + (1 - (pf$dob_month/12))
head(pf)

##    userid age dob_day dob_year dob_month gender tenure friend_count
## 1 2094382  14      19     1999        11   male    266            0
## 2 1192601  14       2     1999        11 female      6            0
## 3 2083884  14      16     1999        11   male     13            0
## 4 1203168  14      25     1999        12 female     93            0
## 5 1733186  14       4     1999        12   male     82            0
## 6 1524765  14       1     1999        12   male     15            0
##   friendships_initiated likes likes_received mobile_likes
## 1                     0     0              0            0
## 2                     0     0              0            0
## 3                     0     0              0            0
## 4                     0     0              0            0
## 5                     0     0              0            0
## 6                     0     0              0            0
##   mobile_likes_received www_likes www_likes_received age_with_months
## 1                     0         0                  0        14.08333
## 2                     0         0                  0        14.08333
## 3                     0         0                  0        14.08333
## 4                     0         0                  0        14.00000
## 5                     0         0                  0        14.00000
## 6                     0         0                  0        14.00000
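The first rows of the output can be checked by hand. For a 14-year-old born in November (dob_month = 11), the formula gives 14 + (1 - 11/12) = 14.08333. The helper name below is made up for illustration:

```r
# Hypothetical helper wrapping the formula used for pf$age_with_months.
age_with_month_fraction <- function(age, dob_month) {
  age + (1 - dob_month / 12)
}

nov <- age_with_month_fraction(14, 11)  # born in November
dec <- age_with_month_fraction(14, 12)  # born in December

print(round(c(nov, dec), 5))  # 14.08333 14.00000
```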

Age with Months Means

library(dplyr)

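# Use a distinct name for the grouped data so it is not confused
# with the age_with_months column.
age_month_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarize(
  age_month_groups,
  friend_count_mean = mean(friend_count),
  friend_count_median = median(friend_count),
  n = n()
)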

pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)

head(pf.fc_by_age_months)

## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1            13.2              46.3                30.5     6
## 2            13.2             115.                 23.5    14
## 3            13.3             136.                 44.0    25
## 4            13.4             164.                 72.0    33
## 5            13.5             131.                 66.0    45
## 6            13.6             157.                 64.0    54

Noise in Conditional Means

ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months<71)) +
  geom_line()

Smoothing Conditional Means

Notes:

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) +
  geom_line() +
  geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun.y = 'mean')
grid.arrange(p1, p2, p3)

## `geom_smooth()` using method = 'loess'

## `geom_smooth()` using method = 'loess'
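The third plot trades detail for smoothness by rounding age to the nearest multiple of 5 before averaging, so each point is a mean over a wider bin with more users in it. A sketch of the binning on made-up ages:

```r
# Made-up ages, chosen so that ages / 5 never lands exactly on a half.
ages <- c(13, 17, 18, 22, 23, 27)

# Divide by 5, round to a whole number, and scale back up.
age_bin <- round(ages / 5) * 5
print(age_bin)  # 15 15 20 20 25 25

# Wider bins hold more observations, so their means are less noisy.
bin_sizes <- table(age_bin)
```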

Which Plot to Choose?

Notes:

Make multiple plots during the exploratory phase, then refine them down to the best plots for distribution.

Analyzing Two Variables

Reflection:

Making multiple plots can show different features of the data. Also, while summaries and correlations are useful for many things, they are not always the best way to portray the data.

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!