Initial Commit with Project Code

This commit is contained in:
Dusty.P 2018-04-17 19:56:59 -08:00
parent bea57818a2
commit 6f865b5ff5
16 changed files with 707837 additions and 0 deletions


@ -0,0 +1,12 @@
Title
========================================================
This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).
When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.

261
lesson2/demystifying.R Normal file

@ -0,0 +1,261 @@
# The goal of this file is to introduce you to the
# R programming language. Let's start by unraveling a
# little mystery!
# 1. Run the code below to create the vector 'udacious'.
# You need to highlight all of the lines of the code and then
# run it. You should see "udacious" appear in the workspace.
udacious <- c("Chris Saden", "Lauren Castellano",
"Sarah Spikes","Dean Eckles",
"Andy Brown", "Moira Burke",
"Kunal Chawla")
# You should see something like "chr[1:7]" in the 'Environment'
# or 'Workspace' tab. This is because you created a 'vector' with
# 7 names that have a 'type' of character. The arrow-like
# '<-' symbol is the assignment operator in R, similar to the
# equal sign '=' in other programming languages. The c() is a
# generic function that combines arguments, in this case the
# names of people, to form a vector.
# A 'vector' is one of the data types in R. Vectors must contain
# the same type of data, that is the entries must all be of the
# same type: character (most programmers call these strings),
# logical (TRUE or FALSE), or numeric.
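# A minimal sketch of what "the same type" means in practice (the vector
# name 'mixed' is made up for illustration): when you combine values of
# different types, c() converts them all to one common type.
mixed <- c(1, "two", TRUE)
mixed
class(mixed)
# Every entry is now a character string, so class(mixed) is "character".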
# Print out the vector udacious by running this next line of code.
udacious
# Notice how there are numbers next to the output.
# Each number corresponds to the index of the entry in the vector.
# Chris Saden is the first entry so [1]
# Dean Eckles is the fourth entry so [4]
# Kunal Chawla is the seventh entry so [7]
# Depending on the size of your window you may see different numbers
# in the output.
# ANOTHER HELPFUL TIP: You can add values to a vector.
# Run each line of code one at a time below to see what is happening.
numbers <- c(1:10)
numbers
numbers <- c(numbers, 11:20)
numbers
# 2. Replace YOUR_NAME with your actual name in the vector
# 'udacious' and run the code. Be sure to use quotes around it.
udacious <- c("Chris Saden", "Lauren Castellano",
"Sarah Spikes","Dean Eckles",
"Andy Brown", "Moira Burke",
"Kunal Chawla", YOUR_NAME)
# Notice how R updates 'udacious' in the workspace.
# It should now say something like 'chr[1:8]'.
# 3. Run the following two lines of code. You can highlight both lines
# of code and run them.
mystery = nchar(udacious)
mystery
# You just created a new vector called mystery. What do you
# think is in this vector? (scroll down for the answer)
# Mystery is a vector that contains the number of characters
# for each of the names in udacious, including your name.
# 4. Run this next line of code.
mystery == 11
# Here we get a logical (or boolean) vector that tells us
# which locations or indices in the vector contain a name
# that has exactly 11 characters.
# 5. Let's use this boolean vector, mystery == 11, to subset our
# udacious vector. What do you think the result will be when
# running the line of code below?
# Think about the output before you run this next line of code.
# Notice how there are brackets in the code. Brackets are often
# used in R for subsetting.
udacious[mystery == 11]
# Scroll down for the answer
# It's your Udacious Instructors for the course!
# (and you may be in the output if you're lucky enough
# to have 11 characters in YOUR_NAME) Either way, we
# think you're pretty udacious for taking this course.
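# The same idea on a small made-up vector (the names 'fruits', 'lens', and
# 'is_short' are hypothetical and not part of the course code). This is a
# sketch of how a logical vector picks out entries.
fruits <- c("apple", "fig", "banana", "kiwi")
lens <- nchar(fruits)     # number of characters in each entry
is_short <- lens <= 4     # logical vector, one TRUE/FALSE per entry
fruits[is_short]          # keeps only the entries where is_short is TRUE
which(is_short)           # the indices of the TRUE entries
sum(is_short)             # how many entries are TRUE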
# 6. Alright, all mystery aside...let's dive into some data!
# The R installation has a few datasets already built into it
# that you can play with. Right now, you'll load one of these,
# which is named mtcars.
# Run this next command to load the mtcars data.
data(mtcars)
# You should see mtcars appear in the 'Environment' tab with
# <Promise> listed next to it.
# The object (mtcars) appears as a 'Promise' object in the
# workspace until we run some code that uses the object.
# R has stored the mtcars data into a spreadsheet-like object
# called a data frame. Run the next command to see what variables
# are in the data set and to fully load the data set as an
# object in R. You should see <Promise> disappear when you
# run the next line of code.
# Visit http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects
# if you want the expert insight on Promise objects. You won't
# need the info on Promise objects to be successful in this course.
names(mtcars)
# names(mtcars) should output all the variable
# names in the data set. You might notice that the car names
# are not a variable in the data set. The car names have been saved
# as row names. More on this later.
# You should also see how many observations (obs.) are in the
# data frame and the number of variables on each observation.
# 7. To get more information on the data set and the variables
# run this next line of code.
?mtcars
# You can type a '?' before any command or a data set to learn
# more about it. The details and documentation will appear in
# the 'Help' tab.
# 8. To print out the data, run this next line as code.
mtcars
# Scroll up and down in the console to check out the data.
# This is the entire data frame printed out.
# 9. Run these next two functions, one at a time,
# and see if you can figure out what they do.
str(mtcars)
dim(mtcars)
# Scroll down for the answer.
# The first command, str(mtcars), gives us the structure of the
# data frame. It lists the variable names, the type of each variable
# (all of these variables are numerics) and some values for each
# variable.
# The second command, dim(mtcars), should output '[1] 32 11'
# to the console: 32 rows (observations) and 11 columns (variables).
# The [1] indicates that 32 is the first value in the output.
# R uses 1 to start indexing (AND NOT ZERO BASED INDEXING as is true
# of many other programming languages.)
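# A short sketch of one-based indexing using the dim() output (the name
# 'car_dims' is made up for illustration).
car_dims <- dim(mtcars)
car_dims[1]     # first value: the number of rows (observations)
car_dims[2]     # second value: the number of columns (variables)
nrow(mtcars)    # same as car_dims[1]
ncol(mtcars)    # same as car_dims[2]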
# 10. Read the documentation for row.names if you want to know more.
?row.names
# Run this code to see the current row names in the data frame.
row.names(mtcars)
# Run this code to change the row names of the cars to numbers.
row.names(mtcars) <- c(1:32)
# Now print out the data frame by running the code below.
mtcars
# It's tedious to relabel our data frame with the right car names
# so let's reload the data set and print out the first ten rows.
data(mtcars)
head(mtcars, 10)
# The head() function prints out the first six rows of a data frame
# by default. Run the code below to see.
head(mtcars)
# I think you'll know what this does.
tail(mtcars, 3)
# 11. We've run nine commands so far:
# c, nchar, data, str, dim, names, row.names, head, and tail.
# All of these commands took some inputs or arguments.
# To determine if a command takes more arguments or to learn
# about any default settings, you can look up the documentation
# using '?' before the command, much like you did to learn about
# the mtcars data set and the row.names
# 12. Let's examine our car data more closely. We can access an
# individual variable (or column) from the data frame using
# the '$' sign. Run the code below to print out the variable
# miles per gallon. This is the mpg column in the data frame.
mtcars$mpg
# Print out any two other variables to the console.
# This is a vector containing the mpg (miles per gallon) of
# the 32 cars. Run this next line of code to get the average mpg
# for all the cars. What is it?
# Enter this number for the quiz on the Udacity website.
# https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129314/m-830829287
mean(mtcars$mpg)

179
lesson2/demystifyingR2.Rmd Normal file

@ -0,0 +1,179 @@
Demystifying R Part 2
========================================================
You might see a warning message just above this file. Something like...
"R Markdown requires the knitr package (version 1.2 or higher)"
Don't worry about this for now. We'll address it at the end of this file.
1. Run the following command to see what it does.
```{r}
summary(mtcars)
```
If you know about quantiles, then the output should look familiar.
If not, you probably recognize the min (minimum), median, mean, and max (maximum).
We'll go over quantiles in Lesson 3 so don't worry if the output seems overwhelming.
The str() and summary() functions are helpful commands when working with a new data set.
The str() function gives us the variable names and their types.
The summary() function gives us an idea of the values a variable can take on.
2. In 2013, the average mpg (miles per gallon) for a car was 23 mpg.
The car models in the mtcars data set come from the years 1973-1974.
Subset the data so that you create a new data frame that contains
cars that get 23 or more mpg (miles per gallon). Save it to a new data
frame called efficient.
```{r}
```
3. How many cars get more than 23 mpg? Use one of the commands you
learned in demystifying.R to answer this question.
```{r}
```
4. We can also use logical operators to find out which car(s) get greater
than 30 miles per gallon (mpg) and have more than 100 raw horsepower.
```{r}
subset(mtcars, mpg > 30 & hp > 100)
```
There's only one car that gets more than 30 mpg and has more than 100 hp.
5. What do you think this code does? Scroll down for the answer.
```{r}
subset(mtcars, mpg < 14 | disp > 390)
```
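Note: You may be familiar with the || operator in Java. When working with vectors, R uses
a single & for the element-wise logical operator AND and a single | for the element-wise
logical operator OR. (R also has && and ||, but those are meant for single TRUE/FALSE values.)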
The command above creates a data frame of cars that have mpg less than 14
OR a displacement of more than 390. Only one of the conditions for a car
needs to be satisfied so that the car makes it into the subset. Any of the
cars that fit the criteria are printed to the console.
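As a sketch of what subset() is doing here, the same rows can be picked with a
logical vector inside brackets, much like the udacious[mystery == 11] example in
demystifying.R. The names is_thirsty, by_subset, and by_brackets are made up for
illustration.

```r
# One TRUE/FALSE per car: TRUE when either condition holds.
is_thirsty <- mtcars$mpg < 14 | mtcars$disp > 390

by_subset <- subset(mtcars, mpg < 14 | disp > 390)
by_brackets <- mtcars[is_thirsty, ]   # keep the TRUE rows, all columns

# Both approaches should select the same cars.
identical(rownames(by_subset), rownames(by_brackets))
```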
Now you try some.
6. Print the cars that have a 1/4 mile time (qsec) less than or equal to
16.90 seconds to the console.
```{r}
```
7. Save the subset of cars that weigh under 2000 pounds (weight is measured in lb/1000)
to a variable called lightCars. Print the number of cars and the subset to the console.
```{r}
```
8. You can also create new variables in a data frame. Let's say you wanted
to have the year of each car's model. We can create the variable
mtcars$year. Here we'll assume that all of the models were from 1974.
Run the code below.
```{r}
mtcars$year <- 1974
```
Notice how the number of variables changed in the work space. You can
also see the result by double clicking on mtcars in the workspace and
examining the data in a table.
To drop a variable, subset the data frame and select the variable you
want to drop with a negative sign in front of it.
```{r}
mtcars <- subset(mtcars, select = -year)
```
Notice, we are back to 11 variables in the data frame.
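Another common way to drop a variable is to assign NULL to it. The sketch below
works on a copy (the names cars_copy and n_before are made up for illustration)
so that mtcars itself is left alone.

```r
cars_copy <- mtcars
n_before <- ncol(cars_copy)

cars_copy$year <- 1974        # add the variable
ncol(cars_copy)               # one more column than before

cars_copy$year <- NULL        # drop it again
ncol(cars_copy) == n_before   # back to the original number of columns
```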
9. What do you think this code does? Run it to find out.
```{r}
mtcars$year <- c(1973, 1974)
```
Open the table of values to see what values year takes on.
Drop the year variable from the data set.
```{r}
```
10. Now you are going to get a preview of ifelse(). For those new
to programming this example may be confusing. See if you can understand
the code by running the commands one line at a time. Read the output and
make sense of what the code is doing at each step.
If you are having trouble don't worry, we will review the ifelse statement
at the end of Lesson 3. You won't be quizzed on it, and it's not essential
to keep going in this course. We just want you to try to get familiar with
more code.
```{r}
mtcars$wt
cond <- mtcars$wt < 3
cond
mtcars$weight_class <- ifelse(cond, 'light', 'average')
mtcars$weight_class
cond <- mtcars$wt > 3.5
mtcars$weight_class <- ifelse(cond, 'heavy', mtcars$weight_class)
mtcars$weight_class
```
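If the steps above are hard to follow on 32 cars, here is the same two-step
pattern on three made-up weights (the names wt_demo and class_demo are
hypothetical and not part of the course code).

```r
wt_demo <- c(2.2, 3.2, 3.8)

# Step 1: anything under 3 is 'light', everything else starts as 'average'.
class_demo <- ifelse(wt_demo < 3, 'light', 'average')

# Step 2: anything over 3.5 becomes 'heavy'; other entries keep their label.
class_demo <- ifelse(wt_demo > 3.5, 'heavy', class_demo)
class_demo
```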
You have some variables in your workspace or environment like 'cond' and
efficient. You want to be careful that you don't bring in too much data
into R at once since R will hold all the data in working memory. We have
nothing to worry about here, but let's delete those variables from the
work space.
```{r}
rm(cond)
rm(efficient)
```
Save this file if you haven't done so yet.
You'll have the opportunity to create one Rmd file for the final project in
this class and submit the Rmd file and knitted output (or HTML file). You'll
need the knitr package to do that so let's install that now. **Uncomment** the
following two lines of code and run them.
```{r}
# install.packages('knitr', dependencies = TRUE)
# library(knitr)
```
Once you've installed knitr, **comment** out the two lines of code above.
When you click the **Knit HTML** button a web page will be generated that
includes both content (text and text formatting from Markdown) as well as
the output of any embedded R code chunks within the document.
You've reached the end of the file so now it's time to write some code to
answer a question to continue on in Lesson 2.
Which car(s) have an mpg (miles per gallon) greater than or equal to 30
OR hp (horsepower) less than 60? Create an R chunk of code to answer the question.
Once you have the answer, go to the [Udacity website](https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129319/m-811719066) to continue with Lesson 2.
Note: You use brackets around text followed by two parentheses to create a link.
There must be no spaces between the brackets and the parentheses. Paste or type
the link into the parentheses. This also works on the discussions!
And if you want to see all of your HARD WORK from this file, click
the **KNIT HTML** button now. (You may or may not need to restart R).
# CONGRATULATIONS
#### You'll be exploring data soon with your new knowledge of R.

1
lesson2/reddit.csv Normal file

File diff suppressed because one or more lines are too long

51
lesson2/stateData.csv Normal file

@ -0,0 +1,51 @@
"","state.abb","state.area","state.region","population","income","illiteracy","life.exp","murder","highSchoolGrad","frost","area"
"Alabama","AL","51609","2","3615","3624","2.1","69.05","15.1","41.3","20","50708"
"Alaska","AK","589757","4","365","6315","1.5","69.31","11.3","66.7","152","566432"
"Arizona","AZ","113909","4","2212","4530","1.8","70.55","7.8","58.1","15","113417"
"Arkansas","AR","53104","2","2110","3378","1.9","70.66","10.1","39.9","65","51945"
"California","CA","158693","4","21198","5114","1.1","71.71","10.3","62.6","20","156361"
"Colorado","CO","104247","4","2541","4884","0.7","72.06","6.8","63.9","166","103766"
"Connecticut","CT","5009","1","3100","5348","1.1","72.48","3.1","56","139","4862"
"Delaware","DE","2057","2","579","4809","0.9","70.06","6.2","54.6","103","1982"
"Florida","FL","58560","2","8277","4815","1.3","70.66","10.7","52.6","11","54090"
"Georgia","GA","58876","2","4931","4091","2","68.54","13.9","40.6","60","58073"
"Hawaii","HI","6450","4","868","4963","1.9","73.6","6.2","61.9","0","6425"
"Idaho","ID","83557","4","813","4119","0.6","71.87","5.3","59.5","126","82677"
"Illinois","IL","56400","3","11197","5107","0.9","70.14","10.3","52.6","127","55748"
"Indiana","IN","36291","3","5313","4458","0.7","70.88","7.1","52.9","122","36097"
"Iowa","IA","56290","3","2861","4628","0.5","72.56","2.3","59","140","55941"
"Kansas","KS","82264","3","2280","4669","0.6","72.58","4.5","59.9","114","81787"
"Kentucky","KY","40395","2","3387","3712","1.6","70.1","10.6","38.5","95","39650"
"Louisiana","LA","48523","2","3806","3545","2.8","68.76","13.2","42.2","12","44930"
"Maine","ME","33215","1","1058","3694","0.7","70.39","2.7","54.7","161","30920"
"Maryland","MD","10577","2","4122","5299","0.9","70.22","8.5","52.3","101","9891"
"Massachusetts","MA","8257","1","5814","4755","1.1","71.83","3.3","58.5","103","7826"
"Michigan","MI","58216","3","9111","4751","0.9","70.63","11.1","52.8","125","56817"
"Minnesota","MN","84068","3","3921","4675","0.6","72.96","2.3","57.6","160","79289"
"Mississippi","MS","47716","2","2341","3098","2.4","68.09","12.5","41","50","47296"
"Missouri","MO","69686","3","4767","4254","0.8","70.69","9.3","48.8","108","68995"
"Montana","MT","147138","4","746","4347","0.6","70.56","5","59.2","155","145587"
"Nebraska","NE","77227","3","1544","4508","0.6","72.6","2.9","59.3","139","76483"
"Nevada","NV","110540","4","590","5149","0.5","69.03","11.5","65.2","188","109889"
"New Hampshire","NH","9304","1","812","4281","0.7","71.23","3.3","57.6","174","9027"
"New Jersey","NJ","7836","1","7333","5237","1.1","70.93","5.2","52.5","115","7521"
"New Mexico","NM","121666","4","1144","3601","2.2","70.32","9.7","55.2","120","121412"
"New York","NY","49576","1","18076","4903","1.4","70.55","10.9","52.7","82","47831"
"North Carolina","NC","52586","2","5441","3875","1.8","69.21","11.1","38.5","80","48798"
"North Dakota","ND","70665","3","637","5087","0.8","72.78","1.4","50.3","186","69273"
"Ohio","OH","41222","3","10735","4561","0.8","70.82","7.4","53.2","124","40975"
"Oklahoma","OK","69919","2","2715","3983","1.1","71.42","6.4","51.6","82","68782"
"Oregon","OR","96981","4","2284","4660","0.6","72.13","4.2","60","44","96184"
"Pennsylvania","PA","45333","1","11860","4449","1","70.43","6.1","50.2","126","44966"
"Rhode Island","RI","1214","1","931","4558","1.3","71.9","2.4","46.4","127","1049"
"South Carolina","SC","31055","2","2816","3635","2.3","67.96","11.6","37.8","65","30225"
"South Dakota","SD","77047","3","681","4167","0.5","72.08","1.7","53.3","172","75955"
"Tennessee","TN","42244","2","4173","3821","1.7","70.11","11","41.8","70","41328"
"Texas","TX","267339","2","12237","4188","2.2","70.9","12.2","47.4","35","262134"
"Utah","UT","84916","4","1203","4022","0.6","72.9","4.5","67.3","137","82096"
"Vermont","VT","9609","1","472","3907","0.6","71.64","5.5","57.1","168","9267"
"Virginia","VA","40815","2","4981","4701","1.4","70.08","9.5","47.8","85","39780"
"Washington","WA","68192","4","3559","4864","0.6","71.72","4.3","63.5","32","66570"
"West Virginia","WV","24181","2","1799","3617","1.4","69.48","6.7","41.6","100","24070"
"Wisconsin","WI","56154","3","4589","4468","0.7","72.48","3","54.5","149","54464"
"Wyoming","WY","97914","4","376","4566","0.6","70.29","6.9","62.9","173","97203"

283
lesson3/lesson3_student.rmd Normal file

@ -0,0 +1,283 @@
Lesson 3
========================================================
***
### What to Do First?
Notes:
***
### Pseudo-Facebook User Data
Notes:
```{r Pseudo-Facebook User Data}
```
***
### Histogram of Users' Birthdays
Notes:
```{r Histogram of Users\' Birthdays}
install.packages('ggplot2')
library(ggplot2)
```
***
#### What are some things that you notice about this histogram?
Response:
***
### Moira's Investigation
Notes:
***
### Estimating Your Audience Size
Notes:
***
#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
Response:
#### How many of your friends do you think saw that post?
Response:
#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
Response:
***
### Perceived Audience Size
Notes:
***
### Faceting
Notes:
```{r Faceting}
```
#### Let's take another look at our plot. What stands out to you here?
Response:
***
### Be Skeptical - Outliers and Anomalies
Notes:
***
### Moira's Outlier
Notes:
#### Which case do you think applies to Moira's outlier?
Response:
***
### Friend Count
Notes:
#### What code would you enter to create a histogram of friend counts?
```{r Friend Count}
```
#### How is this plot similar to Moira's first plot?
Response:
***
### Limiting the Axes
Notes:
```{r Limiting the Axes}
```
### Exploring with Bin Width
Notes:
***
### Adjusting the Bin Width
Notes:
### Faceting Friend Count
```{r Faceting Friend Count}
# What code would you add to facet the histogram by gender?
# Add it to the code below.
qplot(x = friend_count, data = pf, binwidth = 10) +
scale_x_continuous(limits = c(0, 1000),
breaks = seq(0, 1000, 50))
```
***
### Omitting NA Values
Notes:
```{r Omitting NA Values}
```
***
### Statistics 'by' Gender
Notes:
```{r Statistics \'by\' Gender}
```
#### Who on average has more friends: men or women?
Response:
#### What's the difference between the median friend count for women and men?
Response:
#### Why would the median be a better measure than the mean?
Response:
***
### Tenure
Notes:
```{r Tenure}
```
***
#### How would you create a histogram of tenure by year?
```{r Tenure Histogram by Year}
```
***
### Labeling Plots
Notes:
```{r Labeling Plots}
```
***
### User Ages
Notes:
```{r User Ages}
```
#### What do you notice?
Response:
***
### The Spread of Memes
Notes:
***
### Lada's Money Bag Meme
Notes:
***
### Transforming Data
Notes:
***
### Add a Scaling Layer
Notes:
```{r Add a Scaling Layer}
```
***
### Frequency Polygons
```{r Frequency Polygons}
```
***
### Likes on the Web
Notes:
```{r Likes on the Web}
```
***
### Box Plots
Notes:
```{r Box Plots}
```
#### Adjust the code to focus on users who have friend counts between 0 and 1000.
```{r}
```
***
### Box Plots, Quartiles, and Friendships
Notes:
```{r Box Plots, Quartiles, and Friendships}
```
#### On average, who initiated more friendships in our sample: men or women?
Response:
#### Write about some ways that you can verify your answer.
Response:
```{r Friend Requests by Gender}
```
Response:
***
### Getting Logical
Notes:
```{r Getting Logical}
```
Response:
***
### Analyzing One Variable
Reflection:
***
Click **Knit HTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!

99004
lesson3/pseudo_facebook.tsv Normal file

File diff suppressed because it is too large Load Diff

Binary file not shown.


268
lesson4/lesson4_student.rmd Normal file

@ -0,0 +1,268 @@
Lesson 4
========================================================
***
### Scatterplots and Perceived Audience Size
Notes:
***
### Scatterplots
Notes:
```{r Scatterplots}
```
***
#### What are some things that you notice right away?
Response:
***
### ggplot Syntax
Notes:
```{r ggplot Syntax}
```
***
### Overplotting
Notes:
```{r Overplotting}
```
#### What do you notice in the plot?
Response:
***
### Coord_trans()
Notes:
```{r Coord_trans()}
```
#### Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
```{r}
```
#### What do you notice?
***
### Alpha and Jitter
Notes:
```{r Alpha and Jitter}
```
***
### Overplotting and Domain Knowledge
Notes:
***
### Conditional Means
Notes:
```{r Conditional Means}
```
Create your plot!
```{r Conditional Means Plot}
```
***
### Overlaying Summaries with Raw Data
Notes:
```{r Overlaying Summaries with Raw Data}
```
#### What are some of your observations of the plot?
Response:
***
### Moira: Histogram Summary and Scatterplot
See the Instructor Notes of this video to download Moira's paper on perceived audience size and to see the final plot.
Notes:
***
### Correlation
Notes:
```{r Correlation}
```
Look up the documentation for the cor.test function.
What's the correlation between age and friend count? Round to three decimal places.
Response:
***
### Correlation on Subsets
Notes:
```{r Correlation on Subsets}
with( , cor.test(age, friend_count))
```
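A sketch of the with() pattern on the built-in mtcars data, since the
pseudo-Facebook subset is left for you to fill in. The cutoff hp < 200 and the
name small_engine_test are made up for illustration.

```r
# with() evaluates cor.test() inside the data frame, so columns can be
# referred to by name without the '$' sign.
small_engine_test <- with(subset(mtcars, hp < 200), cor.test(mpg, wt))
small_engine_test$estimate   # the correlation coefficient
small_engine_test$p.value    # the p-value of the test
```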
***
### Correlation Methods
Notes:
***
## Create Scatterplots
Notes:
```{r}
```
***
### Strong Correlations
Notes:
```{r Strong Correlations}
```
What's the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
```{r Correlation Calculation}
```
Response:
***
### Moira on Correlation
Notes:
***
### More Caution with Correlation
Notes:
```{r More Caution With Correlation}
install.packages('alr3')
library(alr3)
```
Create your plot!
```{r Temp vs Month}
```
***
### Noisy Scatterplots
a. Take a guess for the correlation coefficient for the scatterplot.
b. What is the actual correlation of the two variables?
(Round to the thousandths place)
```{r Noisy Scatterplots}
```
***
### Making Sense of Data
Notes:
```{r Making Sense of Data}
```
***
### A New Perspective
What do you notice?
Response:
Watch the solution video and check out the Instructor Notes!
Notes:
***
### Understanding Noise: Age to Age Months
Notes:
```{r Understanding Noise: Age to Age Months}
```
***
### Age with Months Means
```{r Age with Months Means}
```
Programming Assignment
```{r Programming Assignment}
```
***
### Noise in Conditional Means
```{r Noise in Conditional Means}
```
***
### Smoothing Conditional Means
Notes:
```{r Smoothing Conditional Means}
```
***
### Which Plot to Choose?
Notes:
***
### Analyzing Two Variables
Reflection:
***
Click **Knit HTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!

253
lesson5/lesson5_student.rmd Normal file

@ -0,0 +1,253 @@
Lesson 5
========================================================
### Multivariate Data
Notes:
***
### Moira Perceived Audience Size Colored by Age
Notes:
***
### Third Qualitative Variable
Notes:
```{r Third Qualitative Variable}
# A histogram takes only one variable, so a box plot is used to show
# age (y) for each gender (x).
ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) + geom_boxplot()
```
***
### Plotting Conditional Summaries
Notes:
```{r Plotting Conditional Summaries}
```
***
### Thinking in Ratios
Notes:
***
### Wide and Long Format
Notes:
***
### Reshaping Data
Notes:
```{r}
install.packages('reshape2')
library(reshape2)
```
***
### Ratio Plot
Notes:
```{r Ratio Plot}
```
***
### Third Quantitative Variable
Notes:
```{r Third Quantitative Variable}
```
***
### Cut a Variable
Notes:
```{r Cut a Variable}
```
***
### Plotting it All Together
Notes:
```{r Plotting it All Together}
```
***
### Plot the Grand Mean
Notes:
```{r Plot the Grand Mean}
```
***
### Friending Rate
Notes:
```{r Friending Rate}
```
***
### Friendships Initiated
Notes:
What is the median friend rate?
What is the maximum friend rate?
```{r Friendships Initiated}
```
***
### Bias-Variance Tradeoff Revisited
Notes:
```{r Bias-Variance Tradeoff Revisited}
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
```
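The x aesthetics above group tenure into bins by rounding to the nearest
multiple of 7, 30, or 90 days. A small sketch of that arithmetic on made-up
tenure values (tenure_demo is a hypothetical name):

```r
tenure_demo <- c(1, 3, 4, 10, 11, 100)

7 * round(tenure_demo / 7)     # nearest multiple of 7
30 * round(tenure_demo / 30)   # nearest multiple of 30
90 * round(tenure_demo / 90)   # nearest multiple of 90
```

Wider bins average over more users, which gives a smoother line at the cost of detail.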
***
### Sean's NFL Fan Sentiment Study
Notes:
***
### Introducing the Yogurt Data Set
Notes:
***
### Histograms Revisited
Notes:
```{r Histograms Revisited}
```
***
### Number of Purchases
Notes:
```{r Number of Purchases}
```
***
### Prices over Time
Notes:
```{r Prices over Time}
```
***
### Sampling Observations
Notes:
***
### Looking at Samples of Households
```{r Looking at Sample of Households}
```
***
### The Limits of Cross Sectional Data
Notes:
***
### Many Variables
Notes:
***
### Scatterplot Matrix
Notes:
***
### Even More Variables
Notes:
***
### Heat Maps
Notes:
```{r}
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
```
```{r}
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
```
***
### Analyzing Three or More Variables
Reflection:
***
Click **Knit HTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!

6830
lesson5/nci.tsv Normal file

File diff suppressed because it is too large Load Diff

Binary file not shown.

2381
lesson5/yogurt.csv Normal file

File diff suppressed because it is too large Load Diff

598025
lesson6/diamondsbig.csv Normal file

File diff suppressed because it is too large Load Diff

Binary file not shown.

289
lesson6/lesson6_student.rmd Normal file

@ -0,0 +1,289 @@
Lesson 6
========================================================
### Welcome
Notes:
***
### Scatterplot Review
```{r Scatterplot Review}
```
***
### Price and Carat Relationship
Response:
***
### Frances Gerety
Notes:
#### A diamond is
***
### The Rise of Diamonds
Notes:
***
### ggpairs Function
Notes:
```{r ggpairs Function}
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
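# Recent GGally versions no longer accept `params`; pass plot options with wrap()
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))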
```
What are some things you notice in the ggpairs output?
Response:
***
### The Demand of Diamonds
Notes:
```{r The Demand of Diamonds}
```
***
### Connecting Demand and Price Distributions
Notes:
***
### Scatterplot Transformation
```{r Scatterplot Transformation}
```
### Create a new function to transform the carat variable
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
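A quick way to sanity-check a custom transformation is to confirm that its inverse undoes its forward mapping. The sketch below assumes the scales package is installed. It rebuilds the same cube-root transformation under a hypothetical name, `cuberoot_check`, so that it does not overwrite the function defined above.

```{r}
library(scales)

# Hypothetical copy of the cube-root transformation, used only for this check
cuberoot_check <- trans_new('cuberoot_check',
                            transform = function(x) x^(1/3),
                            inverse = function(x) x^3)

carats <- c(0.2, 0.5, 1, 2, 3)
transformed <- cuberoot_check$transform(carats)
recovered <- cuberoot_check$inverse(transformed)

# TRUE when the round trip recovers the original values
# up to floating-point error
all.equal(recovered, carats)
```

If the round trip did not recover the inputs, the axis breaks drawn by `scale_x_continuous()` would not line up with the carat values they are labeled with.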
#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Overplotting Revisited
```{r Sort and Head Tables}
```
```{r Overplotting Revisited}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Other Qualitative Factors
Notes:
***
### Price vs. Carat and Clarity
Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Clarity and Price
Response:
***
### Price vs. Carat and Cut
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Cut and Price
Response:
***
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
```
***
### Color and Price
Response:
***
### Linear Models in R
Notes:
Response:
***
### Building the Linear Model
Notes:
```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```
Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with our exploration earlier.
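One way to check this observation directly is to compare R-squared before and after adding cut. The chunk below is a minimal sketch, assuming ggplot2 is installed so that its `diamonds` data set is available. `fit_base` and `fit_cut` are hypothetical names chosen so that they do not clash with `m1` through `m5`.

```{r}
library(ggplot2)  # provides the diamonds data set

# Same terms as m2 above: cube root of carat plus carat
fit_base <- lm(I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)

# Add cut, as m3 does
fit_cut <- update(fit_base, ~ . + cut)

r2_base <- summary(fit_base)$r.squared
r2_cut <- summary(fit_cut)$r.squared

# Gain in explained variance from adding cut
r2_cut - r2_base
```

Because the models are nested, R-squared cannot go down when a term is added. The size of the gain is what indicates how much cut contributes.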
***
### Model Problems
Video Notes:
Research:
(Take some time to come up with 2-4 problems for the model)
(You should spend 10-20 min on this)
Response:
***
### A Bigger, Better Data Set
Notes:
```{r A Bigger, Better Data Set}
install.packages('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
```
The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data
### Building a Model Using the Big Diamonds Data Set
Notes:
```{r Building a Model Using the Big Diamonds Data Set}
```
***
### Predictions
Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601
```{r}
# Be sure you've loaded the memisc library and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% prediction interval.
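Because the model's response is `log(price)`, the values returned by `predict()` are on the log scale and need to be exponentiated before they can be compared with a dollar price. The chunk below is a minimal sketch of that step. It assumes ggplot2 is installed and fits a hypothetical stand-in model, `sketch_model`, on ggplot2's `diamonds` data, where this cut level is spelled "Very Good" rather than "V.Good". It is not the model fitted on the big diamonds data set.

```{r}
library(ggplot2)  # provides the diamonds data set

# Hypothetical stand-in for m5, fitted on ggplot2's diamonds for illustration
sketch_model <- lm(I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity,
                   data = diamonds)

# ggplot2's diamonds spells this cut level "Very Good"
sketch_diamond <- data.frame(carat = 1.00, cut = "Very Good",
                             color = "I", clarity = "VS1")

log_estimate <- predict(sketch_model, newdata = sketch_diamond,
                        interval = "prediction", level = .95)

# The model predicts log(price), so exponentiate to get dollars
exp(log_estimate)
```

The result has three columns: `fit` is the point estimate, and `lwr` and `upr` are the bounds of the 95% prediction interval.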
***
### Final Thoughts
Notes:
***
Click **Knit HTML** to see all of your hard work and to have an HTML
page of this lesson, your answers, and your notes!