Initial Commit with Project Code

2018-04-17 19:56:59 -08:00 · 2018-04-17 19:56:59 -08:00 · 6f865b5ff5
commit 6f865b5ff5
parent bea57818a2
16 changed files with 707837 additions and 0 deletions
--- a/lesson2/What_is_a_RMD_file.Rmd
+++ b/lesson2/What_is_a_RMD_file.Rmd
@@ -0,0 +1,12 @@
+Title
+========================================================
+
+This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).
+
+
+
+
+
+
+
+When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.
--- a/lesson2/demystifying.R
+++ b/lesson2/demystifying.R
@@ -0,0 +1,261 @@
+# The goal of this file is to introduce you to the
+# R programming language. Let's start by unraveling a
+# little mystery!
+
+# 1. Run the code below to create the vector 'udacious'.
+# You need to highlight all of the lines of the code and then
+# run it. You should see "udacious" appear in the workspace.
+
+udacious <- c("Chris Saden", "Lauren Castellano",
+              "Sarah Spikes","Dean Eckles",
+              "Andy Brown", "Moira Burke",
+              "Kunal Chawla")
+
+# You should see something like "chr[1:7]" in the 'Environment'
+# or 'Workspace' tab. This is because you created a 'vector' with
+# 7 names that have a 'type' of character. The arrow-like
+# '<-' symbol is the assignment operator in R, similar to the
+# equal sign '=' in other programming languages. c() is a
+# generic function that combines arguments, in this case the
+# names of people, to form a vector.
+
+# A 'vector' is one of the data types in R. Vectors must contain
+# the same type of data, that is the entries must all be of the
+# same type: character (most programmers call these strings),
+# logical (TRUE or FALSE), or numeric.
+
+# Print out the vector udacious by running this next line of code.
+
+udacious
+
+# Notice the bracketed numbers at the start of each line of output.
+# Each one is the index of the first entry printed on that line.
+# Chris Saden is the first entry so [1]
+# Dean Eckles is the fourth entry so [4]
+# Kunal Chawla is the seventh entry so [7]
+
+# Depending on the size of your window you may see different numbers
+# in the output.
+
+# ANOTHER HELPFUL TIP: You can add values to a vector.
+# Run each line of code one at a time below to see what is happening.
+
+numbers <- c(1:10)
+
+numbers
+
+numbers <- c(numbers, 11:20)
+
+numbers
+
+
+# 2. Replace YOUR_NAME with your actual name in the vector
+# 'udacious' and run the code. Be sure to use quotes around it.
+
+udacious <- c("Chris Saden", "Lauren Castellano",
+              "Sarah Spikes","Dean Eckles",
+              "Andy Brown", "Moira Burke",
+              "Kunal Chawla", YOUR_NAME)
+
+# Notice how R updates 'udacious' in the workspace.
+# It should now say something like 'chr[1:8]'.
+
+# 3. Run the following two lines of code. You can highlight both lines
+# of code and run them.
+
+mystery = nchar(udacious)
+mystery
+
+# You just created a new vector called mystery. What do you
+# think is in this vector? (scroll down for the answer)
+
+
+
+
+
+
+
+
+# Mystery is a vector that contains the number of characters
+# for each of the names in udacious, including your name.
+
+# 4. Run this next line of code.
+
+mystery == 11
+
+# Here we get a logical (or boolean) vector that tells us
+# which locations or indices in the vector contain a name
+# that has exactly 11 characters.
+
+# 5. Let's use this boolean vector, mystery == 11, to subset our
+# udacious vector. What do you think the result will be when
+# running the line of code below?
+
+# Think about the output before you run this next line of code.
+# Notice how there are brackets in the code. Brackets are often
+# used in R for subsetting.
+
+udacious[mystery == 11]
+
+
+# Scroll down for the answer
+
+
+
+
+
+
+
+
+
+# It's your Udacious Instructors for the course!
+# (and you may be in the output if you're lucky enough
+# to have 11 characters in YOUR_NAME) Either way, we
+# think you're pretty udacious for taking this course.
+
+
+
+
+
+# 6. Alright, all mystery aside...let's dive into some data!
+# The R installation has a few datasets already built into it
+# that you can play with. Right now, you'll load one of these,
+# which is named mtcars.
+
+# Run this next command to load the mtcars data.
+
+data(mtcars)
+
+
+# You should see mtcars appear in the 'Environment' tab with
+# <Promise> listed next to it. 
+
+# The object (mtcars) appears as a 'Promise' object in the
+# workspace until we run some code that uses the object.
+
+# R has stored the mtcars data into a spreadsheet-like object
+# called a data frame. Run the next command to see what variables
+# are in the data set and to fully load the data set as an
+# object in R. You should see <Promise> disappear when you
+# run the next line of code.
+
+# Visit http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects
+# if you want the expert insight on Promise objects. You won't
+# need the info on Promise objects to be successful in this course.
+
+names(mtcars)
+
+# names(mtcars) should output all the variable
+# names in the data set. You might notice that the car names
+# are not a variable in the data set. The car names have been saved
+# as row names. More on this later.
+
+# You should also see how many observations (obs.) are in the
+# data frame and the number of variables for each observation.
+
+# 7. To get more information on the data set and the variables
+# run this next line of code.
+
+?mtcars
+
+# You can type a '?' before any command or a data set to learn
+# more about it. The details and documentation will appear in
+# the 'Help' tab.
+
+
+# 8. To print out the data, run this next line of code.
+
+mtcars
+
+# Scroll up and down in the console to check out the data.
+# This is the entire data frame printed out.
+
+# 9. Run these next two functions, one at a time,
+# and see if you can figure out what they do.
+
+str(mtcars)
+
+dim(mtcars)
+
+# Scroll down for the answer.
+
+
+
+
+
+
+
+
+
+# The first command, str(mtcars), gives us the structure of the
+# data frame. It lists the variable names, the type of each variable
+# (all of these variables are numerics) and some values for each
+# variable.
+
+
+# The second command, dim(mtcars), should output '[1] 32 11'
+# to the console. The [1] indicates that 32 is the first value
+# in the output.
+
+# R uses 1 to start indexing (AND NOT ZERO BASED INDEXING as is true
+# of many other programming languages.)
+
+# 10. Read the documentation for row.names if you want to know more.
+?row.names
+
+# Run this code to see the current row names in the data frame.
+row.names(mtcars)
+
+# Run this code to change the row names of the cars to numbers.
+row.names(mtcars) <- c(1:32)
+
+# Now print out the data frame by running the code below.
+mtcars
+
+# It's tedious to relabel our data frame with the right car names
+# so let's reload the data set and print out the first ten rows.
+
+data(mtcars)
+head(mtcars, 10)
+
+# The head() function prints out the first six rows of a data frame
+# by default. Run the code below to see.
+head(mtcars)
+
+# I think you'll know what this does.
+tail(mtcars, 3)
+
+
+# 11. We've run nine commands so far:
+#      c, nchar, data, str, dim, names, row.names, head, and tail.
+
+# All of these commands took some inputs or arguments.
+# To determine if a command takes more arguments or to learn
+# about any default settings, you can look up the documentation
+# using '?' before the command, much like you did to learn about
+# the mtcars data set and row.names.
+
+
+
+# 12. Let's examine our car data more closely. We can access
+# an individual variable (or column) from the data frame using
+# the '$' sign. Run the code below to print out the variable
+# miles per gallon. This is the mpg column in the data frame.
+
+mtcars$mpg
+
+# Print out any two other variables to the console.
+
+
+
+# mtcars$mpg is a vector containing the mpg (miles per gallon) of
+# the 32 cars. Run this next line of code to get the average mpg
+# for all the cars. What is it?
+
+# Enter this number for the quiz on the Udacity website.
+# https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129314/m-830829287
+
+mean(mtcars$mpg)
+
+
+
--- a/lesson2/demystifyingR2.Rmd
+++ b/lesson2/demystifyingR2.Rmd
@@ -0,0 +1,179 @@
+Demystifying R Part 2
+========================================================
+
+You might see a warning message just above this file. Something like...
+"R Markdown requires the knitr package (version 1.2 or higher)"
+Don't worry about this for now. We'll address it at the end of this file.
+
+1. Run the following command to see what it does.
+```{r}
+summary(mtcars)
+```
+
+If you know about quantiles, then the output should look familiar.
+If not, you probably recognize the min (minimum), median, mean, and max (maximum).
+We'll go over quantiles in Lesson 3 so don't worry if the output seems overwhelming.
+
+The str() and summary() functions are helpful commands when working with a new data set.
+The str() function gives us the variable names and their types.
+The summary() function gives us an idea of the values a variable can take on.
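A minimal sketch of the two calls side by side, assuming a fresh R session with the built-in mtcars data. The choice of the mpg column and the name `mpg_summary` are only for illustration.

```r
# str() reports the dimensions, the column names, and each column's type.
str(mtcars)

# summary() on one numeric column returns a named summary object with
# the minimum, the quartiles, the mean, and the maximum.
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# The labels attached to each summary value.
names(mpg_summary)
```

Calling summary() on the whole data frame, as in step 1, repeats this calculation for every column.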
+
+2. In 2013, the average mpg (miles per gallon) for a car was 23 mpg.
+The car models in the mtcars data set come from the years 1973-1974.
+Subset the data so that you create a new data frame that contains
+cars that get 23 or more mpg (miles per gallon). Save it to a new data
+frame called efficient.
+```{r}
+
+```
+
+3. How many cars get more than 23 mpg? Use one of the commands you
+learned in demystifying.R to answer this question.
+```{r}
+
+```
+
+4. We can also use logical operators to find out which car(s) get greater
+than 30 miles per gallon (mpg) and have more than 100 raw horsepower.
+```{r}
+subset(mtcars, mpg > 30 & hp > 100)
+```
+
+There's only one car that gets more than 30 mpg and has more than 100 hp.
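The subset() call in step 4 can also be written with bracket indexing, which was introduced in demystifying.R. This is a sketch for comparison, not a replacement for the exercise; the names `fast_and_frugal` and `same_rows` are made up for the example.

```r
# subset() evaluates the condition inside the data frame, so column
# names can be used directly.
fast_and_frugal <- subset(mtcars, mpg > 30 & hp > 100)

# The same rows selected with bracket indexing: the logical vector
# picks rows, and the empty slot after the comma keeps every column.
same_rows <- mtcars[mtcars$mpg > 30 & mtcars$hp > 100, ]

# Both approaches should return the same data frame.
all.equal(fast_and_frugal, same_rows)
nrow(fast_and_frugal)
```

subset() is shorter to type in interactive work, while the bracket form makes the logical vector explicit.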
+
+5. What do you think this code does? Scroll down for the answer.
+```{r}
+subset(mtcars, mpg < 14 | disp > 390)
+```
+
+Note: You may be familiar with the || operator in Java. R uses a single & for the element-wise
+logical operator AND and a single | for the element-wise logical operator OR.
+
+
+
+
+
+
+
+
+The command above creates a data frame of cars that have mpg less than 14
+OR a displacement of more than 390. Only one of the conditions for a car
+needs to be satisfied so that the car makes it into the subset. Any of the
+cars that fit the criteria are printed to the console.
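To see how the OR condition is built up, the sketch below splits it into its two parts. The variable names `low_mpg`, `big_disp`, and `either` are hypothetical and used only here.

```r
# Each comparison yields one TRUE/FALSE value per car.
low_mpg  <- mtcars$mpg < 14
big_disp <- mtcars$disp > 390

# | combines the two vectors element by element: a car is kept when
# at least one of its two values is TRUE.
either <- low_mpg | big_disp

# Count the cars meeting each condition and the combined condition.
c(low_mpg = sum(low_mpg), big_disp = sum(big_disp), either = sum(either))

# The rows kept by subset() are the ones flagged in `either`.
all.equal(subset(mtcars, mpg < 14 | disp > 390), mtcars[either, ])
```

Because a car can satisfy both conditions, the combined count can be smaller than the sum of the two separate counts.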
+
+Now you try some.
+
+6. Print the cars that have a 1/4 mile time (qsec) less than or equal to
+16.90 seconds to the console.
+```{r}
+
+```
+
+7. Save the subset of cars that weigh under 2000 pounds (weight, wt, is measured in units of 1000 lb)
+to a variable called lightCars. Print the number of cars and the subset to the console.
+```{r}
+
+```
+
+8. You can also create new variables in a data frame. Let's say you wanted
+to have the year of each car's model. We can create the variable
+mtcars$year. Here we'll assume that all of the models were from 1974.
+Run the code below.
+```{r}
+mtcars$year <- 1974
+```
+
+Notice how the number of variables changed in the workspace. You can
+also see the result by double clicking on mtcars in the workspace and
+examining the data in a table.
+
+To drop a variable, subset the data frame and select the variable you
+want to drop with a negative sign in front of it.
+```{r}
+mtcars <- subset(mtcars, select = -year)
+```
+
+Notice, we are back to 11 variables in the data frame.
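The add-then-drop cycle can be tried safely on a copy of the data. This sketch assumes a copy named `cars` (a name chosen for the example) and also shows assignment of NULL, a common base R alternative to subset() for removing a column.

```r
# Work on a copy so the original mtcars stays untouched.
cars <- mtcars

# Adding a column: the single value 1974 is repeated for every row.
cars$year <- 1974
ncol(cars)

# Dropping it with subset(), as in the text above.
cars <- subset(cars, select = -year)
ncol(cars)

# Assigning NULL removes a column as well.
cars$year <- 1974
cars$year <- NULL
ncol(cars)
```

Either way, the column count returns to what it was before the variable was added.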
+
+9. What do you think this code does? Run it to find out.
+```{r}
+mtcars$year <- c(1973, 1974)
+```
+
+Open the table of values to see what values year takes on.
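What happens in step 9 is usually called recycling: a vector shorter than the column is repeated until every row has a value. A small sketch on a copy of the data (the name `cars` is an assumption for the example):

```r
# A length-2 vector assigned to a 32-row column is recycled:
# 1973, 1974, 1973, 1974, ... until every row has a value.
cars <- mtcars
cars$year <- c(1973, 1974)

# The first few values alternate between the two years.
head(cars$year)

# Each year appears the same number of times.
table(cars$year)
```

Recycling into a data frame column works here because 32 is a multiple of 2; the years assigned this way are not the cars' real model years.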
+
+Drop the year variable from the data set.
+```{r}
+
+```
+
+
+10. Now you are going to get a preview of ifelse(). For those new
+to programming this example may be confusing. See if you can understand
+the code by running the commands one line at a time. Read the output and
+make sense of what the code is doing at each step.
+
+If you are having trouble don't worry, we will review the ifelse statement
+at the end of Lesson 3. You won't be quizzed on it, and it's not essential
+to keep going in this course. We just want you to try to get familiar with
+more code.
+```{r}
+mtcars$wt
+cond <- mtcars$wt < 3
+cond
+mtcars$weight_class <- ifelse(cond, 'light', 'average')
+mtcars$weight_class
+cond <- mtcars$wt > 3.5
+mtcars$weight_class <- ifelse(cond, 'heavy', mtcars$weight_class)
+mtcars$weight_class
+```
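The two-pass ifelse() pattern is easier to follow on a handful of values. The weights below are hypothetical, in the same units as mtcars$wt (1000 lb), and `weight_class` here is a plain vector rather than a column.

```r
# A hypothetical weight vector, in units of 1000 lb.
wt <- c(2.2, 3.2, 3.8, 5.4)

# First pass: every value is labelled either 'light' or 'average'.
weight_class <- ifelse(wt < 3, "light", "average")
weight_class

# Second pass: overwrite with 'heavy' where wt > 3.5 and keep the
# first-pass label everywhere else.
weight_class <- ifelse(wt > 3.5, "heavy", weight_class)
weight_class
```

Passing the existing labels as the third argument is what lets the second call change some entries without disturbing the rest.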
+
+You have some variables in your workspace or environment like 'cond' and
+'efficient'. You want to be careful that you don't bring too much data
+into R at once since R will hold all the data in working memory. We have
+nothing to worry about here, but let's delete those variables from the
+workspace.
+
+```{r}
+rm(cond)
+rm(efficient)
+```
+
+Save this file if you haven't done so yet.
+
+
+You'll have the opportunity to create one Rmd file for the final project in
+this class and submit the Rmd file and knitted output (or HTML file). You'll
+need the knitr package to do that so let's install that now. **Uncomment** the
+following two lines of code and run them.
+
+```{r}
+# install.packages('knitr', dependencies = TRUE)
+# library(knitr)
+```
+
+Once you've installed knitr, **comment** out the two lines of code above.
+When you click the **Knit HTML** button a web page will be generated that
+includes both content (text and text formatting from Markdown) as well as
+the output of any embedded R code chunks within the document.
+
+
+You've reached the end of the file so now it's time to write some code to
+answer a question to continue on in Lesson 2.
+
+Which car(s) have an mpg (miles per gallon) greater than or equal to 30
+OR hp (horsepower) less than 60? Create an R chunk of code to answer the question.
+
+
+
+Once you have the answer, go to the [Udacity website](https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129319/m-811719066) to continue with Lesson 2.
+
+Note: You use brackets around text followed by two parentheses to create a link.
+There must be no spaces between the brackets and the parentheses. Paste or type
+the link into the parentheses. This also works on the discussions!
+
+And if you want to see all of your HARD WORK from this file, click
+the **KNIT HTML** button now. (You may or may not need to restart R).
+
+# CONGRATULATIONS
+#### You'll be exploring data soon with your new knowledge of R.
--- a/lesson2/reddit.csv
+++ b/lesson2/reddit.csv
--- a/lesson2/stateData.csv
+++ b/lesson2/stateData.csv
@@ -0,0 +1,51 @@
+"","state.abb","state.area","state.region","population","income","illiteracy","life.exp","murder","highSchoolGrad","frost","area"
+"Alabama","AL","51609","2","3615","3624","2.1","69.05","15.1","41.3","20","50708"
+"Alaska","AK","589757","4","365","6315","1.5","69.31","11.3","66.7","152","566432"
+"Arizona","AZ","113909","4","2212","4530","1.8","70.55","7.8","58.1","15","113417"
+"Arkansas","AR","53104","2","2110","3378","1.9","70.66","10.1","39.9","65","51945"
+"California","CA","158693","4","21198","5114","1.1","71.71","10.3","62.6","20","156361"
+"Colorado","CO","104247","4","2541","4884","0.7","72.06","6.8","63.9","166","103766"
+"Connecticut","CT","5009","1","3100","5348","1.1","72.48","3.1","56","139","4862"
+"Delaware","DE","2057","2","579","4809","0.9","70.06","6.2","54.6","103","1982"
+"Florida","FL","58560","2","8277","4815","1.3","70.66","10.7","52.6","11","54090"
+"Georgia","GA","58876","2","4931","4091","2","68.54","13.9","40.6","60","58073"
+"Hawaii","HI","6450","4","868","4963","1.9","73.6","6.2","61.9","0","6425"
+"Idaho","ID","83557","4","813","4119","0.6","71.87","5.3","59.5","126","82677"
+"Illinois","IL","56400","3","11197","5107","0.9","70.14","10.3","52.6","127","55748"
+"Indiana","IN","36291","3","5313","4458","0.7","70.88","7.1","52.9","122","36097"
+"Iowa","IA","56290","3","2861","4628","0.5","72.56","2.3","59","140","55941"
+"Kansas","KS","82264","3","2280","4669","0.6","72.58","4.5","59.9","114","81787"
+"Kentucky","KY","40395","2","3387","3712","1.6","70.1","10.6","38.5","95","39650"
+"Louisiana","LA","48523","2","3806","3545","2.8","68.76","13.2","42.2","12","44930"
+"Maine","ME","33215","1","1058","3694","0.7","70.39","2.7","54.7","161","30920"
+"Maryland","MD","10577","2","4122","5299","0.9","70.22","8.5","52.3","101","9891"
+"Massachusetts","MA","8257","1","5814","4755","1.1","71.83","3.3","58.5","103","7826"
+"Michigan","MI","58216","3","9111","4751","0.9","70.63","11.1","52.8","125","56817"
+"Minnesota","MN","84068","3","3921","4675","0.6","72.96","2.3","57.6","160","79289"
+"Mississippi","MS","47716","2","2341","3098","2.4","68.09","12.5","41","50","47296"
+"Missouri","MO","69686","3","4767","4254","0.8","70.69","9.3","48.8","108","68995"
+"Montana","MT","147138","4","746","4347","0.6","70.56","5","59.2","155","145587"
+"Nebraska","NE","77227","3","1544","4508","0.6","72.6","2.9","59.3","139","76483"
+"Nevada","NV","110540","4","590","5149","0.5","69.03","11.5","65.2","188","109889"
+"New Hampshire","NH","9304","1","812","4281","0.7","71.23","3.3","57.6","174","9027"
+"New Jersey","NJ","7836","1","7333","5237","1.1","70.93","5.2","52.5","115","7521"
+"New Mexico","NM","121666","4","1144","3601","2.2","70.32","9.7","55.2","120","121412"
+"New York","NY","49576","1","18076","4903","1.4","70.55","10.9","52.7","82","47831"
+"North Carolina","NC","52586","2","5441","3875","1.8","69.21","11.1","38.5","80","48798"
+"North Dakota","ND","70665","3","637","5087","0.8","72.78","1.4","50.3","186","69273"
+"Ohio","OH","41222","3","10735","4561","0.8","70.82","7.4","53.2","124","40975"
+"Oklahoma","OK","69919","2","2715","3983","1.1","71.42","6.4","51.6","82","68782"
+"Oregon","OR","96981","4","2284","4660","0.6","72.13","4.2","60","44","96184"
+"Pennsylvania","PA","45333","1","11860","4449","1","70.43","6.1","50.2","126","44966"
+"Rhode Island","RI","1214","1","931","4558","1.3","71.9","2.4","46.4","127","1049"
+"South Carolina","SC","31055","2","2816","3635","2.3","67.96","11.6","37.8","65","30225"
+"South Dakota","SD","77047","3","681","4167","0.5","72.08","1.7","53.3","172","75955"
+"Tennessee","TN","42244","2","4173","3821","1.7","70.11","11","41.8","70","41328"
+"Texas","TX","267339","2","12237","4188","2.2","70.9","12.2","47.4","35","262134"
+"Utah","UT","84916","4","1203","4022","0.6","72.9","4.5","67.3","137","82096"
+"Vermont","VT","9609","1","472","3907","0.6","71.64","5.5","57.1","168","9267"
+"Virginia","VA","40815","2","4981","4701","1.4","70.08","9.5","47.8","85","39780"
+"Washington","WA","68192","4","3559","4864","0.6","71.72","4.3","63.5","32","66570"
+"West Virginia","WV","24181","2","1799","3617","1.4","69.48","6.7","41.6","100","24070"
+"Wisconsin","WI","56154","3","4589","4468","0.7","72.48","3","54.5","149","54464"
+"Wyoming","WY","97914","4","376","4566","0.6","70.29","6.9","62.9","173","97203"
--- a/lesson3/lesson3_student.rmd
+++ b/lesson3/lesson3_student.rmd
@@ -0,0 +1,283 @@
+Lesson 3
+========================================================
+
+***
+
+### What to Do First?
+Notes:
+
+***
+
+### Pseudo-Facebook User Data
+Notes:
+
+```{r Pseudo-Facebook User Data}
+
+```
+
+***
+
+### Histogram of Users' Birthdays
+Notes:
+
+```{r Histogram of Users\' Birthdays}
+install.packages('ggplot2')
+library(ggplot2)
+```
+
+***
+
+#### What are some things that you notice about this histogram?
+Response:
+
+***
+
+### Moira's Investigation
+Notes:
+
+***
+
+### Estimating Your Audience Size
+Notes:
+
+***
+
+#### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
+Response:
+
+#### How many of your friends do you think saw that post?
+Response:
+
+#### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
+Response:
+
+***
+
+### Perceived Audience Size
+Notes:
+
+***
+### Faceting
+Notes:
+
+```{r Faceting}
+
+```
+
+#### Let’s take another look at our plot. What stands out to you here?
+Response:
+
+***
+
+### Be Skeptical - Outliers and Anomalies
+Notes:
+
+***
+
+### Moira's Outlier
+Notes:
+#### Which case do you think applies to Moira’s outlier?
+Response:
+
+***
+
+### Friend Count
+Notes:
+
+#### What code would you enter to create a histogram of friend counts?
+
+```{r Friend Count}
+
+```
+
+#### How is this plot similar to Moira's first plot?
+Response:
+
+***
+
+### Limiting the Axes
+Notes:
+
+```{r Limiting the Axes}
+
+```
+
+### Exploring with Bin Width
+Notes:
+
+***
+
+### Adjusting the Bin Width
+Notes:
+
+### Faceting Friend Count
+```{r Faceting Friend Count}
+# What code would you add to facet the histogram by gender?
+# Add it to the code below.
+qplot(x = friend_count, data = pf, binwidth = 10) +
+  scale_x_continuous(limits = c(0, 1000),
+                     breaks = seq(0, 1000, 50))
+```
+
+***
+
+### Omitting NA Values
+Notes:
+
+```{r Omitting NA Values}
+
+```
+
+***
+
+### Statistics 'by' Gender
+Notes:
+
+```{r Statistics \'by\' Gender}
+
+```
+
+#### Who on average has more friends: men or women?
+Response:
+
+#### What's the difference between the median friend count for women and men?
+Response:
+
+#### Why would the median be a better measure than the mean?
+Response:
+
+***
+
+### Tenure
+Notes:
+
+```{r Tenure}
+
+```
+
+***
+
+#### How would you create a histogram of tenure by year?
+
+```{r Tenure Histogram by Year}
+
+```
+
+***
+
+### Labeling Plots
+Notes:
+
+```{r Labeling Plots}
+
+```
+
+***
+
+### User Ages
+Notes:
+
+```{r User Ages}
+
+```
+
+#### What do you notice?
+Response:
+
+***
+
+### The Spread of Memes
+Notes:
+
+***
+
+### Lada's Money Bag Meme
+Notes:
+
+***
+
+### Transforming Data
+Notes:
+
+***
+
+### Add a Scaling Layer
+Notes:
+
+```{r Add a Scaling Layer}
+
+```
+
+***
+
+
+### Frequency Polygons
+
+```{r Frequency Polygons}
+
+```
+
+***
+
+### Likes on the Web
+Notes:
+
+```{r Likes on the Web}
+
+```
+
+
+***
+
+### Box Plots
+Notes:
+
+```{r Box Plots}
+
+```
+
+#### Adjust the code to focus on users who have friend counts between 0 and 1000.
+
+```{r}
+
+```
+
+***
+
+### Box Plots, Quartiles, and Friendships
+Notes:
+
+```{r Box Plots, Quartiles, and Friendships}
+
+```
+
+#### On average, who initiated more friendships in our sample: men or women?
+Response:
+#### Write about some ways that you can verify your answer.
+Response:
+```{r Friend Requests by Gender}
+
+```
+
+Response:
+
+***
+
+### Getting Logical
+Notes:
+
+```{r Getting Logical}
+
+```
+
+Response:
+
+***
+
+### Analyzing One Variable
+Reflection:
+
+***
+
+Click **Knit HTML** to see all of your hard work and to have an html
+page of this lesson, your answers, and your notes!
--- a/lesson3/pseudo_facebook.tsv
+++ b/lesson3/pseudo_facebook.tsv
--- a/lesson4/correlation_images.jpeg
+++ b/lesson4/correlation_images.jpeg
--- a/lesson4/lesson4_student.rmd
+++ b/lesson4/lesson4_student.rmd
@@ -0,0 +1,268 @@
+Lesson 4
+========================================================
+
+***
+
+### Scatterplots and Perceived Audience Size
+Notes:
+
+***
+
+### Scatterplots
+Notes:
+
+```{r Scatterplots}
+
+```
+
+***
+
+#### What are some things that you notice right away?
+Response:
+
+***
+
+### ggplot Syntax
+Notes:
+
+```{r ggplot Syntax}
+
+```
+
+***
+
+### Overplotting
+Notes:
+
+```{r Overplotting}
+
+```
+
+#### What do you notice in the plot?
+Response:
+
+***
+
+### Coord_trans()
+Notes:
+
+```{r Coord_trans()}
+
+```
+
+#### Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
+
+```{r}
+
+```
+
+#### What do you notice?
+
+***
+
+### Alpha and Jitter
+Notes:
+
+```{r Alpha and Jitter}
+
+```
+
+***
+
+### Overplotting and Domain Knowledge
+Notes:
+
+***
+
+### Conditional Means
+Notes:
+
+```{r Conditional Means}
+
+```
+
+Create your plot!
+
+```{r Conditional Means Plot}
+
+```
+
+***
+
+### Overlaying Summaries with Raw Data
+Notes:
+
+```{r Overlaying Summaries with Raw Data}
+
+```
+
+#### What are some of your observations of the plot?
+Response:
+
+***
+
+### Moira: Histogram Summary and Scatterplot
+See the Instructor Notes of this video to download Moira's paper on perceived audience size and to see the final plot.
+
+Notes:
+
+***
+
+### Correlation
+Notes:
+
+```{r Correlation}
+
+```
+
+Look up the documentation for the cor.test function.
+
+What's the correlation between age and friend count? Round to three decimal places.
+Response:
+
+***
+
+### Correlation on Subsets
+Notes:
+
+```{r Correlation on Subsets}
+with(                 , cor.test(age, friend_count))
+```
+
+***
+
+### Correlation Methods
+Notes:
+
+***
+
+### Create Scatterplots
+Notes:
+
+```{r}
+
+```
+
+***
+
+### Strong Correlations
+Notes:
+
+```{r Strong Correlations}
+
+```
+
+What's the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
+
+```{r Correlation Calculation}
+
+```
+
+Response:
+
+***
+
+### Moira on Correlation
+Notes:
+
+***
+
+### More Caution with Correlation
+Notes:
+
+```{r More Caution With Correlation}
+install.packages('alr3')
+library(alr3)
+```
+
+Create your plot!
+
+```{r Temp vs Month}
+
+```
+
+***
+
+### Noisy Scatterplots
+a. Take a guess for the correlation coefficient for the scatterplot.
+
+b. What is the actual correlation of the two variables?
+(Round to the thousandths place)
+
+```{r Noisy Scatterplots}
+
+```
+
+***
+
+### Making Sense of Data
+Notes:
+
+```{r Making Sense of Data}
+
+```
+
+***
+
+### A New Perspective
+
+What do you notice?
+Response:
+
+Watch the solution video and check out the Instructor Notes!
+Notes:
+
+***
+
+### Understanding Noise: Age to Age Months
+Notes:
+
+```{r Understanding Noise: Age to Age Months}
+
+```
+
+***
+
+### Age with Months Means
+
+```{r Age with Months Means}
+
+```
+
+Programming Assignment
+```{r Programming Assignment}
+
+```
+
+***
+
+### Noise in Conditional Means
+
+```{r Noise in Conditional Means}
+
+```
+
+***
+
+### Smoothing Conditional Means
+Notes:
+
+```{r Smoothing Conditional Means}
+
+```
+
+***
+
+### Which Plot to Choose?
+Notes:
+
+***
+
+### Analyzing Two Variables
+Reflection:
+
+***
+
+Click **Knit HTML** to see all of your hard work and to have an html
+page of this lesson, your answers, and your notes!
+
--- a/lesson5/lesson5_student.rmd
+++ b/lesson5/lesson5_student.rmd
@@ -0,0 +1,253 @@
+Lesson 5
+========================================================
+
+### Multivariate Data
+Notes:
+
+***
+
+### Moira Perceived Audience Size Colored by Age
+Notes:
+
+***
+
+### Third Qualitative Variable
+Notes:
+
+```{r Third Qualitative Variable}
+ggplot(aes(x = gender, y = age),
+       data = subset(pf, !is.na(gender))) + geom_boxplot()
+```
+
+***
+
+### Plotting Conditional Summaries
+Notes:
+
+```{r Plotting Conditional Summaries}
+
+```
+
+***
+
+### Thinking in Ratios
+Notes:
+
+***
+
+### Wide and Long Format
+Notes:
+
+***
+
+### Reshaping Data
+Notes:
+
+```{r}
+install.packages('reshape2')
+library(reshape2)
+```
+
+
+***
+
+### Ratio Plot
+Notes:
+
+```{r Ratio Plot}
+
+```
+
+***
+
+### Third Quantitative Variable
+Notes:
+
+```{r Third Quantitative Variable}
+
+```
+
+***
+
+### Cut a Variable
+Notes:
+
+```{r Cut a Variable}
+
+```
+
+***
+
+### Plotting it All Together
+Notes:
+
+```{r Plotting it All Together}
+
+```
+
+***
+
+### Plot the Grand Mean
+Notes:
+
+```{r Plot the Grand Mean}
+
+```
+
+***
+
+### Friending Rate
+Notes:
+
+```{r Friending Rate}
+
+```
+
+***
+
+### Friendships Initiated
+Notes:
+
+What is the median friend rate?
+
+What is the maximum friend rate?
+
+```{r Friendships Initiated}
+
+```
+
+***
+
+### Bias-Variance Tradeoff Revisited
+Notes:
+
+```{r Bias-Variance Tradeoff Revisited}
+
+ggplot(aes(x = tenure, y = friendships_initiated / tenure),
+       data = subset(pf, tenure >= 1)) +
+  geom_line(aes(color = year_joined.bucket),
+            stat = 'summary',
+            fun.y = mean)
+
+ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
+       data = subset(pf, tenure > 0)) +
+  geom_line(aes(color = year_joined.bucket),
+            stat = "summary",
+            fun.y = mean)
+
+ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
+       data = subset(pf, tenure > 0)) +
+  geom_line(aes(color = year_joined.bucket),
+            stat = "summary",
+            fun.y = mean)
+
+ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
+       data = subset(pf, tenure > 0)) +
+  geom_line(aes(color = year_joined.bucket),
+            stat = "summary",
+            fun.y = mean)
+
+```
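The x expressions in the plots above, such as `7 * round(tenure / 7)`, round each tenure to the nearest multiple of the bin width. A small sketch with hypothetical tenure values (in days) shows the effect of each width:

```r
# Hypothetical tenure values in days.
tenure <- c(1, 3, 4, 10, 11, 50, 100)

# Rounding to the nearest multiple of the bin width groups nearby
# days into one x value, so each mean is taken over more users.
weekly    <- 7  * round(tenure / 7)
monthly   <- 30 * round(tenure / 30)
quarterly <- 90 * round(tenure / 90)

# Compare the original values with their binned versions.
data.frame(tenure, weekly, monthly, quarterly)
```

Wider bins average over more observations, which tends to give a smoother line at the cost of hiding short-term detail.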
+
+***
+
+### Sean's NFL Fan Sentiment Study
+Notes:
+
+***
+
+### Introducing the Yogurt Data Set
+Notes:
+
+***
+
+### Histograms Revisited
+Notes:
+
+```{r Histograms Revisited}
+
+```
+
+***
+
+### Number of Purchases
+Notes:
+
+```{r Number of Purchases}
+
+```
+
+***
+
+### Prices over Time
+Notes:
+
+```{r Prices over Time}
+
+```
+
+***
+
+### Sampling Observations
+Notes:
+
+***
+
+### Looking at Samples of Households
+
+```{r Looking at Sample of Households}
+
+```
+
+***
+
+### The Limits of Cross Sectional Data
+Notes:
+
+***
+
+### Many Variables
+Notes:
+
+***
+
+### Scatterplot Matrix
+Notes:
+
+***
+
+### Even More Variables
+Notes:
+
+***
+
+### Heat Maps
+Notes:
+
+```{r}
+nci <- read.table("nci.tsv")
+colnames(nci) <- c(1:64)
+```
+
+```{r}
+nci.long.samp <- melt(as.matrix(nci[1:200,]))
+names(nci.long.samp) <- c("gene", "case", "value")
+head(nci.long.samp)
+
+ggplot(aes(y = gene, x = case, fill = value),
+  data = nci.long.samp) +
+  geom_tile() +
+  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
+```
+
+
+***
+
+### Analyzing Three or More Variables
+Reflection:
+
+***
+
+Click **Knit HTML** to see all of your hard work and to have an html
+page of this lesson, your answers, and your notes!
+
--- a/lesson5/nci.tsv
+++ b/lesson5/nci.tsv
--- a/lesson5/scatterplotMatrix.pdf
+++ b/lesson5/scatterplotMatrix.pdf
--- a/lesson5/yogurt.csv
+++ b/lesson5/yogurt.csv
--- a/lesson6/diamondsbig.csv
+++ b/lesson6/diamondsbig.csv
--- a/lesson6/ggpairs_landscape.pdf
+++ b/lesson6/ggpairs_landscape.pdf
--- a/lesson6/lesson6_student.rmd
+++ b/lesson6/lesson6_student.rmd
@@ -0,0 +1,289 @@
+Lesson 6
+========================================================
+
+### Welcome
+Notes:
+
+***
+
+### Scatterplot Review
+
+```{r Scatterplot Review}
+
+```
+
+***
+
+### Price and Carat Relationship
+Response:
+
+***
+
+### Frances Gerety
+Notes:
+
+#### A diamond is
+
+
+***
+
+### The Rise of Diamonds
+Notes:
+
+***
+
+### ggpairs Function
+Notes:
+
+```{r ggpairs Function}
+# install these if necessary
+install.packages('GGally')
+install.packages('scales')
+install.packages('memisc')
+install.packages('lattice')
+install.packages('MASS')
+install.packages('car')
+install.packages('reshape')
+install.packages('plyr')
+
+# load the ggplot graphics package and the others
+library(ggplot2)
+library(GGally)
+library(scales)
+library(memisc)
+
+# sample 10,000 diamonds from the data set
+set.seed(20022012)
+diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
+ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
+```
+
+What are some things you notice in the ggpairs output?
+Response:
+
+***
+
+### The Demand of Diamonds
+Notes:
+
+```{r The Demand of Diamonds}
+
+```
+
+***
+
+### Connecting Demand and Price Distributions
+Notes:
+
+***
+
+### Scatterplot Transformation
+
+```{r Scatterplot Transformation}
+
+```
+
+
+### Create a new function to transform the carat variable
+
+```{r cuberoot transformation}
+cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
+                                      inverse = function(x) x^3)
+```
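+
+A quick sanity check, not part of the original lesson: the object returned by
+`trans_new()` carries both a `transform` and an `inverse` function, and applying
+one after the other should give back the original carat values. The chunk below
+is a minimal sketch. The names `tr`, `x`, and `round_trip` are illustrative, and
+`cuberoot_trans` is redefined only so the chunk runs on its own.
+
+```{r Check cuberoot_trans}
+library(scales)
+
+# Same definition as above, repeated so this chunk is self-contained
+cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
+                                      inverse = function(x) x^3)
+
+tr <- cuberoot_trans()
+x <- c(0.2, 0.5, 1, 2, 3)   # the carat breaks used in the plots
+
+# Transform to the cube-root scale, then map back to carats
+round_trip <- tr$inverse(tr$transform(x))
+
+# TRUE when the round trip matches x up to floating-point tolerance
+all.equal(round_trip, x)
+```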
+
+#### Use the cuberoot_trans function
+```{r Use cuberoot_trans}
+ggplot(aes(carat, price), data = diamonds) + 
+  geom_point() + 
+  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
+                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
+  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
+                     breaks = c(350, 1000, 5000, 10000, 15000)) +
+  ggtitle('Price (log10) by Cube-Root of Carat')
+```
+
+***
+
+### Overplotting Revisited
+
+```{r Sort and Head Tables}
+
+```
+
+
+```{r Overplotting Revisited}
+ggplot(aes(carat, price), data = diamonds) + 
+  geom_point() + 
+  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
+                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
+  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
+                     breaks = c(350, 1000, 5000, 10000, 15000)) +
+  ggtitle('Price (log10) by Cube-Root of Carat')
+```
+
+***
+
+### Other Qualitative Factors
+Notes:
+
+***
+
+### Price vs. Carat and Clarity
+
+Alter the code below.
+```{r Price vs. Carat and Clarity}
+# install and load the RColorBrewer package
+install.packages('RColorBrewer')
+library(RColorBrewer)
+
+ggplot(aes(x = carat, y = price), data = diamonds) + 
+  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
+  scale_color_brewer(type = 'div',
+    guide = guide_legend(title = 'Clarity', reverse = T,
+    override.aes = list(alpha = 1, size = 2))) +  
+  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
+    breaks = c(0.2, 0.5, 1, 2, 3)) + 
+  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
+    breaks = c(350, 1000, 5000, 10000, 15000)) +
+  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
+```
+
+***
+
+### Clarity and Price
+Response:
+
+***
+
+### Price vs. Carat and Cut
+
+Alter the code below.
+```{r Price vs. Carat and Cut}
+ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
+  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
+  scale_color_brewer(type = 'div',
+                     guide = guide_legend(title = 'Clarity', reverse = T,
+                                          override.aes = list(alpha = 1, size = 2))) +  
+  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
+                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
+  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
+                     breaks = c(350, 1000, 5000, 10000, 15000)) +
+  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
+```
+
+***
+
+### Cut and Price
+Response:
+
+***
+
+### Price vs. Carat and Color
+
+Alter the code below.
+```{r Price vs. Carat and Color}
+ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
+  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
+  scale_color_brewer(type = 'div',
+                     guide = guide_legend(title = 'Cut', reverse = T,
+                                          override.aes = list(alpha = 1, size = 2))) +  
+  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
+                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
+  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
+                     breaks = c(350, 1000, 5000, 10000, 15000)) +
+  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
+```
+
+***
+
+### Color and Price
+Response:
+
+***
+
+### Linear Models in R
+Notes:
+
+Response:
+
+***
+
+### Building the Linear Model
+Notes:
+
+```{r Building the Linear Model}
+m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
+m2 <- update(m1, ~ . + carat)
+m3 <- update(m2, ~ . + cut)
+m4 <- update(m3, ~ . + color)
+m5 <- update(m4, ~ . + clarity)
+mtable(m1, m2, m3, m4, m5)
+```
+
+Notice how adding cut to our model does not help explain much of the variance
+in the price of diamonds. This fits with our exploration earlier.
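+
+One way to check this claim is to pull the R-squared value out of each model
+and look at how much it changes when a term is added. The chunk below is a
+sketch under the assumption that the models are fit to the `diamonds` data from
+ggplot2, as above. The name `r_squared` is illustrative, and the models are
+refit only so the chunk runs on its own.
+
+```{r Compare R-squared}
+library(ggplot2)  # provides the diamonds data set
+
+# Refit the nested models from the chunk above
+m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
+m2 <- update(m1, ~ . + carat)
+m3 <- update(m2, ~ . + cut)
+m4 <- update(m3, ~ . + color)
+m5 <- update(m4, ~ . + clarity)
+
+# R-squared for each model
+r_squared <- sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5),
+                    function(m) summary(m)$r.squared)
+r_squared
+
+# Gain in R-squared from each added term (m2 - m1, m3 - m2, ...)
+diff(r_squared)
+```
+
+A small gain from m2 to m3 means cut adds little once carat is in the model.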
+
+***
+
+### Model Problems
+Video Notes:
+
+Research:
+(Take some time to come up with 2-4 problems for the model)
+(You should spend 10-20 min on this)
+
+Response:
+
+***
+
+### A Bigger, Better Data Set
+Notes:
+
+```{r A Bigger, Better Data Set}
+install.packages('bitops')
+install.packages('RCurl')
+library('bitops')
+library('RCurl')
+
+diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
+load(rawConnection(diamondsurl))
+```
+
+The code used to obtain the data is available here:
+https://github.com/solomonm/diamonds-data
+
+### Building a Model Using the Big Diamonds Data Set
+Notes:
+
+```{r Building a Model Using the Big Diamonds Data Set}
+
+```
+
+
+***
+
+### Predictions
+
+Example Diamond from BlueNile:
+Round 1.00 Very Good I VS1 $5,601
+
+```{r}
+# Be sure you’ve loaded the memisc library and have m5 saved as an object in your workspace.
+thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
+                         color = "I", clarity="VS1")
+modelEstimate = predict(m5, newdata = thisDiamond,
+                        interval="prediction", level = .95)
+```
+
+Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% prediction interval.
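+
+Because the model's response is `log(price)`, `predict()` returns values on the
+log scale, so they need `exp()` before they can be compared with a dollar price.
+The chunk below is a minimal sketch of that step, with two assumptions: the
+model is fit to the smaller `diamonds` data from ggplot2 (so it runs without the
+download above), and the cut level is spelled "Very Good" as in that data set
+rather than "V.Good". The names `m5_small`, `exampleDiamond`, `logEstimate`, and
+`dollarEstimate` are illustrative.
+
+```{r Prediction in dollars}
+library(ggplot2)  # provides the diamonds data set
+
+# Same terms as m5, fit in one call on the smaller data set (assumption)
+m5_small <- lm(I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity,
+               data = diamonds)
+
+exampleDiamond <- data.frame(carat = 1.00, cut = "Very Good",
+                             color = "I", clarity = "VS1")
+
+# Columns fit, lwr, upr are on the log(price) scale
+logEstimate <- predict(m5_small, newdata = exampleDiamond,
+                       interval = "prediction", level = .95)
+
+# Back-transform to dollars
+dollarEstimate <- exp(logEstimate)
+dollarEstimate
+```
+
+The interval is wider in dollars than it looks on the log scale, which is worth
+keeping in mind when judging the point estimate.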
+
+***
+
+### Final Thoughts
+Notes:
+
+***
+
+Click **Knit HTML** to see all of your hard work and to get an HTML
+page of this lesson, your answers, and your notes!
+