Initial Commit with Project Code

2018-04-17 19:56:59 -08:00 · 6f865b5ff5
commit 6f865b5ff5
parent bea57818a2
16 changed files with 707837 additions and 0 deletions
--- a/lesson2/What_is_a_RMD_file.Rmd
+++ b/lesson2/What_is_a_RMD_file.Rmd
@ -0,0 +1,12 @@
 Title
 ========================================================
 This is an R Markdown document or RMD. Markdown is a simple formatting syntax for authoring web pages (click the **Help** toolbar button for more details on using R Markdown).
 When you click the **Knit HTML** button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document.
--- a/lesson2/demystifying.R
+++ b/lesson2/demystifying.R
@ -0,0 +1,261 @@
 # The goal of this file is to introduce you to the
 # R programming language. Let's start by unraveling a
 # little mystery!
 # 1. Run the code below to create the vector 'udacious'.
 # You need to highlight all of the lines of the code and then
 # run it. You should see "udacious" appear in the workspace.
 udacious <- c("Chris Saden", "Lauren Castellano",
              "Sarah Spikes","Dean Eckles",
              "Andy Brown", "Moira Burke",
              "Kunal Chawla")
 # You should see something like "chr[1:7]" in the 'Environment'
 # or 'Workspace' tab. This is because you created a 'vector' with
 # 7 names that have a 'type' of character. The arrow-like
 # '<-' symbol is the assignment operator in R, similar to the
 # equal sign '=' in other programming languages. The c() is a
 # generic function that combines arguments, in this case the
 # names of people, to form a vector.
 # A 'vector' is one of the data types in R. Vectors must contain
 # the same type of data, that is the entries must all be of the
 # same type: character (most programmers call these strings),
 # logical (TRUE or FALSE), or numeric.
 # Print out the vector udacious by running this next line of code.
 udacious
 # Notice how there are numbers next to the output.
 # Each number corresponds to the index of the entry in the vector.
 # Chris Saden is the first entry so [1]
 # Dean Eckles is the fourth entry so [4]
 # Kunal Chawla is the seventh entry so [7]
 # Depending on the size of your window you may see different numbers
 # in the output.
 # ANOTHER HELPFUL TIP: You can add values to a vector.
 # Run each line of code one at a time below to see what is happening.
 numbers <- c(1:10)
 numbers
 numbers <- c(numbers, 11:20)
 numbers
 # 2. Replace YOUR_NAME with your actual name in the vector
 # 'udacious' and run the code. Be sure to use quotes around it.
 udacious <- c("Chris Saden", "Lauren Castellano",
              "Sarah Spikes","Dean Eckles",
              "Andy Brown", "Moira Burke",
              "Kunal Chawla", YOUR_NAME)
 # Notice how R updates 'udacious' in the workspace.
 # It should now say something like 'chr[1:8]'.
 # 3. Run the following two lines of code. You can highlight both lines
 # of code and run them.
 mystery = nchar(udacious)
 mystery
 # You just created a new vector called mystery. What do you
 # think is in this vector? (scroll down for the answer)
 # Mystery is a vector that contains the number of characters
 # for each of the names in udacious, including your name.
 # 4. Run this next line of code.
 mystery == 11
 # Here we get a logical (or boolean) vector that tells us
 # which locations or indices in the vector contain a name
 # that has exactly 11 characters.
 # 5. Let's use this boolean vector, mystery == 11, to subset our
 # udacious vector. What do you think the result will be when
 # running the line of code below?
 # Think about the output before you run this next line of code.
 # Notice how there are brackets in the code. Brackets are often
 # used in R for subsetting.
 udacious[mystery == 11]
 # Scroll down for the answer
 # It's your Udacious Instructors for the course!
 # (and you may be in the output if you're lucky enough
 # to have 11 characters in YOUR_NAME) Either way, we
 # think you're pretty udacious for taking this course.
 # 6. Alright, all mystery aside...let's dive into some data!
 # The R installation has a few datasets already built into it
 # that you can play with. Right now, you'll load one of these,
 # which is named mtcars.
 # Run this next command to load the mtcars data.
 data(mtcars)
 # You should see mtcars appear in the 'Environment' tab with
 # <Promise> listed next to it. 
 # The object (mtcars) appears as a 'Promise' object in the
 # workspace until we run some code that uses the object.
 # R has stored the mtcars data into a spreadsheet-like object
 # called a data frame. Run the next command to see what variables
 # are in the data set and to fully load the data set as an
 # object in R. You should see <Promise> disappear when you
 # run the next line of code.
 # Visit http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Promise-objects
 # if you want the expert insight on Promise objects. You won't
 # need the info on Promise objects to be successful in this course.
 names(mtcars)
 # names(mtcars) should output all the variable
 # names in the data set. You might notice that the car names
 # are not a variable in the data set. The car names have been saved
 # as row names. More on this later.
 # You should also see how many observations (obs.) are in the
 # data frame and the number of variables on each observation.
 # 7. To get more information on the data set and the variables
 # run this next line of code.
 ?mtcars
 # You can type a '?' before any command or a data set to learn
 # more about it. The details and documentation will appear in
 # the 'Help' tab.
 # 8. To print out the data, run this next line as code.
 mtcars
 # Scroll up and down in the console to check out the data.
 # This is the entire data frame printed out.
 # 9. Run these next two functions, one at a time,
 # and see if you can figure out what they do.
 str(mtcars)
 dim(mtcars)
 # Scroll down for the answer.
 # The first command, str(mtcars), gives us the structure of the
 # data frame. It lists the variable names, the type of each variable
 # (all of these variables are numerics) and some values for each
 # variable.
 # The second command, dim(mtcars), should output '[1] 32 11'
 # to the console. The [1] indicates that 32 is the first value
 # in the output.
 # R uses 1 to start indexing (AND NOT ZERO BASED INDEXING as is true
 # of many other programming languages.)
 # 10. Read the documentation for row.names if you want to know more.
 ?row.names
 # Run this code to see the current row names in the data frame.
 row.names(mtcars)
 # Run this code to change the row names of the cars to numbers.
 row.names(mtcars) <- c(1:32)
 # Now print out the data frame by running the code below.
 mtcars
 # It's tedious to relabel our data frame with the right car names
 # so let's reload the data set and print out the first ten rows.
 data(mtcars)
 head(mtcars, 10)
 # The head() function prints out the first six rows of a data frame
 # by default. Run the code below to see.
 head(mtcars)
 # I think you'll know what this does.
 tail(mtcars, 3)
 # 11. We've run nine commands so far:
 #      c, nchar, data, str, dim, names, row.names, head, and tail.
 # All of these commands took some inputs or arguments.
 # To determine if a command takes more arguments or to learn
 # about any default settings, you can look up the documentation
 # using '?' before the command, much like you did to learn about
 # the mtcars data set and the row.names
 # 12. Let's examine our car data more closely. We can access an
 # individual variable (or column) from the data frame using
 # the '$' sign. Run the code below to print out the variable
 # miles per gallon. This is the mpg column in the data frame.
 mtcars$mpg
 # Print out any two other variables to the console.
 # This is a vector containing the mpg (miles per gallon) of
 # the 32 cars. Run this next line of code to get the average mpg for
 # all the cars. What is it?
 # Enter this number for the quiz on the Udacity website.
 # https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129314/m-830829287
 mean(mtcars$mpg)
--- a/lesson2/demystifyingR2.Rmd
+++ b/lesson2/demystifyingR2.Rmd
@ -0,0 +1,179 @@
 Demystifying R Part 2
 ========================================================
 You might see a warning message just above this file. Something like...
 "R Markdown requires the knitr package (version 1.2 or higher)"
 Don't worry about this for now. We'll address it at the end of this file.
 1. Run the following command to see what it does.
 ```{r}
 summary(mtcars)
 ```
 If you know about quantiles, then the output should look familiar.
 If not, you probably recognize the min (minimum), median, mean, and max (maximum).
 We'll go over quantiles in Lesson 3 so don't worry if the output seems overwhelming.
 The str() and summary() functions are helpful commands when working with a new data set.
 The str() function gives us the variable names and their types.
 The summary() function gives us an idea of the values a variable can take on.
 2. In 2013, the average mpg (miles per gallon) for a car was 23 mpg.
 The car models in the mtcars data set come from the years 1973-1974.
 Subset the data so that you create a new data frame that contains
 cars that get 23 or more mpg (miles per gallon). Save it to a new data
 frame called efficient.
 ```{r}
 ```
 3. How many cars get more than 23 mpg? Use one of the commands you
 learned in demystifying.R to answer this question.
 ```{r}
 ```
 4. We can also use logical operators to find out which car(s) get greater
 than 30 miles per gallon (mpg) and have more than 100 raw horsepower.
 ```{r}
 subset(mtcars, mpg > 30 & hp > 100)
 ```
 Only one car gets more than 30 mpg and has more than 100 hp.
 5. What do you think this code does? Scroll down for the answer.
 ```{r}
 subset(mtcars, mpg < 14 | disp > 390)
 ```
 Note: You may be familiar with the || operator in Java. R uses one single & for the logical
 operator AND. It also uses one | for the logical operator OR.
 The command above creates a data frame of cars that have mpg less than 14
 OR a displacement of more than 390. Only one of the conditions for a car
 needs to be satisfied so that the car makes it into the subset. Any of the
 cars that fit the criteria are printed to the console.
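 As a small sketch of how the single & and | operators work element by element, the block below uses two made-up vectors (mpg_toy and disp_toy are illustrative names, not columns of mtcars).
 ```r
 # Hypothetical toy values, chosen only for illustration.
 mpg_toy  <- c(10, 15, 25, 33)
 disp_toy <- c(450, 400, 120, 80)
 # & is TRUE only where both conditions hold for that element.
 both <- mpg_toy < 14 & disp_toy > 390
 # | is TRUE where at least one condition holds for that element.
 either <- mpg_toy < 14 | disp_toy > 390
 print(both)
 print(either)
 ```
 subset() keeps the rows where the combined logical vector is TRUE, which is why the single-character operators are the ones to use inside it.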
 Now you try some.
 6. Print the cars that have a 1/4 mile time (qsec) less than or equal to
 16.90 seconds to the console.
 ```{r}
 ```
 7. Save the subset of cars that weigh under 2000 pounds (weight is measured in lb/1000)
 to a variable called lightCars. Print the number of cars and the subset to the console.
 ```{r}
 ```
 8. You can also create new variables in a data frame. Let's say you wanted
 to have the year of each car's model. We can create the variable
 mtcars$year. Here we'll assume that all of the models were from 1974.
 Run the code below.
 ```{r}
 mtcars$year <- 1974
 ```
 Notice how the number of variables changed in the workspace. You can
 also see the result by double clicking on mtcars in the workspace and
 examining the data in a table.
 To drop a variable, subset the data frame and select the variable you
 want to drop with a negative sign in front of it.
 ```{r}
 mtcars <- subset(mtcars, select = -year)
 ```
 Notice, we are back to 11 variables in the data frame.
 9. What do you think this code does? Run it to find out.
 ```{r}
 mtcars$year <- c(1973, 1974)
 ```
 Open the table of values to see what values year takes on.
 Drop the year variable from the data set.
 ```{r}
 ```
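 The assignment in step 9 relies on R recycling a shorter vector to fill a column. A minimal sketch with a made-up four-row data frame (toy is a hypothetical name used only here):
 ```r
 # Hypothetical four-row data frame, not part of the course data.
 toy <- data.frame(id = 1:4)
 # Two values are recycled across four rows because 4 is a multiple of 2.
 toy$year <- c(1973, 1974)
 print(toy$year)
 ```
 With mtcars, the same rule repeats the two years across all 32 rows.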
 10. Now you are going to get a preview of ifelse(). For those new
 to programming this example may be confusing. See if you can understand
 the code by running the commands one line at a time. Read the output and
 make sense of what the code is doing at each step.
 If you are having trouble don't worry, we will review the ifelse statement
 at the end of Lesson 3. You won't be quizzed on it, and it's not essential
 to keep going in this course. We just want you to try to get familiar with
 more code.
 ```{r}
 mtcars$wt
 cond <- mtcars$wt < 3
 cond
 mtcars$weight_class <- ifelse(cond, 'light', 'average')
 mtcars$weight_class
 cond <- mtcars$wt > 3.5
 mtcars$weight_class <- ifelse(cond, 'heavy', mtcars$weight_class)
 mtcars$weight_class
 ```
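 If the steps above are hard to follow on 32 cars, the same two-pass pattern can be tried on a tiny made-up vector first (wt_toy and class_toy are illustrative names only):
 ```r
 # Hypothetical weights in lb/1000, chosen so each class appears once.
 wt_toy <- c(2.1, 3.2, 4.0)
 # First pass: anything under 3 is 'light', everything else is 'average'.
 class_toy <- ifelse(wt_toy < 3, 'light', 'average')
 # Second pass: relabel values over 3.5 as 'heavy' and keep the rest unchanged.
 class_toy <- ifelse(wt_toy > 3.5, 'heavy', class_toy)
 print(class_toy)
 ```
 Passing the existing labels as the third argument is what lets the second call overwrite only the heavy cases.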
 You have some variables in your workspace or environment like 'cond' and
 efficient. You want to be careful that you don't bring in too much data
 into R at once since R will hold all the data in working memory. We have
 nothing to worry about here, but let's delete those variables from the
 workspace.
 ```{r}
 rm(cond)
 rm(efficient)
 ```
 Save this file if you haven't done so yet.
 You'll have the opportunity to create one Rmd file for the final project in
 this class and submit the Rmd file and knitted output (or HTML file). You'll
 need the knitr package to do that so let's install that now. **Uncomment** the
 following two lines of code and run them.
 ```{r}
 # install.packages('knitr', dependencies = TRUE)
 # library(knitr)
 ```
 Once you've installed knitr, **comment** out the two lines of code above.
 When you click the **Knit HTML** button a web page will be generated that
 includes both content (text and text formatting from Markdown) as well as
 the output of any embedded R code chunks within the document.
 You've reached the end of the file so now it's time to write some code to
 answer a question to continue on in Lesson 2.
 Which car(s) have an mpg (miles per gallon) greater than or equal to 30
 OR hp (horsepower) less than 60? Create an R chunk of code to answer the question.
 Once you have the answer, go to the [Udacity website](https://www.udacity.com/course/viewer#!/c-ud651/l-729069797/e-804129319/m-811719066) to continue with Lesson 2.
 Note: You use brackets around text followed by two parentheses to create a link.
 There must be no spaces between the brackets and the parentheses. Paste or type
 the link into the parentheses. This also works on the discussions!
 And if you want to see all of your HARD WORK from this file, click
 the **KNIT HTML** button now. (You may or may not need to restart R).
 # CONGRATULATIONS
 #### You'll be exploring data soon with your new knowledge of R.
--- a/lesson2/reddit.csv
+++ b/lesson2/reddit.csv
--- a/lesson2/stateData.csv
+++ b/lesson2/stateData.csv
@ -0,0 +1,51 @@
 "","state.abb","state.area","state.region","population","income","illiteracy","life.exp","murder","highSchoolGrad","frost","area"
 "Alabama","AL","51609","2","3615","3624","2.1","69.05","15.1","41.3","20","50708"
 "Alaska","AK","589757","4","365","6315","1.5","69.31","11.3","66.7","152","566432"
 "Arizona","AZ","113909","4","2212","4530","1.8","70.55","7.8","58.1","15","113417"
 "Arkansas","AR","53104","2","2110","3378","1.9","70.66","10.1","39.9","65","51945"
 "California","CA","158693","4","21198","5114","1.1","71.71","10.3","62.6","20","156361"
 "Colorado","CO","104247","4","2541","4884","0.7","72.06","6.8","63.9","166","103766"
 "Connecticut","CT","5009","1","3100","5348","1.1","72.48","3.1","56","139","4862"
 "Delaware","DE","2057","2","579","4809","0.9","70.06","6.2","54.6","103","1982"
 "Florida","FL","58560","2","8277","4815","1.3","70.66","10.7","52.6","11","54090"
 "Georgia","GA","58876","2","4931","4091","2","68.54","13.9","40.6","60","58073"
 "Hawaii","HI","6450","4","868","4963","1.9","73.6","6.2","61.9","0","6425"
 "Idaho","ID","83557","4","813","4119","0.6","71.87","5.3","59.5","126","82677"
 "Illinois","IL","56400","3","11197","5107","0.9","70.14","10.3","52.6","127","55748"
 "Indiana","IN","36291","3","5313","4458","0.7","70.88","7.1","52.9","122","36097"
 "Iowa","IA","56290","3","2861","4628","0.5","72.56","2.3","59","140","55941"
 "Kansas","KS","82264","3","2280","4669","0.6","72.58","4.5","59.9","114","81787"
 "Kentucky","KY","40395","2","3387","3712","1.6","70.1","10.6","38.5","95","39650"
 "Louisiana","LA","48523","2","3806","3545","2.8","68.76","13.2","42.2","12","44930"
 "Maine","ME","33215","1","1058","3694","0.7","70.39","2.7","54.7","161","30920"
 "Maryland","MD","10577","2","4122","5299","0.9","70.22","8.5","52.3","101","9891"
 "Massachusetts","MA","8257","1","5814","4755","1.1","71.83","3.3","58.5","103","7826"
 "Michigan","MI","58216","3","9111","4751","0.9","70.63","11.1","52.8","125","56817"
 "Minnesota","MN","84068","3","3921","4675","0.6","72.96","2.3","57.6","160","79289"
 "Mississippi","MS","47716","2","2341","3098","2.4","68.09","12.5","41","50","47296"
 "Missouri","MO","69686","3","4767","4254","0.8","70.69","9.3","48.8","108","68995"
 "Montana","MT","147138","4","746","4347","0.6","70.56","5","59.2","155","145587"
 "Nebraska","NE","77227","3","1544","4508","0.6","72.6","2.9","59.3","139","76483"
 "Nevada","NV","110540","4","590","5149","0.5","69.03","11.5","65.2","188","109889"
 "New Hampshire","NH","9304","1","812","4281","0.7","71.23","3.3","57.6","174","9027"
 "New Jersey","NJ","7836","1","7333","5237","1.1","70.93","5.2","52.5","115","7521"
 "New Mexico","NM","121666","4","1144","3601","2.2","70.32","9.7","55.2","120","121412"
 "New York","NY","49576","1","18076","4903","1.4","70.55","10.9","52.7","82","47831"
 "North Carolina","NC","52586","2","5441","3875","1.8","69.21","11.1","38.5","80","48798"
 "North Dakota","ND","70665","3","637","5087","0.8","72.78","1.4","50.3","186","69273"
 "Ohio","OH","41222","3","10735","4561","0.8","70.82","7.4","53.2","124","40975"
 "Oklahoma","OK","69919","2","2715","3983","1.1","71.42","6.4","51.6","82","68782"
 "Oregon","OR","96981","4","2284","4660","0.6","72.13","4.2","60","44","96184"
 "Pennsylvania","PA","45333","1","11860","4449","1","70.43","6.1","50.2","126","44966"
 "Rhode Island","RI","1214","1","931","4558","1.3","71.9","2.4","46.4","127","1049"
 "South Carolina","SC","31055","2","2816","3635","2.3","67.96","11.6","37.8","65","30225"
 "South Dakota","SD","77047","3","681","4167","0.5","72.08","1.7","53.3","172","75955"
 "Tennessee","TN","42244","2","4173","3821","1.7","70.11","11","41.8","70","41328"
 "Texas","TX","267339","2","12237","4188","2.2","70.9","12.2","47.4","35","262134"
 "Utah","UT","84916","4","1203","4022","0.6","72.9","4.5","67.3","137","82096"
 "Vermont","VT","9609","1","472","3907","0.6","71.64","5.5","57.1","168","9267"
 "Virginia","VA","40815","2","4981","4701","1.4","70.08","9.5","47.8","85","39780"
 "Washington","WA","68192","4","3559","4864","0.6","71.72","4.3","63.5","32","66570"
 "West Virginia","WV","24181","2","1799","3617","1.4","69.48","6.7","41.6","100","24070"
 "Wisconsin","WI","56154","3","4589","4468","0.7","72.48","3","54.5","149","54464"
 "Wyoming","WY","97914","4","376","4566","0.6","70.29","6.9","62.9","173","97203"
--- a/lesson3/lesson3_student.rmd
+++ b/lesson3/lesson3_student.rmd
@ -0,0 +1,283 @@
 Lesson 3
 ========================================================
 ***
 ### What to Do First?
 Notes:
 ***
 ### Pseudo-Facebook User Data
 Notes:
 ```{r Pseudo-Facebook User Data}
 ```
 ***
 ### Histogram of Users' Birthdays
 Notes:
 ```{r Histogram of Users\' Birthdays}
 install.packages('ggplot2')
 library(ggplot2)
 ```
 ***
 #### What are some things that you notice about this histogram?
 Response:
 ***
 ### Moira's Investigation
 Notes:
 ***
 ### Estimating Your Audience Size
 Notes:
 ***
 #### Think about a time when you posted a specific message or shared a photo on Facebook. What was it?
 Response:
 #### How many of your friends do you think saw that post?
 Response:
 #### Think about what percent of your friends on Facebook see any posts or comments that you make in a month. What percent do you think that is?
 Response:
 ***
 ### Perceived Audience Size
 Notes:
 ***
 ### Faceting
 Notes:
 ```{r Faceting}
 ```
 #### Let’s take another look at our plot. What stands out to you here?
 Response:
 ***
 ### Be Skeptical - Outliers and Anomalies
 Notes:
 ***
 ### Moira's Outlier
 Notes:
 #### Which case do you think applies to Moira’s outlier?
 Response:
 ***
 ### Friend Count
 Notes:
 #### What code would you enter to create a histogram of friend counts?
 ```{r Friend Count}
 ```
 #### How is this plot similar to Moira's first plot?
 Response:
 ***
 ### Limiting the Axes
 Notes:
 ```{r Limiting the Axes}
 ```
 ### Exploring with Bin Width
 Notes:
 ***
 ### Adjusting the Bin Width
 Notes:
 ### Faceting Friend Count
 ```{r Faceting Friend Count}
 # What code would you add to facet the histogram by gender?
 # Add it to the code below.
 qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000),
                     breaks = seq(0, 1000, 50))
 ```
 ***
 ### Omitting NA Values
 Notes:
 ```{r Omitting NA Values}
 ```
 ***
 ### Statistics 'by' Gender
 Notes:
 ```{r Statistics \'by\' Gender}
 ```
 #### Who on average has more friends: men or women?
 Response:
 #### What's the difference between the median friend count for women and men?
 Response:
 #### Why would the median be a better measure than the mean?
 Response:
 ***
 ### Tenure
 Notes:
 ```{r Tenure}
 ```
 ***
 #### How would you create a histogram of tenure by year?
 ```{r Tenure Histogram by Year}
 ```
 ***
 ### Labeling Plots
 Notes:
 ```{r Labeling Plots}
 ```
 ***
 ### User Ages
 Notes:
 ```{r User Ages}
 ```
 #### What do you notice?
 Response:
 ***
 ### The Spread of Memes
 Notes:
 ***
 ### Lada's Money Bag Meme
 Notes:
 ***
 ### Transforming Data
 Notes:
 ***
 ### Add a Scaling Layer
 Notes:
 ```{r Add a Scaling Layer}
 ```
 ***
 ### Frequency Polygons
 ```{r Frequency Polygons}
 ```
 ***
 ### Likes on the Web
 Notes:
 ```{r Likes on the Web}
 ```
 ***
 ### Box Plots
 Notes:
 ```{r Box Plots}
 ```
 #### Adjust the code to focus on users who have friend counts between 0 and 1000.
 ```{r}
 ```
 ***
 ### Box Plots, Quartiles, and Friendships
 Notes:
 ```{r Box Plots, Quartiles, and Friendships}
 ```
 #### On average, who initiated more friendships in our sample: men or women?
 Response:
 #### Write about some ways that you can verify your answer.
 Response:
 ```{r Friend Requests by Gender}
 ```
 Response:
 ***
 ### Getting Logical
 Notes:
 ```{r Getting Logical}
 ```
 Response:
 ***
 ### Analyzing One Variable
 Reflection:
 ***
 Click **Knit HTML** to see all of your hard work and to have an HTML
 page of this lesson, your answers, and your notes!
--- a/lesson3/pseudo_facebook.tsv
+++ b/lesson3/pseudo_facebook.tsv
--- a/lesson4/correlation_images.jpeg
+++ b/lesson4/correlation_images.jpeg
--- a/lesson4/lesson4_student.rmd
+++ b/lesson4/lesson4_student.rmd
@ -0,0 +1,268 @@
 Lesson 4
 ========================================================
 ***
 ### Scatterplots and Perceived Audience Size
 Notes:
 ***
 ### Scatterplots
 Notes:
 ```{r Scatterplots}
 ```
 ***
 #### What are some things that you notice right away?
 Response:
 ***
 ### ggplot Syntax
 Notes:
 ```{r ggplot Syntax}
 ```
 ***
 ### Overplotting
 Notes:
 ```{r Overplotting}
 ```
 #### What do you notice in the plot?
 Response:
 ***
 ### Coord_trans()
 Notes:
 ```{r Coord_trans()}
 ```
 #### Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!
 ```{r}
 ```
 #### What do you notice?
 ***
 ### Alpha and Jitter
 Notes:
 ```{r Alpha and Jitter}
 ```
 ***
 ### Overplotting and Domain Knowledge
 Notes:
 ***
 ### Conditional Means
 Notes:
 ```{r Conditional Means}
 ```
 Create your plot!
 ```{r Conditional Means Plot}
 ```
 ***
 ### Overlaying Summaries with Raw Data
 Notes:
 ```{r Overlaying Summaries with Raw Data}
 ```
 #### What are some of your observations of the plot?
 Response:
 ***
 ### Moira: Histogram Summary and Scatterplot
 See the Instructor Notes of this video to download Moira's paper on perceived audience size and to see the final plot.
 Notes:
 ***
 ### Correlation
 Notes:
 ```{r Correlation}
 ```
 Look up the documentation for the cor.test function.
 What's the correlation between age and friend count? Round to three decimal places.
 Response:
 ***
 ### Correlation on Subsets
 Notes:
 ```{r Correlation on Subsets}
 with(                 , cor.test(age, friend_count))
 ```
 ***
 ### Correlation Methods
 Notes:
 ***
 ### Create Scatterplots
 Notes:
 ```{r}
 ```
 ***
 ### Strong Correlations
 Notes:
 ```{r Strong Correlations}
 ```
 What's the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
 ```{r Correlation Calculation}
 ```
 Response:
 ***
 ### Moira on Correlation
 Notes:
 ***
 ### More Caution with Correlation
 Notes:
 ```{r More Caution With Correlation}
 install.packages('alr3')
 library(alr3)
 ```
 Create your plot!
 ```{r Temp vs Month}
 ```
 ***
 ### Noisy Scatterplots
 a. Take a guess for the correlation coefficient for the scatterplot.
 b. What is the actual correlation of the two variables?
 (Round to the thousandths place)
 ```{r Noisy Scatterplots}
 ```
 ***
 ### Making Sense of Data
 Notes:
 ```{r Making Sense of Data}
 ```
 ***
 ### A New Perspective
 What do you notice?
 Response:
 Watch the solution video and check out the Instructor Notes!
 Notes:
 ***
 ### Understanding Noise: Age to Age Months
 Notes:
 ```{r Understanding Noise: Age to Age Months}
 ```
 ***
 ### Age with Months Means
 ```{r Age with Months Means}
 ```
 Programming Assignment
 ```{r Programming Assignment}
 ```
 ***
 ### Noise in Conditional Means
 ```{r Noise in Conditional Means}
 ```
 ***
 ### Smoothing Conditional Means
 Notes:
 ```{r Smoothing Conditional Means}
 ```
 ***
 ### Which Plot to Choose?
 Notes:
 ***
 ### Analyzing Two Variables
 Reflection:
 ***
 Click **Knit HTML** to see all of your hard work and to have an HTML
 page of this lesson, your answers, and your notes!
--- a/lesson5/lesson5_student.rmd
+++ b/lesson5/lesson5_student.rmd
@ -0,0 +1,253 @@
 Lesson 5
 ========================================================
 ### Multivariate Data
 Notes:
 ***
 ### Moira Perceived Audience Size Colored by Age
 Notes:
 ***
 ### Third Qualitative Variable
 Notes:
 ```{r Third Qualitative Variable}
 ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) + geom_boxplot()
 ```
 ***
 ### Plotting Conditional Summaries
 Notes:
 ```{r Plotting Conditional Summaries}
 ```
 ***
 ### Thinking in Ratios
 Notes:
 ***
 ### Wide and Long Format
 Notes:
 ***
 ### Reshaping Data
 Notes:
 ```{r}
 install.packages('reshape2')
 library(reshape2)
 ```
 ***
 ### Ratio Plot
 Notes:
 ```{r Ratio Plot}
 ```
 ***
 ### Third Quantitative Variable
 Notes:
 ```{r Third Quantitative Variable}
 ```
 ***
 ### Cut a Variable
 Notes:
 ```{r Cut a Variable}
 ```
 ***
 ### Plotting it All Together
 Notes:
 ```{r Plotting it All Together}
 ```
 ***
 ### Plot the Grand Mean
 Notes:
 ```{r Plot the Grand Mean}
 ```
 ***
 ### Friending Rate
 Notes:
 ```{r Friending Rate}
 ```
 ***
 ### Friendships Initiated
 Notes:
 What is the median friend rate?
 What is the maximum friend rate?
 ```{r Friendships Initiated}
 ```
 ***
 ### Bias-Variance Tradeoff Revisited
 Notes:
 ```{r Bias-Variance Tradeoff Revisited}
 ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_line(aes(color = year_joined.bucket),
            stat = 'summary',
            fun.y = mean)
 ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)
 ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)
 ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)
 ```
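 The x aesthetics above, such as 7 * round(tenure / 7), snap each tenure value to the nearest multiple of the bin size before the mean is taken. A small sketch on made-up values (tenure_toy is a hypothetical vector, not taken from pf):
 ```r
 # Hypothetical tenure values in days, chosen only for illustration.
 tenure_toy <- c(1, 3, 4, 10, 11)
 # Round each value to the nearest multiple of 7 days.
 binned <- 7 * round(tenure_toy / 7)
 print(binned)
 ```
 Larger bins pool more users into each plotted point, which smooths the line at the cost of hiding short-term variation.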
 ***
 ### Sean's NFL Fan Sentiment Study
 Notes:
 ***
 ### Introducing the Yogurt Data Set
 Notes:
 ***
 ### Histograms Revisited
 Notes:
 ```{r Histograms Revisited}
 ```
 ***
 ### Number of Purchases
 Notes:
 ```{r Number of Purchases}
 ```
 ***
 ### Prices over Time
 Notes:
 ```{r Prices over Time}
 ```
 ***
 ### Sampling Observations
 Notes:
 ***
 ### Looking at Samples of Households
 ```{r Looking at Sample of Households}
 ```
 ***
 ### The Limits of Cross Sectional Data
 Notes:
 ***
 ### Many Variables
 Notes:
 ***
 ### Scatterplot Matrix
 Notes:
 ***
 ### Even More Variables
 Notes:
 ***
 ### Heat Maps
 Notes:
 ```{r}
 nci <- read.table("nci.tsv")
 colnames(nci) <- c(1:64)
 ```
 ```{r}
 nci.long.samp <- melt(as.matrix(nci[1:200,]))
 names(nci.long.samp) <- c("gene", "case", "value")
 head(nci.long.samp)
 ggplot(aes(y = gene, x = case, fill = value),
  data = nci.long.samp) +
  geom_tile() +
  scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
 ```
 ***
 ### Analyzing Three or More Variables
 Reflection:
 ***
 Click **Knit HTML** to see all of your hard work and to have an HTML
 page of this lesson, your answers, and your notes!
--- a/lesson5/nci.tsv
+++ b/lesson5/nci.tsv
--- a/lesson5/scatterplotMatrix.pdf
+++ b/lesson5/scatterplotMatrix.pdf
--- a/lesson5/yogurt.csv
+++ b/lesson5/yogurt.csv
--- a/lesson6/diamondsbig.csv
+++ b/lesson6/diamondsbig.csv
--- a/lesson6/ggpairs_landscape.pdf
+++ b/lesson6/ggpairs_landscape.pdf
--- a/lesson6/lesson6_student.rmd
+++ b/lesson6/lesson6_student.rmd
@ -0,0 +1,289 @@
 Lesson 6
 ========================================================
 ### Welcome
 Notes:
 ***
 ### Scatterplot Review
 ```{r Scatterplot Review}
 ```
 ***
 ### Price and Carat Relationship
 Response:
 ***
 ### Frances Gerety
 Notes:
 #### A diamond is
 ***
 ### The Rise of Diamonds
 Notes:
 ***
 ### ggpairs Function
 Notes:
 ```{r ggpairs Function}
 # install these if necessary
 install.packages('GGally')
 install.packages('scales')
 install.packages('memisc')
 install.packages('lattice')
 install.packages('MASS')
 install.packages('car')
 install.packages('reshape')
 install.packages('plyr')
 # load the ggplot graphics package and the others
 library(ggplot2)
 library(GGally)
 library(scales)
 library(memisc)
 # sample 10,000 diamonds from the data set
 set.seed(20022012)
 diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
 ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
 ```
 What are some things you notice in the ggpairs output?
 Response:
 ***
 ### The Demand of Diamonds
 Notes:
 ```{r The Demand of Diamonds}
 ```
 ***
 ### Connecting Demand and Price Distributions
 Notes:
 ***
 ### Scatterplot Transformation
 ```{r Scatterplot Transformation}
 ```
 ### Create a new function to transform the carat variable
 ```{r cuberoot transformation}
 cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)
 ```
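 A transformation passed to trans_new() needs a transform and an inverse that undo each other. The base-R sketch below checks that property for the cube-root pair without loading scales (cube_root, cube, and carat_toy are hypothetical names used only for this check):
 ```r
 # Stand-ins for the two functions given to trans_new() above.
 cube_root <- function(x) x^(1/3)
 cube <- function(x) x^3
 # Hypothetical carat values, matching the axis breaks used in this lesson.
 carat_toy <- c(0.2, 0.5, 1, 2, 3)
 # Applying the inverse after the transform should recover the input,
 # up to floating-point tolerance.
 round_trip <- cube(cube_root(carat_toy))
 print(all.equal(round_trip, carat_toy))
 ```
 The inverse is what lets the axis show labels in the original carat units rather than in cube-root units.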
 #### Use the cuberoot_trans function
 ```{r Use cuberoot_trans}
 ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
 ```
 ***
 ### Overplotting Revisited
 ```{r Sort and Head Tables}
 ```
 ```{r Overplotting Revisited}
 ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
 ```
 ***
 ### Other Qualitative Factors
 Notes:
 ***
 ### Price vs. Carat and Clarity
 Alter the code below.
 ```{r Price vs. Carat and Clarity}
 # install and load the RColorBrewer package
 install.packages('RColorBrewer')
 library(RColorBrewer)
 ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
 ```
 ***
 ### Clarity and Price
 Response:
 ***
 ### Price vs. Carat and Cut
 Alter the code below.
 ```{r Price vs. Carat and Cut}
 ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Clarity', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
 ```
 ***
 ### Cut and Price
 Response:
 ***
 ### Price vs. Carat and Color
 Alter the code below.
 ```{r Price vs. Carat and Color}
 ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
 ```
 ***
 ### Color and Price
 Response:
 ***
 ### Linear Models in R
 Notes:
 Response:
 ***
 ### Building the Linear Model
 Notes:
 ```{r Building the Linear Model}
 m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
 m2 <- update(m1, ~ . + carat)
 m3 <- update(m2, ~ . + cut)
 m4 <- update(m3, ~ . + color)
 m5 <- update(m4, ~ . + clarity)
 mtable(m1, m2, m3, m4, m5)
 ```
 Notice how adding cut to our model does not help explain much of the variance
 in the price of diamonds. This fits with our exploration earlier.
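 To check this numerically, one option is to compare R-squared across the nested models. This is a sketch that assumes the models are fit on ggplot2's `diamonds`, as in the chunk above; the names `fit1`, `fit2`, `fit3`, and `r2` are illustrative and chosen so that `m1` to `m5` are left untouched.
 ```{r compare r squared}
 library(ggplot2)  # provides the diamonds data set
 # Refit the first three nested models under separate names
 fit1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
 fit2 <- update(fit1, ~ . + carat)
 fit3 <- update(fit2, ~ . + cut)
 # Collect R-squared for each model
 r2 <- sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3),
              function(m) summary(m)$r.squared)
 round(r2, 3)
 # Gain in R-squared from each added term
 diff(r2)
 ```
 A small last value in `diff(r2)` corresponds to cut adding little explanatory power once carat is in the model.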
 ***
 ### Model Problems
 Video Notes:
 Research:
 (Take some time to come up with 2-4 problems for the model)
 (You should spend 10-20 min on this)
 Response:
 ***
 ### A Bigger, Better Data Set
 Notes:
 ```{r A Bigger, Better Data Set}
 install.packages('bitops')
 install.packages('RCurl')
 library('bitops')
 library('RCurl')
 diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
 load(rawConnection(diamondsurl))
 ```
 The code used to obtain the data is available here:
 https://github.com/solomonm/diamonds-data
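 `load()` restores objects under the names they were saved with and returns those names invisibly, which is useful when you do not know what a `.Rda` file contains. The sketch below uses a temporary file, so no download is needed; `toy`, `tmp`, and `env` are illustrative names, and nothing here is specific to the BigDiamonds file.
 ```{r inspect loaded objects}
 # Save a small data frame to a temporary .Rda file
 tmp <- tempfile(fileext = ".Rda")
 toy <- data.frame(carat = c(0.3, 1.0), price = c(500, 5000))
 save(toy, file = tmp)
 # Load into a fresh environment so the workspace is not overwritten
 env <- new.env()
 loaded_names <- load(tmp, envir = env)
 loaded_names
 # Look at the restored object
 str(env[[loaded_names[1]]])
 ```
 The same pattern should work with a `rawConnection()` in place of the file path, since `load()` accepts connections.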
 ### Building a Model Using the Big Diamonds Data Set
 Notes:
 ```{r Building a Model Using the Big Diamonds Data Set}
 ```
 ***
 ### Predictions
 Example Diamond from BlueNile:
 Round 1.00 Very Good I VS1 $5,601
 ```{r}
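 # Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
 # "V.Good" matches the cut labels in the BigDiamonds data; a model fit on ggplot2's diamonds expects "Very Good".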
 thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
 modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
 ```
 Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
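 Because the model's response is `log(price)`, `predict()` returns the estimate and its interval on the log scale; exponentiating brings them back to dollars. The sketch below is self-contained, so it fits the five-term model on ggplot2's `diamonds` (an assumption, not the BigDiamonds model from the lesson), which is why it uses the level name "Very Good". The names `fit`, `one_diamond`, `log_estimate`, and `dollar_estimate` are illustrative.
 ```{r back transform prediction}
 library(ggplot2)  # provides the diamonds data set
 # Same formula as m5, fit here on ggplot2's diamonds
 fit <- lm(I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity,
           data = diamonds)
 # Build the new observation with the same ordered factor levels as the data
 one_diamond <- data.frame(
   carat = 1.00,
   cut = factor("Very Good", levels = levels(diamonds$cut), ordered = TRUE),
   color = factor("I", levels = levels(diamonds$color), ordered = TRUE),
   clarity = factor("VS1", levels = levels(diamonds$clarity), ordered = TRUE))
 # Prediction and 95% prediction interval on the log scale
 log_estimate <- predict(fit, newdata = one_diamond,
                         interval = "prediction", level = .95)
 # Back-transform to dollars
 dollar_estimate <- exp(log_estimate)
 dollar_estimate
 ```
 The `fit` column is the point estimate, and `lwr` and `upr` bound the prediction interval for a single diamond with these characteristics.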
 ***
 ### Final Thoughts
 Notes:
 ***
 Click **Knit HTML** to see all of your hard work and to have an html
 page of this lesson, your answers, and your notes!