udacity_eda/lesson6/lesson6_student.rmd
2018-04-17 19:56:59 -08:00

290 lines
6.4 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Lesson 6
========================================================
### Welcome
Notes:
***
### Scatterplot Review
```{r Scatterplot Review}
```
***
### Price and Carat Relationship
Response:
***
### Frances Gerety
Notes:
#### A diamonds is
***
### The Rise of Diamonds
Notes:
***
### ggpairs Function
Notes:
```{r ggpairs Function}
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
```
What are some things you notice in the ggpairs output?
Response:
***
### The Demand of Diamonds
Notes:
```{r The Demand of Diamonds}
```
***
### Connecting Demand and Price Distributions
Notes:
***
### Scatterplot Transformation
```{r Scatterplot Transformation}
```
### Create a new function to transform the carat variable
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Overplotting Revisited
```{r Sort and Head Tables}
```
```{r Overplotting Revisited}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Other Qualitative Factors
Notes:
***
### Price vs. Carat and Clarity
Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Clarity and Price
Response:
***
### Price vs. Carat and Cut
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Cut and Price
Response:
***
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = Cut, reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
```
***
### Color and Price
Response:
***
### Linear Models in R
Notes:
Response:
***
### Building the Linear Model
Notes:
```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```
Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with out exploration earlier.
***
### Model Problems
Video Notes:
Research:
(Take some time to come up with 2-4 problems for the model)
(You should 10-20 min on this)
Response:
***
### A Bigger, Better Data Set
Notes:
```{r A Bigger, Better Data Set}
install.package('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
```
The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data
## Building a Model Using the Big Diamonds Data Set
Notes:
```{r Building a Model Using the Big Diamonds Data Set}
```
***
## Predictions
Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601
```{r}
#Be sure youve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
***
## Final Thoughts
Notes:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!