udacity_eda/lesson6/lesson6_student.rmd
2018-05-31 23:04:57 -08:00

314 lines
8.5 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Lesson 6
========================================================
### Welcome
Notes:
```{r Setup}
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
#install.packages('RColorBrewer', dependencies = TRUE)
#install.packages('bitops')
#install.packages('RCurl')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)
data(diamonds)
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
### Scatterplot Review
```{r Scatterplot Review}
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
xlim(0, quantile(diamonds$carat, 0.99)) +
ylim(0, quantile(diamonds$price, 0.99))
```
***
### Price and Carat Relationship
Response:
The price increases as the carat increases but it also gains more variability
### Frances Gerety
Notes:
#### A diamonds is
Forever
### The Rise of Diamonds
Notes:
***
### ggpairs Function
Notes:
```{r ggpairs Function}
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
```
What are some things you notice in the ggpairs output?
Response:
There seems to be some Clarity and Colors that draw a higher price but besides that the size seems to have the largest correlation.
### The Demand of Diamonds
Notes:
```{r The Demand of Diamonds}
plot1 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram() +
ggtitle('Price')
plot2 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram() +
scale_x_log10() +
ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol = 2)
```
***
### Connecting Demand and Price Distributions
Notes:
There are 2 categories of diamond buyers that are looking for different types
### Scatterplot Transformation
```{r Scatterplot Transformation}
```
### Create a new function to transform the carat variable
```{r cuberoot transformation}
```
#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Overplotting Revisited
```{r Sort and Head Tables}
head(sort(table(diamonds$carat), decreasing = T))
head(sort(table(diamonds$price), decreasing = T))
```
```{r Overplotting Revisited}
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 1/2, size = 3/4, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### Other Qualitative Factors
Notes:
***
### Price vs. Carat and Clarity
Alter the code below.
```{r Price vs. Carat and Clarity}
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter', aes(color = diamonds$clarity)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Clarity and Price
Response:
No, I don't think it explains the difference because the clarity differential seems to parallel the price/carat relationship.
This question is worded poorly. For the answer they wanted the question should be "Do you think clarity explains some of the variance in price for a given carat?"
### Price vs. Carat and Cut
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
```
***
### Cut and Price
Response:
Some, but not much. The better cuts tend to have a higher price but some of the worse cuts have a higher price than the better cuts.
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
```
***
### Color and Price
Response:
Yes, same reason as Clarity
### Linear Models in R
Notes:
Response:
***
### Building the Linear Model
Notes:
```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with out exploration earlier.
***
### Model Problems
Video Notes:
Research:
(Take some time to come up with 2-4 problems for the model)
(You should 10-20 min on this)
Response:
The main problem I see is the problem of supply and demand over time. As demand goes up and supply goes down the price goes up, as supply goes up or demand goes down the price goes down. Another problem is where on the supply chain the diamond is. Is it a rough cut, or polished? Is it wholesale or go through a distributor?
### A Bigger, Better Data Set
Notes:
```{r A Bigger, Better Data Set}
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data
## Building a Model Using the Big Diamonds Data Set
Notes:
```{r Building a Model Using the Big Diamonds Data Set}
```
***
## Predictions
Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601
```{r}
#Be sure youve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
The 95% CI is very large but does encompass the correct value. The fit is relatively far off from the actual value at over a 10% difference between the actual value and predicted value.
## Final Thoughts
Notes:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!