Lesson 6 complete

This commit is contained in:
Dusty.P 2018-05-31 23:04:57 -08:00
parent 9945ad26ef
commit de414a2e37
2 changed files with 74 additions and 66 deletions

BIN
lesson6/BigDiamonds.Rda Normal file

Binary file not shown.

View File

@ -4,13 +4,39 @@ Lesson 6
### Welcome
Notes:
***
```{r Setup}
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
#install.packages('RColorBrewer', dependencies = TRUE)
#install.packages('bitops')
#install.packages('RCurl')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)
data(diamonds)
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
### Scatterplot Review
```{r Scatterplot Review}
library(ggplot2)
data(diamonds)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
xlim(0, quantile(diamonds$carat, 0.99)) +
@ -39,22 +65,6 @@ Notes:
Notes:
```{r ggpairs Function}
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
@ -72,8 +82,6 @@ There seems to be some Clarity and Colors that draw a higher price but besides t
Notes:
```{r The Demand of Diamonds}
library(gridExtra)
plot1 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram() +
ggtitle('Price')
@ -103,8 +111,6 @@ There are 2 categories of diamond buyers that are looking for different types
### Create a new function to transform the carat variable
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
#### Use the cuberoot_trans function
@ -123,13 +129,14 @@ ggplot(aes(carat, price), data = diamonds) +
### Overplotting Revisited
```{r Sort and Head Tables}
head(sort(table(diamonds$carat), decreasing = T))
head(sort(table(diamonds$price), decreasing = T))
```
```{r Overplotting Revisited}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
geom_point(alpha = 1/2, size = 3/4, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
@ -148,12 +155,8 @@ Notes:
Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
geom_point(alpha = 0.5, size = 1, position = 'jitter', aes(color = diamonds$clarity)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
@ -169,39 +172,18 @@ ggplot(aes(x = carat, y = price), data = diamonds) +
### Clarity and Price
Response:
***
No, I don't think it explains the difference because the clarity differential seems to parallel the price/carat relationship.
This question is worded poorly. For the answer they wanted the question should be "Do you think clarity explains some of the variance in price for a given carat?"
### Price vs. Carat and Cut
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Cut and Price
Response:
***
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = Cut, reverse = T,
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
@ -212,10 +194,33 @@ ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
***
### Cut and Price
Response:
Some, but not much. The better cuts tend to have a higher price but some of the worse cuts have a higher price than the better cuts.
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
```
***
### Color and Price
Response:
***
Yes, same reason as Clarity
### Linear Models in R
Notes:
@ -233,7 +238,7 @@ m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
Notice how adding cut to our model does not help explain much of the variance
@ -250,19 +255,21 @@ Research:
Response:
***
The main problem I see is the problem of supply and demand over time. As demand goes up and supply goes down the price goes up, as supply goes up or demand goes down the price goes down. Another problem is where on the supply chain the diamond is. Is it a rough cut, or polished? Is it wholesale or go through a distributor?
### A Bigger, Better Data Set
Notes:
```{r A Bigger, Better Data Set}
install.package('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
The code used to obtain the data is available here:
@ -289,11 +296,12 @@ thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
***
The 95% CI is very large but does encompass the correct value. The fit is relatively far off from the actual value at over a 10% difference between the actual value and predicted value.
## Final Thoughts
Notes: