Lesson 6 ======================================================== ### Welcome Notes: ```{r Setup} # install these if necessary #install.packages('GGally') #install.packages('scales') #install.packages('memisc') #install.packages('lattice') #install.packages('MASS') #install.packages('car') #install.packages('reshape') #install.packages('plyr') #install.packages('RColorBrewer', dependencies = TRUE) #install.packages('bitops') #install.packages('RCurl') # load the ggplot graphics package and the others library(ggplot2) library(GGally) library(scales) library(memisc) library(gridExtra) library(RColorBrewer) library(bitops) library(RCurl) data(diamonds) cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), inverse = function(x) x^3) ``` ### Scatterplot Review ```{r Scatterplot Review} ggplot(aes(x = carat, y = price), data = diamonds) + geom_point() + xlim(0, quantile(diamonds$carat, 0.99)) + ylim(0, quantile(diamonds$price, 0.99)) ``` *** ### Price and Carat Relationship Response: The price increases as the carat increases but it also gains more variability ### Frances Gerety Notes: #### A diamonds is Forever ### The Rise of Diamonds Notes: *** ### ggpairs Function Notes: ```{r ggpairs Function} # sample 10,000 diamonds from the data set set.seed(20022012) diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ] ggpairs(diamond_samp, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.')))) ``` What are some things you notice in the ggpairs output? Response: There seems to be some Clarity and Colors that draw a higher price but besides that the size seems to have the largest correlation. ### The Demand of Diamonds Notes: ```{r The Demand of Diamonds} plot1 <- ggplot(aes(x = price), data = diamonds) + geom_histogram() + ggtitle('Price') plot2 <- ggplot(aes(x = price), data = diamonds) + geom_histogram() + scale_x_log10() + ggtitle('Price (log10)') grid.arrange(plot1, plot2, ncol = 2) ``` *** ### Connecting Demand and Price Distributions Notes: There are 2 categories of diamond buyers that are looking for different types ### Scatterplot Transformation ```{r Scatterplot Transformation} ``` ### Create a new function to transform the carat variable ```{r cuberoot transformation} ``` #### Use the cuberoot_trans function ```{r Use cuberoot_trans} ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat') ``` *** ### Overplotting Revisited ```{r Sort and Head Tables} head(sort(table(diamonds$carat), decreasing = T)) head(sort(table(diamonds$price), decreasing = T)) ``` ```{r Overplotting Revisited} ggplot(aes(carat, price), data = diamonds) + geom_point(alpha = 1/2, size = 3/4, position = 'jitter') + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat') ``` *** ### Other Qualitative Factors Notes: *** ### Price vs. Carat and Clarity Alter the code below. ```{r Price vs. Carat and Clarity} ggplot(aes(x = carat, y = price), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter', aes(color = diamonds$clarity)) + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Clarity', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Clarity') ``` *** ### Clarity and Price Response: No, I don't think it explains the difference because the clarity differential seems to parallel the price/carat relationship. This question is worded poorly. For the answer they wanted the question should be "Do you think clarity explains some of the variance in price for a given carat?" ### Price vs. Carat and Cut Alter the code below. ```{r Price vs. Carat and Cut} ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Cut', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Cut') ``` *** ### Cut and Price Response: Some, but not much. The better cuts tend to have a higher price but some of the worse cuts have a higher price than the better cuts. ### Price vs. Carat and Color Alter the code below. ```{r Price vs. Carat and Color} ggplot(aes(x = carat, y = price, color = color), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Color', reverse = FALSE, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Color') ``` *** ### Color and Price Response: Yes, same reason as Clarity ### Linear Models in R Notes: Response: *** ### Building the Linear Model Notes: ```{r Building the Linear Model} m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds) m2 <- update(m1, ~ . + carat) m3 <- update(m2, ~ . + cut) m4 <- update(m3, ~ . + color) m5 <- update(m4, ~ . + clarity) mtable(m1, m2, m3, m4, m5, sdigits = 3) ``` Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier. *** ### Model Problems Video Notes: Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this) Response: The main problem I see is the problem of supply and demand over time. As demand goes up and supply goes down the price goes up, as supply goes up or demand goes down the price goes down. Another problem is where on the supply chain the diamond is. Is it a rough cut, or polished? Is it wholesale or go through a distributor? ### A Bigger, Better Data Set Notes: ```{r A Bigger, Better Data Set} #diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda") #load(rawConnection(diamondsurl)) load("BigDiamonds.rda") m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig) m2 <- update(m1, ~ . + carat) m3 <- update(m2, ~ . + cut) m4 <- update(m3, ~ . + color) m5 <- update(m4, ~ . + clarity) mtable(m1, m2, m3, m4, m5, sdigits = 3) ``` The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data ## Building a Model Using the Big Diamonds Data Set Notes: ```{r Building a Model Using the Big Diamonds Data Set} ``` *** ## Predictions Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601 ```{r} #Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace. thisDiamond = data.frame(carat = 1.00, cut = "V.Good", color = "I", clarity="VS1") modelEstimate = predict(m5, newdata = thisDiamond, interval="prediction", level = .95) exp(modelEstimate) ``` Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI. The 95% CI is very large but does encompass the correct value. The fit is relatively far off from the actual value at over a 10% difference between the actual value and predicted value. ## Final Thoughts Notes: *** Click **KnitHTML** to see all of your hard work and to have an html page of this lesson, your answers, and your notes!