Lesson 6 ======================================================== ### Welcome Notes: *** ### Scatterplot Review ```{r Scatterplot Review} library(ggplot2) data(diamonds) ggplot(aes(x = carat, y = price), data = diamonds) + geom_point() + xlim(0, quantile(diamonds$carat, 0.99)) + ylim(0, quantile(diamonds$price, 0.99)) ``` *** ### Price and Carat Relationship Response: The price increases as the carat increases but it also gains more variability ### Frances Gerety Notes: #### A diamonds is Forever ### The Rise of Diamonds Notes: *** ### ggpairs Function Notes: ```{r ggpairs Function} # install these if necessary #install.packages('GGally') #install.packages('scales') #install.packages('memisc') #install.packages('lattice') #install.packages('MASS') #install.packages('car') #install.packages('reshape') #install.packages('plyr') # load the ggplot graphics package and the others library(ggplot2) library(GGally) library(scales) library(memisc) # sample 10,000 diamonds from the data set set.seed(20022012) diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ] ggpairs(diamond_samp, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.')))) ``` What are some things you notice in the ggpairs output? Response: There seems to be some Clarity and Colors that draw a higher price but besides that the size seems to have the largest correlation. ### The Demand of Diamonds Notes: ```{r The Demand of Diamonds} library(gridExtra) plot1 <- ggplot(aes(x = price), data = diamonds) + geom_histogram() + ggtitle('Price') plot2 <- ggplot(aes(x = price), data = diamonds) + geom_histogram() + scale_x_log10() + ggtitle('Price (log10)') grid.arrange(plot1, plot2, ncol = 2) ``` *** ### Connecting Demand and Price Distributions Notes: There are 2 categories of diamond buyers that are looking for different types ### Scatterplot Transformation ```{r Scatterplot Transformation} ``` ### Create a new function to transform the carat variable ```{r cuberoot transformation} cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), inverse = function(x) x^3) ``` #### Use the cuberoot_trans function ```{r Use cuberoot_trans} ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat') ``` *** ### Overplotting Revisited ```{r Sort and Head Tables} ``` ```{r Overplotting Revisited} ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat') ``` *** ### Other Qualitative Factors Notes: *** ### Price vs. Carat and Clarity Alter the code below. ```{r Price vs. Carat and Clarity} # install and load the RColorBrewer package install.packages('RColorBrewer') library(RColorBrewer) ggplot(aes(x = carat, y = price), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Clarity', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Clarity') ``` *** ### Clarity and Price Response: *** ### Price vs. Carat and Cut Alter the code below. ```{r Price vs. Carat and Cut} ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Clarity', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Clarity') ``` *** ### Cut and Price Response: *** ### Price vs. Carat and Color Alter the code below. ```{r Price vs. Carat and Color} ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = Cut, reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Cut') ``` *** ### Color and Price Response: *** ### Linear Models in R Notes: Response: *** ### Building the Linear Model Notes: ```{r Building the Linear Model} m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds) m2 <- update(m1, ~ . + carat) m3 <- update(m2, ~ . + cut) m4 <- update(m3, ~ . + color) m5 <- update(m4, ~ . + clarity) mtable(m1, m2, m3, m4, m5) ``` Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier. *** ### Model Problems Video Notes: Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this) Response: *** ### A Bigger, Better Data Set Notes: ```{r A Bigger, Better Data Set} install.package('bitops') install.packages('RCurl') library('bitops') library('RCurl') diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda") load(rawConnection(diamondsurl)) ``` The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data ## Building a Model Using the Big Diamonds Data Set Notes: ```{r Building a Model Using the Big Diamonds Data Set} ``` *** ## Predictions Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601 ```{r} #Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace. thisDiamond = data.frame(carat = 1.00, cut = "V.Good", color = "I", clarity="VS1") modelEstimate = predict(m5, newdata = thisDiamond, interval="prediction", level = .95) ``` Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI. *** ## Final Thoughts Notes: *** Click **KnitHTML** to see all of your hard work and to have an html page of this lesson, your answers, and your notes!