diff --git a/EDA_Project/EDA_Project.html b/EDA_Project/EDA_Project.html
index ca9ab21..1567aeb 100644
--- a/EDA_Project/EDA_Project.html
+++ b/EDA_Project/EDA_Project.html
@@ -179,7 +179,7 @@ $(document).ready(function () {

The distribution of the quality seems fairly normal with a peak at 6

The alcohol distribution seems slightly long-tailed, so I want to see what it looks like with a log transformation.

-

+

The fixed.acidity definitely has some outliers, but besides that it has a pretty normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
@@ -189,12 +189,12 @@ $(document).ready(function () {
 

And we see a fairly normal distribution with a peak around 6.8 which matches both the median (6.8) and mean (6.855) from the summary above.

We have another long tailed distribution. I am going to plot again with a log_10 transformation this time.

-

+

There is an odd spike at about 0.49. I might want to look into that more later.

Even with the top and bottom 1% removed, the plot is still very long-tailed.

-

+

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 ##   0.600   1.700   5.200   6.391   9.900  65.800

Using a log_10 transform with a bin width of 0.05 indicates a bimodal distribution. Decreasing the bin width to 0.01 shows that while there are a lot of observations between ~4 and ~20, they are much more spread out, and there are more observations at each individual value from ~0.5 to ~2. This is probably because residual sugar is recorded in discrete steps rather than on a truly continuous scale, so the steps are more apparent at the lower values, which are more spread out on the log scale. The summary shows a median of 5.2 and a mean of 6.4, which puts both of them in between the two peaks.
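The bin-width effect described above can be reproduced without the wine data. The following is a minimal sketch in base R, assuming a simulated stand-in for `residual.sugar` (two log-normal clusters rounded to one decimal place). The names `sugar` and `empty_bins` are hypothetical and not part of the project. It counts empty bins on the log_10 scale for both bin widths, which is one way to quantify how visible the measurement steps become.

```r
# Simulated stand-in for wqw$residual.sugar (assumption: NOT the real data).
# Values are rounded to one decimal place to mimic discrete measurement steps.
set.seed(42)
sugar <- round(c(rlnorm(3000, meanlog = log(1.5), sdlog = 0.4),
                 rlnorm(2000, meanlog = log(10),  sdlog = 0.4)), 1)
sugar <- sugar[sugar > 0]          # log10 is undefined at 0

log_sugar <- log10(sugar)

# Count the empty histogram bins on the log10 scale for a given bin width.
empty_bins <- function(x, width) {
  breaks <- seq(min(x) - width, max(x) + 2 * width, by = width)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  sum(counts == 0)
}

wide   <- empty_bins(log_sugar, 0.05)   # coarse bins hide most of the steps
narrow <- empty_bins(log_sugar, 0.01)   # fine bins expose gaps between steps

cat("empty bins at width 0.05:", wide, "\n")
cat("empty bins at width 0.01:", narrow, "\n")
```

At low values, neighbouring recorded steps (for example 0.5 and 0.6) are far apart on a log scale, so a narrow bin width leaves many bins empty there, which is what makes the low end look like isolated spikes.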

@@ -235,22 +235,22 @@ of the data? If so, why did you do this?

Bivariate Plots Section

-

+

Looking at this grid, not many variables appear to correlate with each other, which I find surprising, as some of them seem like they should. I am going to explore some of them in more detail.

One pair of variables that looks like it has some correlation is density and residual sugar, so I am going to start with them.

Narrowing in on the main section and adding a smoothing line.

We can see a general trend: as residual sugar increases, the density also increases. Let’s see both of these plotted against our output variable.

-

+

There doesn’t seem to be any direct correlation between these variables and the quality. Let’s look at some others.

-

+

Adding jitter to the alcohol plot reveals that there could possibly be a correlation with quality, but it is very weak.

-

-

-

+

+

+

Looking at these other variables shows that individually they have little to no relationship with the quality. I think this will change when we start combining variables in the multivariate plots.

One other interesting correlation that I want to look at is density vs alcohol.

@@ -283,71 +283,135 @@ of the data? If so, why did you do this?

Multivariate Plots Section

-

-

-

-

-

-

-

This is really the only plot I have tried that seems to indicate any sort of correlation between any of the variables and the quality, and it is very weak. The quality is only slightly skewed towards higher alcohol and lower density (which makes sense, since we discovered an inverse correlation between alcohol and density earlier).

+

Since there seems to be a relationship between alcohol and chlorides, as well as between chlorides and quality, let’s take a look at that relationship first.

+

+

I find this surprising. I expected at least a mild distinction in this plot, but it only shows a general trend that the higher the alcohol, the more likely the wine is to have a higher quality. There isn’t anything here we can use to make accurate predictions.

+

Let’s take a look at some other relationships we identified earlier.

+

+

Again, there is just a higher chance of a higher quality as the alcohol increases. It doesn’t look like the residual sugar plays into it much at all.

+

+

There is no real distinction here, possibly a slightly higher chance of high quality at a lower density. But apparently pH doesn’t matter at all.

+

+

It looks like there might be a trend towards lower fixed acidity. I wonder about a combination of fixed and volatile acidity when combined with alcohol.

+

+

This doesn’t really appear to be any different from just alcohol content. There might be a slight trend towards lower acidity.

+

+

These last two plots are really the only ones I have tried that seem to indicate any sort of correlation between any of the variables and the quality, and it is very weak. The quality is only slightly skewed towards higher alcohol and lower density or lower acidity.

Let’s see if a linear model can make any predictions.

## 
 ## Calls:
-## m1: lm(formula = I(quality) ~ I(alcohol), data = wqw)
-## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wqw)
-## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
-##     chlorides + sulphates, data = wqw)
-## m6: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
-##     chlorides + sulphates + pH, data = wqw)
-## m9: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
-##     chlorides + sulphates + pH + fixed.acidity + volatile.acidity + 
-##     citric.acid, data = wqw)
-## m11: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
-##     chlorides + sulphates + pH + fixed.acidity + volatile.acidity + 
-##     citric.acid + free.sulfur.dioxide + total.sulfur.dioxide, 
-##     data = wqw)
+## m1: lm(formula = quality ~ alcohol, data = wqw)
+## m2: lm(formula = quality ~ alcohol + density, data = wqw)
+## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates, data = wqw)
+## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity, data = wqw)
+## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity + citric.acid + 
+##     free.sulfur.dioxide, data = wqw)
+## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity + citric.acid + 
+##     free.sulfur.dioxide + total.sulfur.dioxide, data = wqw)
 ## 
 ## ============================================================================================================
-##                              m1            m2            m5            m6            m9           m11       
+##                              m1            m2            m5            m7            m9           m10       
 ## ------------------------------------------------------------------------------------------------------------
-##   (Intercept)               2.582***    -22.492***    112.492***    134.445***    157.665***    150.193***  
-##                            (0.098)       (6.165)      (12.783)      (13.137)      (18.458)      (18.804)    
-##   I(alcohol)                0.313***      0.360***      0.209***      0.179***      0.182***      0.193***  
-##                            (0.009)       (0.015)       (0.019)       (0.019)       (0.024)       (0.024)    
-##   density                                24.728***   -110.148***   -133.690***   -157.700***   -150.284***  
-##                                          (6.079)      (12.743)      (13.159)      (18.725)      (19.075)    
-##   residual.sugar                                        0.061***      0.073***      0.087***      0.081***  
-##                                                        (0.005)       (0.006)       (0.007)       (0.008)    
-##   chlorides                                            -1.724**      -1.388*       -0.134        -0.247     
-##                                                        (0.552)       (0.552)       (0.547)       (0.547)    
-##   sulphates                                             0.749***      0.692***      0.658***      0.631***  
-##                                                        (0.102)       (0.102)       (0.100)       (0.100)    
-##   pH                                                                  0.532***      0.714***      0.686***  
-##                                                                      (0.079)       (0.105)       (0.105)    
-##   fixed.acidity                                                                     0.063**       0.066**   
-##                                                                                    (0.021)       (0.021)    
-##   volatile.acidity                                                                 -1.930***     -1.863***  
-##                                                                                    (0.111)       (0.114)    
-##   citric.acid                                                                       0.055         0.022     
+##   (Intercept)               2.582***    -22.492***    112.492***    156.891***    152.979***    150.193***  
+##                            (0.098)       (6.165)      (12.783)      (18.407)      (18.439)      (18.804)    
+##   alcohol                   0.313***      0.360***      0.209***      0.183***      0.193***      0.193***  
+##                            (0.009)       (0.015)       (0.019)       (0.024)       (0.024)       (0.024)    
+##   density                                24.728***   -110.148***   -156.909***   -153.111***   -150.284***  
+##                                          (6.079)      (12.743)      (18.673)      (18.704)      (19.075)    
+##   residual.sugar                                        0.061***      0.087***      0.082***      0.081***  
+##                                                        (0.005)       (0.007)       (0.007)       (0.008)    
+##   chlorides                                            -1.724**      -0.099        -0.251        -0.247     
+##                                                        (0.552)       (0.544)       (0.546)       (0.547)    
+##   sulphates                                             0.749***      0.661***      0.626***      0.631***  
+##                                                        (0.102)       (0.100)       (0.100)       (0.100)    
+##   pH                                                                  0.709***      0.688***      0.686***  
+##                                                                      (0.105)       (0.105)       (0.105)    
+##   fixed.acidity                                                       0.065**       0.066**       0.066**   
+##                                                                      (0.021)       (0.021)       (0.021)    
+##   volatile.acidity                                                   -1.942***     -1.880***     -1.863***  
+##                                                                      (0.110)       (0.112)       (0.114)    
+##   citric.acid                                                                       0.019         0.022     
 ##                                                                                    (0.096)       (0.096)    
-##   free.sulfur.dioxide                                                                             0.004***  
-##                                                                                                  (0.001)    
+##   free.sulfur.dioxide                                                               0.003***      0.004***  
+##                                                                                    (0.001)       (0.001)    
 ##   total.sulfur.dioxide                                                                           -0.000     
 ##                                                                                                  (0.000)    
 ## ------------------------------------------------------------------------------------------------------------
-##   R-squared                 0.190         0.192         0.220         0.228         0.278         0.282     
-##   adj. R-squared            0.190         0.192         0.220         0.227         0.277         0.280     
-##   sigma                     0.797         0.796         0.782         0.779         0.753         0.751     
-##   F                      1146.395       583.290       276.676       240.191       209.335       174.344     
+##   R-squared                 0.190         0.192         0.220         0.278         0.282         0.282     
+##   adj. R-squared            0.190         0.192         0.220         0.277         0.280         0.280     
+##   sigma                     0.797         0.796         0.782         0.753         0.751         0.751     
+##   F                      1146.395       583.290       276.676       235.493       191.738       174.344     
 ##   p                         0.000         0.000         0.000         0.000         0.000         0.000     
-##   Log-likelihood        -5839.391     -5831.127     -5744.736     -5722.182     -5556.206     -5543.740     
-##   Deviance               3112.257      3101.773      2994.261      2966.812      2772.404      2758.329     
-##   AIC                   11684.782     11670.255     11503.472     11460.364     11134.411     11113.480     
-##   BIC                   11704.272     11696.241     11548.948     11512.336     11205.874     11197.936     
+##   Log-likelihood        -5839.391     -5831.127     -5744.736     -5556.370     -5544.026     -5543.740     
+##   Deviance               3112.257      3101.773      2994.261      2772.590      2758.651      2758.329     
+##   AIC                   11684.782     11670.255     11503.472     11132.740     11112.053     11113.480     
+##   BIC                   11704.272     11696.241     11548.948     11197.706     11190.012     11197.936     
 ##   N                      4898          4898          4898          4898          4898          4898         
 ## ============================================================================================================
-

As we can see, even when taking into account every feature the R-squared is still only 0.282, which is dismal at best and indicates that we cannot make useful predictions based on the data that we have.

-

(I had to remove some of the intermediary steps to make it fit on the page.)

+

+

Looking at the residual plots, there appears to be one outlier that could be affecting the output of the model, so I am going to remove that data point and re-run the model.

+
## 
+## Calls:
+## m1: lm(formula = quality ~ alcohol, data = wqw.new)
+## m2: lm(formula = quality ~ alcohol + density, data = wqw.new)
+## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates, data = wqw.new)
+## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity, data = wqw.new)
+## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity + citric.acid + 
+##     free.sulfur.dioxide, data = wqw.new)
+## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides + 
+##     sulphates + pH + fixed.acidity + volatile.acidity + citric.acid + 
+##     free.sulfur.dioxide + total.sulfur.dioxide, data = wqw.new)
+## 
+## ============================================================================================================
+##                              m1            m2            m5            m7            m9           m10       
+## ------------------------------------------------------------------------------------------------------------
+##   (Intercept)               2.582***    -27.042***    121.165***    212.207***    211.030***    211.424***  
+##                            (0.098)       (6.608)      (13.972)      (21.975)      (21.978)      (22.737)    
+##   alcohol                   0.314***      0.369***      0.196***      0.112***      0.119***      0.119***  
+##                            (0.009)       (0.015)       (0.020)       (0.029)       (0.029)       (0.029)    
+##   density                                29.213***   -118.763***   -212.821***   -211.794***   -212.194***  
+##                                          (6.516)      (13.921)      (22.268)      (22.271)      (23.040)    
+##   residual.sugar                                        0.064***      0.105***      0.101***      0.101***  
+##                                                        (0.005)       (0.008)       (0.008)       (0.009)    
+##   chlorides                                            -1.736**       0.073        -0.080        -0.080     
+##                                                        (0.552)       (0.544)       (0.546)       (0.546)    
+##   sulphates                                             0.762***      0.734***      0.702***      0.702***  
+##                                                        (0.102)       (0.101)       (0.101)       (0.101)    
+##   pH                                                                  0.882***      0.869***      0.869***  
+##                                                                      (0.111)       (0.112)       (0.112)    
+##   fixed.acidity                                                       0.107***      0.111***      0.111***  
+##                                                                      (0.023)       (0.023)       (0.023)    
+##   volatile.acidity                                                   -1.939***     -1.874***     -1.875***  
+##                                                                      (0.109)       (0.111)       (0.114)    
+##   citric.acid                                                                       0.025         0.025     
+##                                                                                    (0.095)       (0.096)    
+##   free.sulfur.dioxide                                                               0.003***      0.003***  
+##                                                                                    (0.001)       (0.001)    
+##   total.sulfur.dioxide                                                                            0.000     
+##                                                                                                  (0.000)    
+## ------------------------------------------------------------------------------------------------------------
+##   R-squared                 0.190         0.193         0.221         0.281         0.285         0.285     
+##   adj. R-squared            0.190         0.193         0.220         0.280         0.284         0.284     
+##   sigma                     0.797         0.796         0.782         0.752         0.750         0.750     
+##   F                      1146.259       585.416       277.220       239.085       194.942       177.184     
+##   p                         0.000         0.000         0.000         0.000         0.000         0.000     
+##   Log-likelihood        -5838.650     -5828.614     -5742.882     -5545.220     -5531.742     -5531.740     
+##   Deviance               3112.194      3099.464      2992.817      2760.709      2745.554      2745.551     
+##   AIC                   11683.300     11665.228     11499.764     11110.440     11087.484     11089.480     
+##   BIC                   11702.789     11691.214     11545.238     11175.404     11165.441     11173.933     
+##   N                      4897          4897          4897          4897          4897          4897         
+## ============================================================================================================
+

+

Removing it gave only a very slight improvement to the model (R-squared went from 0.282 to 0.285), and it looks like we got rid of all the major outliers.
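The outlier above was picked by eye from the residual plots and dropped by row number. As an alternative, here is a minimal sketch that locates the most influential row with Cook's distance (base R's `cooks.distance`) before refitting. It assumes simulated data with one planted extreme row. The data frame `wine`, the planted row 57, and the coefficients are hypothetical and stand in for `wqw`.

```r
# Simulated stand-in for wqw (assumption: NOT the real data).
set.seed(1)
n <- 200
alcohol <- runif(n, 8, 14)
density <- 1.0 - 0.002 * (alcohol - 8) + rnorm(n, sd = 0.001)
quality <- 3 + 0.3 * alcohol - 200 * (density - 0.995) + rnorm(n, sd = 0.7)
wine <- data.frame(quality, alcohol, density)

# Plant one extreme density value without adjusting quality to match,
# so the row is both high-leverage and inconsistent with the other rows.
wine$density[57] <- 1.039

fit <- lm(quality ~ alcohol + density, data = wine)

# Cook's distance: how much the fitted values move when a row is left out.
cd    <- cooks.distance(fit)
worst <- which.max(cd)

# Refit without the most influential row and compare R-squared.
wine_trimmed <- wine[-worst, ]
fit_trimmed  <- update(fit, data = wine_trimmed)

cat("most influential row:", worst, "\n")
cat("R-squared before:", summary(fit)$r.squared,
    "after:", summary(fit_trimmed)$r.squared, "\n")
```

Selecting the row programmatically keeps the removal reproducible if the row order of the data changes, which a hard-coded index would not.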

+

As we can see, even when taking into account every feature and removing the outlier, the R-squared is still only 0.285, which is dismal at best and indicates that we cannot make useful predictions based on the data that we have.

+

(I had to remove some of the intermediary steps to make the model fit on the page.)
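Since the printed table has to omit some intermediate models, one option is to collect a few fit statistics for every step into a compact data frame instead. This is a minimal sketch assuming simulated data and only three predictors. The `wine` data frame and its coefficients are hypothetical, while `reformulate`, `lm`, `summary` and `AIC` are standard base R functions.

```r
# Simulated stand-in for wqw (assumption: NOT the real data).
set.seed(7)
n <- 500
wine <- data.frame(alcohol   = runif(n, 8, 14),
                   pH        = rnorm(n, 3.2, 0.15),
                   sulphates = rnorm(n, 0.5, 0.1))
wine$quality <- 3 + 0.3 * wine$alcohol + rnorm(n, sd = 0.8)

predictors <- c("alcohol", "pH", "sulphates")

# Fit quality ~ (first k predictors) for each k and keep one row per model.
steps <- lapply(seq_along(predictors), function(k) {
  f <- reformulate(predictors[1:k], response = "quality")
  m <- lm(f, data = wine)
  data.frame(n_terms = k,
             adj_r2  = summary(m)$adj.r.squared,
             aic     = AIC(m))
})
comparison <- do.call(rbind, steps)
print(comparison)
```

One row per model fits on a page regardless of how many predictors are added, and adjusted R-squared and AIC both penalise extra terms, so they show whether a new predictor is earning its place.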

Multivariate Analysis

@@ -373,7 +437,7 @@ strengths and limitations of your model.

Final Plots and Summary

Plot One

-

+

Description One

@@ -381,19 +445,19 @@ strengths and limitations of your model.

Plot Two

-

+

Description Two

-

Alcohol content is the closest that any of the features came to correlating with the quality, and here you can see that even that correlation is weak. The only thing we can tell is that higher quality wines more often had a higher alcohol content than lower quality wines, but the spread of the data keeps the correlation weak at 0.436.

+

The only distinction I was able to discover was based on alcohol content, and it is very slight at best. It does appear that a higher alcohol content increases the chance of a higher quality product, but there is no clear distinction that can be seen. While the high quality products mostly have a higher alcohol content and the low quality products have a lower alcohol content, the mid-range products span the whole spectrum. Based on this it would be hard to determine the difference between a 6, 7, 8, or 9 quality based on the data provided. But you could probably tell the difference between a 4 and an 8.

Plot Three

-

+

Description Three

-

I include this plot just to show how there is no clear distinction in the quality when compared to the features of the data. This is representative of all of the plots I made in the multivariate section.

+

After some research, it appears that the pattern shown in the Residuals vs. Fitted plot is most likely due to the fact that our dependent variable has only a few possible values. The patterns in the Scale-Location plot could indicate that a linear model is not the best fit for our data.
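The banding in the Residuals vs. Fitted plot follows directly from the arithmetic: a residual is the observed score minus the fitted value, so all wines with the same integer score fall on one straight line with slope -1. This is a minimal sketch on simulated data that checks this numerically. The variables `latent` and `bands` and the coefficients used are hypothetical.

```r
# Simulated stand-in for the wine data (assumption: NOT the real data).
set.seed(3)
n <- 1000
alcohol <- runif(n, 8, 14)
latent  <- 2 + 0.35 * alcohol + rnorm(n, sd = 0.8)
quality <- pmin(pmax(round(latent), 3), 9)   # integer scores clipped to 3..9

fit         <- lm(quality ~ alcohol)
res         <- resid(fit)
fitted_vals <- fitted(fit)

# Within one score, residual + fitted equals that score for every wine,
# so the spread of (residual + fitted) inside each group is essentially 0.
bands <- tapply(res + fitted_vals, quality, function(v) diff(range(v)))
print(bands)

# Uncomment to see the parallel bands, one per quality score:
# plot(fitted_vals, res, col = factor(quality), pch = 20)
```

Because the stripes come from the response being discrete rather than from a defect in the fit, a model built for ordered categories (for example an ordinal regression) may be worth trying. That is a suggestion only and was not tested in this project.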


diff --git a/EDA_Project/EDA_Project.rmd b/EDA_Project/EDA_Project.rmd index 5182925..9aace5e 100644 --- a/EDA_Project/EDA_Project.rmd +++ b/EDA_Project/EDA_Project.rmd @@ -6,7 +6,8 @@ output: html_document --- ```{r echo=FALSE, message=FALSE, warning=FALSE, setup} -knitr::opts_knit$set(root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project")) +knitr::opts_knit$set( + root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project")) # load the ggplot graphics package and the others library(ggplot2) @@ -18,7 +19,8 @@ library(RColorBrewer) library(bitops) library(RCurl) -cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), +cuberoot_trans = function() trans_new('cuberoot', + transform = function(x) x^(1/3), inverse = function(x) x^3) ``` @@ -66,7 +68,7 @@ The Alcohol seems to be slightly long tailed, I want to see what it is like with ```{r echo=FALSE, warning=FALSE, alcohol_histogram_log} ggplot(aes(x = alcohol), data = wqw) + geom_histogram(binwidth = .005) + - scale_x_log10() + scale_x_log10(breaks = c(8, 9, 10, 11, 12, 13, 14)) ``` ```{r echo=FALSE, warning=FALSE, fixed.acidity_histogram} @@ -101,7 +103,7 @@ We have another long tailed distribution. 
I am going to plot again with a log_10 ```{r echo=FALSE, warning=FALSE, volatile.acidity_histogram_log} ggplot(aes(x = volatile.acidity), data = wqw) + geom_histogram(binwidth = .04) + - scale_x_log10() + scale_x_log10(breaks = seq(0.1, 1.0, 0.1)) ``` ```{r echo=FALSE, warning=FALSE, citric.acid_histogram} @@ -123,10 +125,10 @@ Even with the top and bottom 1% removed the plot is still very long tailed ```{r echo=FALSE, warning=FALSE, residual.sugar_histogram_log} p1 <- ggplot(aes(x = residual.sugar), data = wqw) + geom_histogram(binwidth = .05) + - scale_x_log10() + scale_x_log10(breaks = c(0, 1, 2, 4, 6, 8, 12, 16, 20, 40, 65)) p2 <- ggplot(aes(x = residual.sugar), data = wqw) + geom_histogram(binwidth = .01) + - scale_x_log10(breaks = seq(0, 20, 2)) + scale_x_log10(breaks = c(0, 1, 2, 4, 6, 8, 12, 16, 20, 40, 65)) grid.arrange(p1, p2) ``` @@ -150,7 +152,8 @@ p1 <- ggplot(aes(x = free.sulfur.dioxide), data = wqw) + xlim(0, quantile(wqw$free.sulfur.dioxide, 0.99)) p2 <- ggplot(aes(x = total.sulfur.dioxide), data = wqw) + geom_histogram(binwidth = 1) + - xlim(quantile(wqw$total.sulfur.dioxide, 0.01), quantile(wqw$total.sulfur.dioxide, 0.99)) + xlim(quantile(wqw$total.sulfur.dioxide, 0.01), + quantile(wqw$total.sulfur.dioxide, 0.99)) grid.arrange(p1, p2) ``` @@ -207,8 +210,10 @@ I either log transformed or removed the outliers on most of the datapoints to be # Bivariate Plots Section -```{r echo=FALSE, warning=FALSE, Bivariate_Plots} -ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)), lower = list(continuous = wrap("smooth", alpha=0.2, color = "orange"))) + +```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Bivariate_Plots} +ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)), + lower = list(continuous = wrap("smooth", alpha=0.2, + color = "orange"))) + theme_grey(base_size = 6) ``` @@ -239,8 +244,10 @@ ggplot(aes(x = quality, y = density), data = wqw) + geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") + 
geom_line(stat = 'summary', fun.y = mean, color = "blue") + geom_line(stat = 'summary', fun.y = median) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), + color = 'red', linetype = 2) + + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), + color = 'red', linetype = 2) ``` ```{r echo=FALSE, warning=FALSE, quality_vs_residual.sugar} @@ -262,8 +269,10 @@ ggplot(aes(x = quality, y = alcohol), data = wqw) + geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") + geom_line(stat = 'summary', fun.y = mean, color = "blue") + geom_line(stat = 'summary', fun.y = median) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), + color = 'red', linetype = 2) + + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), + color = 'red', linetype = 2) ``` Adding jitter to the alcohol plot reveals that there could possibly be a corelation to quality but it is very weak. 
@@ -279,8 +288,10 @@ ggplot(aes(x = quality, y = chlorides), data = wqw) + ylim(0, 0.1) + geom_line(stat = 'summary', fun.y = mean, color = "blue") + geom_line(stat = 'summary', fun.y = median) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), + color = 'red', linetype = 2) + + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), + color = 'red', linetype = 2) ``` ```{r echo=FALSE, warning=FALSE, quality_vs_tsd} @@ -358,62 +369,118 @@ By far the strongest relationship I found was between density and residual sugar # Multivariate Plots Section +Since there seems to be a relationship bewteen alcohol and chlorides as well as chlorides and quality lets take a look at that relationship first. + ```{r echo=FALSE, warning=FALSE, alcohol_chlorides_quality} ggplot(aes(x = alcohol, y = chlorides), data = wqw) + - geom_point(aes(color = quality)) + geom_point(aes(color = factor(quality))) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` +I find this to be suprising. I expected at least a mild distiction in this plot but it only shows a general trend that the higher the alcohol the more likely to have a higher quality but there isn't anything here we can use to make accurate predictions. + +Lets take a look at some other relationships we identified earlier. + ```{r echo=FALSE, warning=FALSE, alcohol_residual.sugar_quality} ggplot(aes(x = alcohol, y = residual.sugar), data = wqw) + - geom_point(aes(color = quality)) + - ylim(0, 30) + geom_point(aes(color = factor(quality))) + + ylim(0, 30) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` +Again just a higher chance for a higher quality as the alcohol increases. It doesn't look like the residual sugar plays into it much at all. 
+ ```{r echo=FALSE, warning=FALSE, density_pH_quality} ggplot(aes(x = density, y = pH), data = wqw) + - geom_point(aes(color = quality)) + - xlim(0.985, 1.005) + geom_point(aes(color = factor(quality))) + + xlim(0.985, 1.005) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` -```{r echo=FALSE, warning=FALSE, free.sulfur.dioxide_pH_quality} -ggplot(aes(x = free.sulfur.dioxide, y = pH), data = wqw) + - geom_point(aes(color = quality)) + - xlim(0, 100) +There is no real discinction here, possibly a slightly higher chance for high quality at a lower density. But apparently pH doesn't matter at all. + +```{r echo=FALSE, warning=FALSE, free.sulfur.dioxide_fixed.acidity_quality} +ggplot(aes(x = free.sulfur.dioxide, y = fixed.acidity), data = wqw) + + geom_point(aes(color = factor(quality))) + + xlim(0, 100) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` -```{r echo=FALSE, warning=FALSE, alcohol_pH_quality} -ggplot(aes(x = alcohol, y = pH), data = wqw) + - geom_point(aes(color = quality)) +It looks like there might be a trend towards lower fixed acidity. I wonder about a combination of fixed and volatile acidity when combined with alcohol. + +```{r echo=FALSE, warning=FALSE, alcohol_fixed.volatile.acidity_quality} +ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) + + geom_point(aes(color = factor(quality))) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` +Doesn't really appear to be any different than just alcohol content. There might be a slight trend towards lower acidity. 
+ ```{r echo=FALSE, warning=FALSE, alcohol_density_quality} ggplot(aes(x = alcohol, y = density), data = wqw) + - geom_point(aes(color = quality), position = position_jitter(h = 0)) + - ylim(0.985, 1.005) + geom_point(aes(color = factor(quality)), position = position_jitter(h = 0)) + + ylim(0.985, 1.005) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() ``` -This is really the only plot that I have tried that seems to indicate any sort of corelation between any of the variables and the quality and it is very weak. The quality is only slightly squewed towards higher alcohol and lower density (which we discovered an inverse corelation between alcohol and density earlier so that should make sense). +These last two plots are really the only ones that I have tried that seems to indicate any sort of corelation between any of the variables and the quality and it is very weak. The quality is only slightly squewed towards higher alcohol and lower density or lower acidity. Lets see if a linear model can make any predictions. ```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model} -m1 <- lm(I(quality) ~ I(alcohol), data = wqw) +m1 <- lm(quality ~ alcohol, data = wqw) m2 <- update(m1, ~ . + density) m3 <- update(m2, ~ . + residual.sugar) m4 <- update(m3, ~ . + chlorides) m5 <- update(m4, ~ . + sulphates) m6 <- update(m5, ~ . + pH) -m7 <- update(m6, ~ . + fixed.acidity) -m8 <- update(m7, ~ . + volatile.acidity) -m9 <- update(m8, ~ . + citric.acid) -m10 <- update(m9, ~ . + free.sulfur.dioxide) -m11 <- update(m10, ~ . + total.sulfur.dioxide) -mtable(m1, m2, m5, m6, m9, m11, sdigits = 3) +m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity) +m8 <- update(m7, ~ . + citric.acid) +m9 <- update(m8, ~ . + free.sulfur.dioxide) +m10 <- update(m9, ~ . 
+ total.sulfur.dioxide) +mtable(m1, m2, m5, m7, m9, m10, sdigits = 3) ``` -As we can see even when taking into account every feature the R-squared is still only 0.282 which is dismal at best and indicates that we cannot make any predictions based on the data that we have. +```{r echo=FALSE, warning=FALSE, Plotting_Residuals} +par(mfrow=c(2,2)) +plot(m10) +par(mfrow=c(1,1)) +``` -(I had to remove some of the intermediary steps to make it fit on the page.) +Looking at the residual plots there appears to be one outlier that could be affecting the output of the model so I am going to remove that data point and re-run the model. + +```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model_2} +wqw.new = wqw[-2782,] +m1 <- lm(quality ~ alcohol, data = wqw.new) +m2 <- update(m1, ~ . + density) +m3 <- update(m2, ~ . + residual.sugar) +m4 <- update(m3, ~ . + chlorides) +m5 <- update(m4, ~ . + sulphates) +m6 <- update(m5, ~ . + pH) +m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity) +m8 <- update(m7, ~ . + citric.acid) +m9 <- update(m8, ~ . + free.sulfur.dioxide) +m10 <- update(m9, ~ . + total.sulfur.dioxide) +mtable(m1, m2, m5, m7, m9, m10, sdigits = 3) +``` + +```{r echo=FALSE, warning=FALSE, Plotting_Residuals_2} +par(mfrow=c(2,2)) +plot(m10) +par(mfrow=c(1,1)) +``` + +We got a very slight increase to the model but not very much and it looks like we got rid of all the major outliers. + +As we can see even when taking into account every feature and removing the outlier the R-squared is still only 0.285 which is dismal at best and indicates that we cannot make any predictions based on the data that we have. + +(I had to remove some of the intermediary steps to make the model fit on the page.) # Multivariate Analysis @@ -439,8 +506,10 @@ I did create a basic model and it was not able to predict anything. 
The main lim # Final Plots and Summary ### Plot One -```{r echo=FALSE, Plot_One} -ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)), lower = list(continuous = wrap("smooth", alpha=0.2, color = "orange"))) + ```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Plot_One} +ggpairs(wqw, upper = list(continuous = wrap("cor", size = 3)), + lower = list(continuous = wrap("smooth", alpha=0.2, + color = "orange"))) + theme_grey(base_size = 6) + ggtitle("Scatterplot Matrix") + theme(plot.title = element_text(size=22, hjust = 0.5)) @@ -452,34 +521,28 @@ This is a good summary of the data that we have and it shows how there is no dir ### Plot Two ```{r echo=FALSE, warning=FALSE, Plot_Two} -ggplot(aes(x = quality, y = alcohol), data = wqw) + - geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") + - geom_line(stat = 'summary', fun.y = mean, color = "blue") + - geom_line(stat = 'summary', fun.y = median, color = "black") + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) + - geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) + - ggtitle("Alcohol vs Quality") + - xlab("Quality") + - ylab("Alcohol") + +ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) + + geom_point(aes(color = factor(quality))) + + scale_color_brewer(palette = "RdYlGn") + + theme_dark() + + labs(x = "Alcohol (%)", y = "Total Acidity (g/dm^3)", title = "Acidity vs Alcohol by Quality", color = "Quality") + theme(plot.title = element_text(size=22, hjust = 0.5)) ``` ### Description Two -Alcohol content is the closest that any of the features came to correlating with the quality and here you can see that even that correlation is very weak. The only thing we can tell is that more higher quality wines had a higher alcohol content than lower quality, but the spread on the data makes this a very weak correlation at 0.436. 
+The only distinction I was able to discover was based on alcohol content and it is very slight at best. It does appear that a higher alcohol content increases the chance of a higher quality product but there is no clear distinction that can be seen. While the high quality products mostly have a higher alcohol content and low quality products have lower alcohol content the mid range products span the whole spectrum. Based on this it would be hard to determine the difference between a 6, 7, 8, or 9 quality based on the data provided. But you could probably tell the difference between a 4 and an 8. ### Plot Three ```{r echo=FALSE, warning=FALSE, Plot_Three} -ggplot(aes(x = density, y = alcohol), data = wqw) + - geom_point(aes(color = quality)) + - xlim(0.985, 1.005) + - labs(x = "Density", y = "Alcohol", title = "Alcohol vs Density by Quality", color = "Quality") + - theme(plot.title = element_text(size=22, hjust = 0.5)) +par(mfrow=c(2,2)) +plot(m10) +par(mfrow=c(1,1)) ``` ### Description Three -I include this plot just to show how there is no clear distinction in the quality when compared to the features of the data. This is representative of all of the plots I made in the multivariate section. +After some research it appears that the pattern shown in the Residuals vs. Fitted plot is most likely due to the fact that our dependent variable has only a few possible values. The patterns in the Scale-Location could indicate that a linear model is not the best for our data. ------ diff --git a/EDA_Project/EDA_Project.zip b/EDA_Project/EDA_Project.zip new file mode 100644 index 0000000..e65ba27 Binary files /dev/null and b/EDA_Project/EDA_Project.zip differ