Exploration of White Wines by Dustin Pianalto

This report explores a dataset containing chemical information and ratings on almost 4900 white wine tastings.

Univariate Plots Section

## [1] 4898   12

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The data consist of 4,898 observations of 12 variables: 11 numeric inputs and one integer variable, quality, which is the output.

The distribution of quality looks fairly normal, with a peak at 6.

Alcohol looks slightly long-tailed, so I want to see what it looks like with a log transformation.

The fixed.acidity variable definitely has some outliers, but otherwise it has a fairly normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The middle half of the wines have a fixed acidity between 6.3 and 7.3 (the first and third quartiles). I am going to plot the data again with both the highest and lowest 1% of values removed to exclude the outliers.
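As a rough sketch of the trimming step (written in Python with only the standard library, not the code that produced the plots in this report), the idea is to compute the 1% and 99% quantiles and keep only the values between them. The function names and the sample values are made up for illustration; the quantile uses linear interpolation between sorted values.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)              # fractional index into the sorted data
    lower = int(pos)                          # index at or just below pos
    upper = min(lower + 1, len(ordered) - 1)  # neighbour above, clamped to the end
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def trim_tails(values, low=0.01, high=0.99):
    """Keep only the values between the low and high quantiles (inclusive)."""
    lo, hi = quantile(values, low), quantile(values, high)
    return [v for v in values if lo <= v <= hi]


# Made-up sample: 98 typical values plus the minimum and maximum
# fixed acidity from the summary (3.8 and 14.2).
sample = [6.8] * 98 + [3.8, 14.2]
trimmed = trim_tails(sample)
```

In this toy sample the two extreme values fall outside the 1% and 99% cut-offs, so only the 98 typical values remain.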

We now see a fairly normal distribution with a peak around 6.8, which matches both the median (6.8) and the mean (6.855) from the summary above.

We have another long-tailed distribution. I am going to plot it again, this time with a log10 transformation.

Citric acid has an odd spike at about 0.49. I might want to look into that more later.

Even with the top and bottom 1% removed, the residual sugar plot is still very long-tailed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Using a log10 transform with a bin width of 0.05 suggests a bimodal distribution. Decreasing the bin width to 0.01 shows that, while there are a lot of observations between ~4 and ~20, they are much more spread out, and there are more observations at each individual value from ~0.5 to ~2. This is probably because residual sugar is hard to measure on a truly continuous scale, so the measurement steps are more apparent at the lower values, which are more spread out on the log axis. We can also see in the summary of the data that the median is 5.2 and the mean is 6.4, which puts both of them between the two peaks.
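To make the bin width effect concrete, here is a small Python sketch (standard library only, not the code behind the plots). It assumes residual sugar is recorded in steps of about 0.1, which the sample values in the structure output suggest but which I have not verified for every row. The function names and readings are hypothetical.

```python
import math
from collections import Counter


def log_gap(value, step=0.1):
    """Distance on the log10 axis between a value and the next recorded value."""
    return math.log10(value + step) - math.log10(value)


def log10_bin_counts(values, binwidth):
    """Count positive values per bin, with bins of equal width in log10 units."""
    counts = Counter(math.floor(math.log10(v) / binwidth) for v in values)
    return dict(sorted(counts.items()))


# Hypothetical readings recorded in steps of 0.1
readings = [1.2, 1.3, 15.0, 15.1]
bins = log10_bin_counts(readings, binwidth=0.01)

# Near 1.2 one step is wider than a 0.01 bin, so neighbouring readings land in
# separate bins with empty bins between them. Near 15 one step is narrower
# than a bin, so neighbouring readings share a bin.
small_value_gap = log_gap(1.2)
large_value_gap = log_gap(15.0)
```

With a 0.01 bin width the two low readings fall into separate bins while the two high readings fall into the same one, which is the pattern described above.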

Here I just removed the top 3% of values to remove the long tail.

I plotted free sulfur dioxide and total sulfur dioxide together to save room and because they are related. Note that the scales differ on both axes.

The pH plot doesn’t need any modification.

Univariate Analysis

What is the structure of your dataset?

There are 4898 samples in the dataset, each with 11 input variables and a resulting quality assessment. All of the input variables are continuous numeric variables, and quality is an integer on a scale from 1 to 10, with an observed minimum of 3 and maximum of 9.

Observations:
* The most common quality is 6, and the distribution is fairly normal, slightly skewed towards the low end.
* Most of the variables are similar in distribution: long-tailed, but otherwise fairly normal.
* There are a couple of interesting features, though: citric acid has an odd spike around 0.49, and residual sugar appears to have more of a bimodal distribution.

What is/are the main feature(s) of interest in your dataset?

My main interest in this dataset is trying to determine which features have the greatest effect on the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think that alcohol, acidity, density, and pH will have the greatest impact on the quality.

Did you create any new variables from existing variables in the dataset?

I did not create any new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

For most of the variables, I either applied a log transformation or removed outliers to view the data better, since most of the distributions were long-tailed.

Bivariate Plots Section

Looking at this grid, not very many variables appear to correlate with each other, which I find surprising because some of them seem like they should. I am going to explore some of them in more detail.

One pair of variables that looks correlated is density and residual sugar, so I am going to start with them.

Narrowing in on the main section and adding a smoothing line.

We can see a general trend: as residual sugar increases, the density also increases. Let’s see both of these plotted against our output variable.

There doesn’t seem to be any direct correlation between these variables and the quality. Let’s look at some others.

Adding jitter to the alcohol plot reveals that there could be a correlation with quality, but it is very weak.

Looking at these other variables individually shows little to no relationship with the quality. I think this will change when we start combining variables in the multivariate plots.

One other interesting correlation that I want to look at is density vs alcohol.

Interestingly, it appears that as the alcohol content increases the density decreases, which is the inverse of the residual sugar vs density relationship that we plotted earlier. This probably has something to do with the fact that alcohol is created from sugar, so as the alcohol increases the sugar, and hence the density, would decrease. Ethanol is also less dense than water, which would lower the density further.

We can see this more directly by plotting residual sugar against alcohol.

We can see that there is a seemingly exponential relationship between alcohol and residual sugar.

There does seem to be a slight correlation between alcohol and chlorides.

There does not seem to be any correlation between our other features of interest.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I discovered some interesting relationships between density, residual sugar, and alcohol. The other features appear to have very little correlation with each other or with the quality. The other relationships that I noted are the ones that were expected. For instance, pH has a mild correlation with fixed acidity, although I expected a higher one. The same goes for total sulfur dioxide and free sulfur dioxide.

It does seem like there is a mild correlation between quality and alcohol, as well as between quality and density, which are two of the features I noted in the previous section. There also might be a slight relationship between quality and chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

One relationship that I found interesting is between alcohol and chlorides as well as between chlorides and quality. I wonder if this will show itself more in the multivariate exploration.

What was the strongest relationship you found?

By far the strongest relationship I found was between density and residual sugar.
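For reference, the strength of a linear relationship like this one is usually summarized with the Pearson correlation coefficient. The Python sketch below (standard library only, and not the code used for this report) computes it for just the first five rows printed in the structure output near the top of the report. Density is rounded in that printout, and five rows are far too few to stand in for the full 4898, so this only illustrates the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows as printed in the structure output (density is rounded there).
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]
r = pearson_r(residual_sugar, density)
```

For these five rows the coefficient comes out strongly positive, in line with the trend seen in the scatterplot.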

Multivariate Plots Section

This is really the only plot I have tried that seems to indicate any sort of correlation between any of the variables and the quality, and it is very weak. The quality is only slightly skewed towards higher alcohol and lower density, which makes sense given the inverse correlation between alcohol and density that we found earlier.

Let’s see if a linear model can make any predictions.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wqw)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wqw)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     chlorides + sulphates, data = wqw)
## m6: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     chlorides + sulphates + pH, data = wqw)
## m9: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     chlorides + sulphates + pH + fixed.acidity + volatile.acidity + 
##     citric.acid, data = wqw)
## m11: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar + 
##     chlorides + sulphates + pH + fixed.acidity + volatile.acidity + 
##     citric.acid + free.sulfur.dioxide + total.sulfur.dioxide, 
##     data = wqw)
## 
## ============================================================================================================
##                              m1            m2            m5            m6            m9           m11       
## ------------------------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    112.492***    134.445***    157.665***    150.193***  
##                            (0.098)       (6.165)      (12.783)      (13.137)      (18.458)      (18.804)    
##   I(alcohol)                0.313***      0.360***      0.209***      0.179***      0.182***      0.193***  
##                            (0.009)       (0.015)       (0.019)       (0.019)       (0.024)       (0.024)    
##   density                                24.728***   -110.148***   -133.690***   -157.700***   -150.284***  
##                                          (6.079)      (12.743)      (13.159)      (18.725)      (19.075)    
##   residual.sugar                                        0.061***      0.073***      0.087***      0.081***  
##                                                        (0.005)       (0.006)       (0.007)       (0.008)    
##   chlorides                                            -1.724**      -1.388*       -0.134        -0.247     
##                                                        (0.552)       (0.552)       (0.547)       (0.547)    
##   sulphates                                             0.749***      0.692***      0.658***      0.631***  
##                                                        (0.102)       (0.102)       (0.100)       (0.100)    
##   pH                                                                  0.532***      0.714***      0.686***  
##                                                                      (0.079)       (0.105)       (0.105)    
##   fixed.acidity                                                                     0.063**       0.066**   
##                                                                                    (0.021)       (0.021)    
##   volatile.acidity                                                                 -1.930***     -1.863***  
##                                                                                    (0.111)       (0.114)    
##   citric.acid                                                                       0.055         0.022     
##                                                                                    (0.096)       (0.096)    
##   free.sulfur.dioxide                                                                             0.004***  
##                                                                                                  (0.001)    
##   total.sulfur.dioxide                                                                           -0.000     
##                                                                                                  (0.000)    
## ------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.220         0.228         0.278         0.282     
##   adj. R-squared            0.190         0.192         0.220         0.227         0.277         0.280     
##   sigma                     0.797         0.796         0.782         0.779         0.753         0.751     
##   F                      1146.395       583.290       276.676       240.191       209.335       174.344     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5744.736     -5722.182     -5556.206     -5543.740     
##   Deviance               3112.257      3101.773      2994.261      2966.812      2772.404      2758.329     
##   AIC                   11684.782     11670.255     11503.472     11460.364     11134.411     11113.480     
##   BIC                   11704.272     11696.241     11548.948     11512.336     11205.874     11197.936     
##   N                      4898          4898          4898          4898          4898          4898         
## ============================================================================================================

As we can see, even when taking every feature into account, the R-squared is still only 0.282, so the model explains only about 28% of the variance in quality. That is dismal at best and indicates that we cannot make reliable predictions based on the data that we have.
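The adjusted R-squared row in the table penalizes R-squared for the number of predictors. As a quick check, this Python sketch applies the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), to the rounded values from the table. The function name is my own, and because the inputs are already rounded to three decimals the result can differ from the table in the last digit for some models.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# m11: R-squared 0.282, N = 4898, 11 predictors (values read from the table)
adj_m11 = adjusted_r_squared(0.282, 4898, 11)

# m9: R-squared 0.278, N = 4898, 9 predictors
adj_m9 = adjusted_r_squared(0.278, 4898, 9)
```

With almost 4900 observations the penalty is tiny, which is why the adjusted values in the table are nearly identical to the plain R-squared values.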

(I had to remove some of the intermediary steps to make it fit on the page.)

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

All of the features that I investigated in this section show a dramatic lack of correlation. Even when combining features in different ways there was little to no interaction.

There were a few things that I discovered earlier that were confirmed but there wasn’t really anything new to explore.

Were there any interesting or surprising interactions between features?

The only interesting thing was the complete lack of interesting interactions between features.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I did create a basic model and it was not able to predict anything. The main limitation of the model is that none of the features are correlated with the quality in any meaningful way.

Final Plots and Summary

Plot One

Description One

This is a good summary of the data that we have, and it shows that there is no direct correlation between any of the variables and the quality. You can see some moderate correlation between some of the features, such as residual sugar and density. Some of these correlations are ones I focused on.

Plot Two

Description Two

Alcohol content is the closest that any of the features came to correlating with the quality, and here you can see that even that correlation is very weak. The only thing we can tell is that higher quality wines tended to have a higher alcohol content than lower quality wines, but the spread in the data makes this a very weak correlation at 0.436.
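One way to read the 0.436 figure: for a least-squares line with a single predictor and an intercept, R-squared equals the square of the Pearson correlation. The short Python check below uses only the number quoted above, and the variable names are my own.

```python
r_alcohol_quality = 0.436            # correlation quoted in the description above
r_squared = r_alcohol_quality ** 2   # share of variance explained by alcohol alone
```

This works out to roughly 0.19, which agrees with the R-squared of 0.190 reported for model m1 (quality on alcohol only) in the regression table.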

Plot Three

Description Three

I include this plot just to show how there is no clear distinction in the quality when compared to the features of the data. This is representative of all of the plots I made in the multivariate section.

Reflection

The Wine dataset that I used contained information from almost 5,000 wine tastings with their quality rating included. Initially I examined the data to see the shape of each of the features and then started exploring how they interact with each other. Then I compared the features against the quality to see if any of the features could help to predict the quality of the product. Finally I created a linear model to see if there was anything I missed in the data that could create predictions.

In the beginning I thought that the quality would have something to do with the alcohol, density, pH, and acidity. As I examined the data it became more and more clear that there was little to no correlation between any of the features and the quality. I found this surprising and really wanted to find any little thing that would point towards a correlation, but nothing showed up. Finally, when I created the linear model, it was clear that you could not predict the quality of the wine from the data that we have in this dataset. We do see some small correlation between the alcohol content and the quality: it appears that the higher the alcohol content, the more likely the wine is to have a higher quality, but there is definitely not enough distinction to make any predictions.

I don’t know if more data points could make a difference, but at this point it seems that the quality of wine is subjective and is difficult, if not impossible, to predict. I might be able to improve the models with more manipulation of the data, but other models that I have seen max out at ~70% accuracy, such as Penn State’s STAT 897D Analysis of Wine Quality Data (https://onlinecourses.science.psu.edu/stat857/node/223/) and R-bloggers’ Predicting wine quality using Random Forests (https://www.r-bloggers.com/predicting-wine-quality-using-random-forests/), both of which use much more complex modeling than a basic linear model. R-bloggers used a random forest and achieved 71.5% accuracy, which is still very low for making predictions. They also had to collapse the quality variable into groups, where 3, 4, and 5 were considered low quality, 6 and 7 were medium, and 8 and 9 were high quality. I personally find this unacceptable, as it degrades the resolution of the output and artificially pushes the prediction rate higher. On a scale of 1-10, the difference between 8 and 9 can be quite substantial.
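To illustrate why grouping the scores pushes accuracy up, here is a small Python sketch using the low/medium/high grouping described above. The actual and predicted scores are invented for the example and do not come from the dataset or from either of the cited analyses.

```python
def quality_group(score):
    """Collapse a 3-9 quality score into the three groups described above."""
    if score <= 5:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


def accuracy(actual, predicted):
    """Fraction of positions where the prediction matches the actual label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Invented scores: most predictions are off by one point.
actual = [5, 6, 6, 7, 8, 4]
predicted = [5, 7, 6, 6, 9, 5]

exact_accuracy = accuracy(actual, predicted)
grouped_accuracy = accuracy(
    [quality_group(a) for a in actual],
    [quality_group(p) for p in predicted],
)
```

Every exact match is still a match after grouping, and many off-by-one misses become hits, so the grouped accuracy can only stay the same or go up when the same predictions are scored both ways.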

EDA_Project

Dusty P

July 19, 2018
