This report explores a dataset containing chemical information and ratings on almost 4900 white wine tastings.
## [1] 4898 12
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The data consist of 4,898 observations of 12 variables: 11 numeric chemical measurements and one integer attribute, quality, which is the output.
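As a minimal sketch of how this overview can be produced, the snippet below rebuilds the first five rows from the `str()` output above and inspects them with base R. The name `wqw_head` is my own, and only five rows are used because `str()` truncates the density column after five values (those density values are also rounded for display). On the full data frame the same three calls would give the output shown above.

```r
# First five rows, copied from the str() output above.
# wqw_head is a hypothetical name; the density values are as displayed (rounded).
wqw_head <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.4, 0.4),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

dim(wqw_head)      # number of rows and columns
str(wqw_head)      # type of each column
summary(wqw_head)  # min, quartiles, mean and max per column

# Count double-precision versus integer columns.
n_double  <- sum(sapply(wqw_head, is.double))
n_integer <- sum(sapply(wqw_head, is.integer))
```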
The distribution of quality looks fairly normal, with a peak at 6.
Alcohol seems to be slightly long-tailed, so I want to see what it looks like with a log transformation.
The fixed.acidity definitely has some outliers, but otherwise it has a fairly normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Most wines have a fixed acidity between 6.3 and 7.3 (the first and third quartiles). I am going to plot the data again, dropping both the highest and lowest 1% of values to remove the outliers.
With the tails removed we see a fairly normal distribution with a peak around 6.8, which matches both the median (6.8) and the mean (6.855) from the summary above.
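The trimming step can be sketched in base R with `quantile()`. This is only an illustration: `trim_tails` is a hypothetical helper of my own, and the input is the ten fixed.acidity values visible in the `str()` output, so the cut-offs here are much coarser than on the full 4,898 rows.

```r
# First ten fixed.acidity values, copied from the str() output above.
fixed_acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)

# Hypothetical helper: keep only values between the lower and upper quantiles.
trim_tails <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), names = FALSE)
  x[x >= bounds[1] & x <= bounds[2]]
}

trimmed <- trim_tails(fixed_acidity)

# Compare the summary before and after trimming.
summary(fixed_acidity)
summary(trimmed)
```

Filtering on quantiles rather than fixed thresholds means the same helper can be reused on variables with very different scales.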
We have another long-tailed distribution, so I am going to plot it again, this time with a log10 transformation.
There is an odd spike at about 0.49 that I might want to look into more later.
Even with the top and bottom 1% removed, the plot is still very long-tailed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Using a log10 transform with a bin width of 0.05 suggests a bimodal distribution. Decreasing the bin width to 0.01 shows that, while there are a lot of observations between ~4 and ~20, they are much more spread out, whereas from ~0.5 to ~2 there are more observations at each individual value. This is probably because residual sugar is recorded in discrete steps rather than on a truly continuous scale, so the steps are more apparent at the lower, more spread-out values. The summary shows a median of 5.2 and a mean of 6.4, which puts both of them between the two peaks.
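To make the bin-width comparison concrete, here is a small base-R sketch that bins log10 residual sugar at both widths without drawing anything. It uses only the ten residual.sugar values visible in the `str()` output, so it shows the mechanics rather than the shape of the full distribution; the break range of 0 to 1.5 is my own choice to cover these ten values.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
log_sugar <- log10(residual_sugar)

# Bin the log10 values at the two widths discussed above (plot = FALSE returns counts only).
coarse <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.05), plot = FALSE)
fine   <- hist(log_sugar, breaks = seq(0, 1.5, by = 0.01), plot = FALSE)

# Number of non-empty bins at each width.
sum(coarse$counts > 0)
sum(fine$counts > 0)
```

With narrower bins, values that were sharing a bin are separated, which is how the individual steps at the low end become visible.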
Here I just removed the top 3% of values to remove the long tail.
I plotted the free sulfur dioxide and total sulfur dioxide together to save room and because they are related. Note the difference in scales on both axes.
The pH plot doesn’t need any modification.
There are 4,898 samples in the dataset, each with 11 chemical variables and a resulting quality assessment. All 11 variables are continuous numeric variables, and quality is an integer score on a 0 to 10 scale, with an observed maximum of 9 and minimum of 3.
Observations:
* The most common quality is 6, and the distribution is fairly normal, slightly skewed towards the low end.
* Most of the variables are similar in distribution: most are long-tailed but otherwise fairly normal.
* There are a couple of interesting features, though: citric acid has an odd spike around 0.49, and residual sugar appears to have more of a bimodal distribution.
My main interest in this dataset is trying to determine which features have the greatest effect on the quality.
I think that alcohol, acidity, density, and pH will have the greatest impact on the quality.
I did not create any new variables.
For most of the variables I either applied a log transformation or removed the outliers to better view the data, as most of them were long-tailed.
Looking at this grid, not many of the variables appear to correlate with each other, which I find surprising as some of them seem like they should. I am going to explore some of them in more detail.
One pair of variables that looks like it has some correlation is density and residual sugar, so I am going to start with them.
Narrowing in on the main section and adding a smoothing line.
## `geom_smooth()` using method = 'gam'
We can see a general trend: as residual sugar increases, the density also increases. Let’s see both of these plotted against our output variable.
There doesn’t seem to be any direct correlation between these variables and the quality. Let’s look at some others.
Adding jitter to the alcohol plot reveals that there could possibly be a correlation with quality, but it is very weak.
Looking at these other variables shows that, individually, they have little to no relationship to the quality. I think this will change when we start combining variables in the multivariate plots.
One other interesting correlation that I want to look at is density vs alcohol.
## `geom_smooth()` using method = 'gam'
Interestingly, it appears that as the alcohol content increases the density decreases, which is the inverse of the residual sugar vs density relationship that we plotted earlier. This probably has something to do with the fact that alcohol is created from sugar, so it would follow that as the alcohol increases the sugar, and hence the density, would decrease. Ethanol is also less dense than water, which would push the density down further as the alcohol content rises.
We can see this more directly by plotting residual sugar against alcohol.
## `geom_smooth()` using method = 'gam'
We can see that there is a seemingly exponential relationship between alcohol and residual sugar.
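The direction of these three relationships can be checked numerically with `cor()`. The sketch below uses only the first five rows from the `str()` output (the name `sample5` is my own), so the magnitudes are not meaningful; only the signs are of interest, and on the full dataset the values will differ.

```r
# First five rows of the three variables, copied from the str() output above.
sample5 <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlations (the default method of cor()).
cors <- cor(sample5)

# Signs of the three pairs discussed above: +1 means they rise together, -1 means inverse.
sign(cors["residual.sugar", "density"])
sign(cors["alcohol", "density"])
sign(cors["residual.sugar", "alcohol"])
```

Because the relationship between alcohol and residual sugar looks curved rather than linear, a rank-based measure such as `cor(..., method = "spearman")` might be a better summary on the full data.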
## `geom_smooth()` using method = 'gam'
There does seem to be a slight correlation between alcohol and chlorides.
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'
There does not seem to be any correlation between our other features of interest.
I discovered some interesting relationships between density, residual sugar and alcohol. The other features appear to have very little correlation with each other or with the quality. The other relationships that I noted are the ones that were expected. For instance, pH has a mild correlation with fixed acidity, although I expected a stronger one. The same is true of total sulfur dioxide and free sulfur dioxide.
There does seem to be a mild correlation between quality and alcohol, as well as between quality and density, which are two of the features I noted in the previous section. There also might be a slight relationship between quality and chlorides.
One relationship that I found interesting is between alcohol and chlorides as well as between chlorides and quality. I wonder if this will show itself more in the multivariate exploration.
By far the strongest relationship I found was between density and residual sugar.
This is really the only plot I have tried that seems to indicate any sort of correlation between any of the variables and the quality, and it is very weak. Quality is only slightly skewed towards higher alcohol and lower density (which makes sense, since we discovered an inverse correlation between alcohol and density earlier).
Let’s see if a linear model can make any predictions.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wqw)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wqw)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## chlorides + sulphates, data = wqw)
## m6: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## chlorides + sulphates + pH, data = wqw)
## m9: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## chlorides + sulphates + pH + fixed.acidity + volatile.acidity +
## citric.acid, data = wqw)
## m11: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## chlorides + sulphates + pH + fixed.acidity + volatile.acidity +
## citric.acid + free.sulfur.dioxide + total.sulfur.dioxide,
## data = wqw)
##
## ============================================================================================================
## m1 m2 m5 m6 m9 m11
## ------------------------------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** 112.492*** 134.445*** 157.665*** 150.193***
## (0.098) (6.165) (12.783) (13.137) (18.458) (18.804)
## I(alcohol) 0.313*** 0.360*** 0.209*** 0.179*** 0.182*** 0.193***
## (0.009) (0.015) (0.019) (0.019) (0.024) (0.024)
## density 24.728*** -110.148*** -133.690*** -157.700*** -150.284***
## (6.079) (12.743) (13.159) (18.725) (19.075)
## residual.sugar 0.061*** 0.073*** 0.087*** 0.081***
## (0.005) (0.006) (0.007) (0.008)
## chlorides -1.724** -1.388* -0.134 -0.247
## (0.552) (0.552) (0.547) (0.547)
## sulphates 0.749*** 0.692*** 0.658*** 0.631***
## (0.102) (0.102) (0.100) (0.100)
## pH 0.532*** 0.714*** 0.686***
## (0.079) (0.105) (0.105)
## fixed.acidity 0.063** 0.066**
## (0.021) (0.021)
## volatile.acidity -1.930*** -1.863***
## (0.111) (0.114)
## citric.acid 0.055 0.022
## (0.096) (0.096)
## free.sulfur.dioxide 0.004***
## (0.001)
## total.sulfur.dioxide -0.000
## (0.000)
## ------------------------------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.220 0.228 0.278 0.282
## adj. R-squared 0.190 0.192 0.220 0.227 0.277 0.280
## sigma 0.797 0.796 0.782 0.779 0.753 0.751
## F 1146.395 583.290 276.676 240.191 209.335 174.344
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5744.736 -5722.182 -5556.206 -5543.740
## Deviance 3112.257 3101.773 2994.261 2966.812 2772.404 2758.329
## AIC 11684.782 11670.255 11503.472 11460.364 11134.411 11113.480
## BIC 11704.272 11696.241 11548.948 11512.336 11205.874 11197.936
## N 4898 4898 4898 4898 4898 4898
## ============================================================================================================
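As we can see, even when taking every feature into account (m11) the R-squared is still only 0.282, so the model explains only about 28% of the variance in quality. With a residual standard error (sigma) of 0.751 on an integer quality scale, predictions are typically off by about three quarters of a point, which means the data we have support only very rough predictions.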
(I had to remove some of the intermediary steps to make it fit on the page.)
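As a sanity check on how the summary rows of the table relate to each other, the sketch below recomputes sigma, adjusted R-squared, AIC and BIC for m1 and m11 from the deviance, log-likelihood and N printed above. The formulas are the standard ones for ordinary least squares; the helper names are my own, and `k` is the number of fitted coefficients including the intercept (2 for m1, 12 for m11).

```r
n <- 4898  # N from the table

# Values copied from the table above; k counts coefficients including the intercept.
m1  <- list(deviance = 3112.257, loglik = -5839.391, r2 = 0.190, k = 2)
m11 <- list(deviance = 2758.329, loglik = -5543.740, r2 = 0.282, k = 12)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
residual_sigma <- function(m) sqrt(m$deviance / (n - m$k))

# Adjusted R-squared penalises R-squared for the number of coefficients.
adj_r2 <- function(m) 1 - (1 - m$r2) * (n - 1) / (n - m$k)

# AIC and BIC count k coefficients plus one parameter for the error variance.
aic <- function(m) -2 * m$loglik + 2 * (m$k + 1)
bic <- function(m) -2 * m$loglik + log(n) * (m$k + 1)

sapply(list(m1 = m1, m11 = m11), function(m) {
  c(sigma = residual_sigma(m), adj_r2 = adj_r2(m), AIC = aic(m), BIC = bic(m))
})
```

The recomputed values agree with the table to the printed precision. Both AIC and BIC fall from m1 to m11, so the extra features do improve the fit, just not by much in absolute terms.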
All of the features that I investigated in this section show a dramatic lack of correlation. Even when combining features in different ways, there was little to no interaction.
There were a few things that I discovered earlier that were confirmed, but there wasn’t really anything new to explore.
The only interesting thing was the complete lack of interesting interactions between features.
I did create a basic linear model, and it had little predictive power (an R-squared of 0.282 with every feature included). The main limitation of the model is that none of the features are strongly correlated with the quality.
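For reference, the way nested linear models like these are built up and compared can be sketched in base R. The data below are synthetic (simulated with an arbitrary seed, with coefficients and noise level I picked for illustration), not the wine data, so none of the numbers it produces relate to the table above; `toy`, `fit1` and `fit2` are my own names.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n_toy <- 500

# Synthetic predictors drawn uniformly over ranges similar to the summary above.
toy <- data.frame(
  alcohol = runif(n_toy, 8, 14),
  density = runif(n_toy, 0.987, 1.003)
)

# Synthetic response: weak dependence on alcohol, none on density, plus noise.
toy$quality <- 2.5 + 0.3 * toy$alcohol + rnorm(n_toy, sd = 0.8)

# Nested models: the second adds one predictor to the first.
fit1 <- lm(quality ~ alcohol, data = toy)
fit2 <- lm(quality ~ alcohol + density, data = toy)

# R-squared can only stay the same or rise when a predictor is added,
# so AIC and the F-test are better guides to whether the addition helps.
r2   <- c(fit1 = summary(fit1)$r.squared, fit2 = summary(fit2)$r.squared)
aics <- c(fit1 = AIC(fit1), fit2 = AIC(fit2))
comparison <- anova(fit1, fit2)

r2
aics
comparison
```

A noisy response with a single weak driver gives a low R-squared even though the alcohol coefficient is clearly non-zero, which is the same pattern as in the models above.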