diff --git a/EDA_Project/EDA_Project.html b/EDA_Project/EDA_Project.html
index ca9ab21..1567aeb 100644
--- a/EDA_Project/EDA_Project.html
+++ b/EDA_Project/EDA_Project.html
@@ -179,7 +179,7 @@ $(document).ready(function () {
The distribution of quality seems fairly normal, with a peak at 6.
Alcohol seems to be slightly long-tailed, so I want to see what it looks like with a log transformation.
-The fixed.acidity definately has some outliers but besides that has a pretty normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
@@ -189,12 +189,12 @@ $(document).ready(function () {
And we see a fairly normal distribution with a peak around 6.8 which matches both the median (6.8) and mean (6.855) from the summary above.

We have another long tailed distribution. I am going to plot again with a log_10 transformation this time.
-
+

There is an odd spike at about 0.49 that I might want to look into later.

Even with the top and bottom 1% removed, the plot is still very long-tailed.
-
+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Using a log_10 transform with a bin width of 0.05 suggests a bimodal distribution. Decreasing the bin width to 0.01 shows that while there are a lot of observations between ~4 and ~20, they are much more spread out, and there are more observations at each individual value from ~0.5 to ~2. This is probably because residual sugar is recorded in discrete steps rather than on a truly continuous scale, so the steps are more apparent at the lower values, which the log scale spreads out. The summary of the data shows that the median is 5.2 and the mean is 6.4, which puts both of them in between the two peaks.
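To illustrate the bin-width effect described above, here is a minimal base R sketch. It does not use the wine data: `sugar` is simulated, and the one-decimal rounding and the helper `count_empty_bins` are assumptions made purely for the example.

```r
# Simulated stand-in for residual.sugar (NOT the wine data): log-normal
# values rounded to one decimal, mimicking a measurement recorded in steps.
set.seed(42)
sugar <- round(rlnorm(5000, meanlog = 1.4, sdlog = 0.9), 1)
sugar <- sugar[sugar >= 0.6]
log_sugar <- log10(sugar)

# Hypothetical helper: count histogram bins of a given width that hold no data.
count_empty_bins <- function(x, width) {
  # Pad the range by one bin on each side so the breaks always span the data.
  lower <- (floor(min(x) / width) - 1) * width
  upper <- (ceiling(max(x) / width) + 1) * width
  counts <- hist(x, breaks = seq(lower, upper, by = width), plot = FALSE)$counts
  sum(counts == 0)
}

empty_wide <- count_empty_bins(log_sugar, 0.05)
empty_narrow <- count_empty_bins(log_sugar, 0.01)
print(c(wide = empty_wide, narrow = empty_narrow))
```

With the narrower bins, the gaps between the stepped low values show up as many more empty bins, which is why the low end of the log-scaled histogram looks like separate spikes.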
@@ -235,22 +235,22 @@ of the data? If so, why did you do this?
Bivariate Plots Section
-
+
Looking at this grid, not very many variables appear to correlate with each other, which I find surprising as some of them seem like they should. I am going to explore some of them in more detail.
One pair of variables that looks like it has some correlation is density and residual sugar, so I am going to start with them.

Narrowing in on the main section and adding a smoothing line.

We can see a general trend: as residual sugar increases, the density also increases. Let’s see both of these plotted against our output variable.
-
+

There doesn’t seem to be any direct correlation between these variables and the quality. Let’s look at some others.

-
+
Adding jitter to the alcohol plot reveals that there could be a correlation with quality, but it is very weak.
-
-
-
+
+
+
Looking at these other variables shows that individually they have little to no relationship to the quality. I think this will change when we start combining variables in the multivariate plots.
One other interesting correlation that I want to look at is density vs alcohol.

@@ -283,71 +283,135 @@ of the data? If so, why did you do this?
Multivariate Plots Section
-
-
-
-
-
-
-This is really the only plot that I have tried that seems to indicate any sort of corelation between any of the variables and the quality and it is very weak. The quality is only slightly squewed towards higher alcohol and lower density (which we discovered an inverse corelation between alcohol and density earlier so that should make sense).
+Since there seems to be a relationship between alcohol and chlorides as well as between chlorides and quality, let’s take a look at that relationship first.
+
+I find this surprising. I expected at least a mild distinction in this plot, but it only shows a general trend that the higher the alcohol, the more likely the wine is to have a higher quality. There isn’t anything here we can use to make accurate predictions.
+Let’s take a look at some other relationships we identified earlier.
+
+Again, there is just a higher chance of a higher quality as the alcohol increases. It doesn’t look like the residual sugar plays into it much at all.
+
+There is no real distinction here, though possibly a slightly higher chance of high quality at a lower density. pH apparently doesn’t matter at all.
+
+It looks like there might be a trend towards lower fixed acidity. I wonder about a combination of fixed and volatile acidity when combined with alcohol.
+
+This doesn’t really appear to be any different from alcohol content alone. There might be a slight trend towards lower acidity.
+
+These last two plots are really the only ones I have tried that seem to indicate any sort of correlation between any of the variables and the quality, and it is very weak. The quality is only slightly skewed towards higher alcohol and lower density or lower acidity.
Let’s see if a linear model can make any predictions.
##
## Calls:
-## m1: lm(formula = I(quality) ~ I(alcohol), data = wqw)
-## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wqw)
-## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
-## chlorides + sulphates, data = wqw)
-## m6: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
-## chlorides + sulphates + pH, data = wqw)
-## m9: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
-## chlorides + sulphates + pH + fixed.acidity + volatile.acidity +
-## citric.acid, data = wqw)
-## m11: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
-## chlorides + sulphates + pH + fixed.acidity + volatile.acidity +
-## citric.acid + free.sulfur.dioxide + total.sulfur.dioxide,
-## data = wqw)
+## m1: lm(formula = quality ~ alcohol, data = wqw)
+## m2: lm(formula = quality ~ alcohol + density, data = wqw)
+## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates, data = wqw)
+## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity, data = wqw)
+## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity + citric.acid +
+## free.sulfur.dioxide, data = wqw)
+## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity + citric.acid +
+## free.sulfur.dioxide + total.sulfur.dioxide, data = wqw)
##
 ## ============================================================================================================
-##                       m1            m2            m5            m6            m9            m11
+##                       m1            m2            m5            m7            m9            m10
 ## ------------------------------------------------------------------------------------------------------------
-## (Intercept)           2.582***      -22.492***    112.492***    134.445***    157.665***    150.193***
-##                       (0.098)       (6.165)       (12.783)      (13.137)      (18.458)      (18.804)
-## I(alcohol)            0.313***      0.360***      0.209***      0.179***      0.182***      0.193***
-##                       (0.009)       (0.015)       (0.019)       (0.019)       (0.024)       (0.024)
-## density                             24.728***     -110.148***   -133.690***   -157.700***   -150.284***
-##                                     (6.079)       (12.743)      (13.159)      (18.725)      (19.075)
-## residual.sugar                                    0.061***      0.073***      0.087***      0.081***
-##                                                   (0.005)       (0.006)       (0.007)       (0.008)
-## chlorides                                         -1.724**      -1.388*       -0.134        -0.247
-##                                                   (0.552)       (0.552)       (0.547)       (0.547)
-## sulphates                                         0.749***      0.692***      0.658***      0.631***
-##                                                   (0.102)       (0.102)       (0.100)       (0.100)
-## pH                                                              0.532***      0.714***      0.686***
-##                                                                 (0.079)       (0.105)       (0.105)
-## fixed.acidity                                                                 0.063**       0.066**
-##                                                                               (0.021)       (0.021)
-## volatile.acidity                                                              -1.930***     -1.863***
-##                                                                               (0.111)       (0.114)
-## citric.acid                                                                   0.055         0.022
+## (Intercept)           2.582***      -22.492***    112.492***    156.891***    152.979***    150.193***
+##                       (0.098)       (6.165)       (12.783)      (18.407)      (18.439)      (18.804)
+## alcohol               0.313***      0.360***      0.209***      0.183***      0.193***      0.193***
+##                       (0.009)       (0.015)       (0.019)       (0.024)       (0.024)       (0.024)
+## density                             24.728***     -110.148***   -156.909***   -153.111***   -150.284***
+##                                     (6.079)       (12.743)      (18.673)      (18.704)      (19.075)
+## residual.sugar                                    0.061***      0.087***      0.082***      0.081***
+##                                                   (0.005)       (0.007)       (0.007)       (0.008)
+## chlorides                                         -1.724**      -0.099        -0.251        -0.247
+##                                                   (0.552)       (0.544)       (0.546)       (0.547)
+## sulphates                                         0.749***      0.661***      0.626***      0.631***
+##                                                   (0.102)       (0.100)       (0.100)       (0.100)
+## pH                                                              0.709***      0.688***      0.686***
+##                                                                 (0.105)       (0.105)       (0.105)
+## fixed.acidity                                                   0.065**       0.066**       0.066**
+##                                                                 (0.021)       (0.021)       (0.021)
+## volatile.acidity                                                -1.942***     -1.880***     -1.863***
+##                                                                 (0.110)       (0.112)       (0.114)
+## citric.acid                                                                   0.019         0.022
 ##                                                                               (0.096)       (0.096)
-## free.sulfur.dioxide                                                                         0.004***
-##                                                                                             (0.001)
+## free.sulfur.dioxide                                                           0.003***      0.004***
+##                                                                               (0.001)       (0.001)
 ## total.sulfur.dioxide                                                                        -0.000
 ##                                                                                             (0.000)
 ## ------------------------------------------------------------------------------------------------------------
-## R-squared             0.190         0.192         0.220         0.228         0.278         0.282
-## adj. R-squared        0.190         0.192         0.220         0.227         0.277         0.280
-## sigma                 0.797         0.796         0.782         0.779         0.753         0.751
-## F                     1146.395      583.290       276.676       240.191       209.335       174.344
+## R-squared             0.190         0.192         0.220         0.278         0.282         0.282
+## adj. R-squared        0.190         0.192         0.220         0.277         0.280         0.280
+## sigma                 0.797         0.796         0.782         0.753         0.751         0.751
+## F                     1146.395      583.290       276.676       235.493       191.738       174.344
 ## p                     0.000         0.000         0.000         0.000         0.000         0.000
-## Log-likelihood        -5839.391     -5831.127     -5744.736     -5722.182     -5556.206     -5543.740
-## Deviance              3112.257      3101.773      2994.261      2966.812      2772.404      2758.329
-## AIC                   11684.782     11670.255     11503.472     11460.364     11134.411     11113.480
-## BIC                   11704.272     11696.241     11548.948     11512.336     11205.874     11197.936
+## Log-likelihood        -5839.391     -5831.127     -5744.736     -5556.370     -5544.026     -5543.740
+## Deviance              3112.257      3101.773      2994.261      2772.590      2758.651      2758.329
+## AIC                   11684.782     11670.255     11503.472     11132.740     11112.053     11113.480
+## BIC                   11704.272     11696.241     11548.948     11197.706     11190.012     11197.936
 ## N                     4898          4898          4898          4898          4898          4898
 ## ============================================================================================================
-As we can see even when taking into account every feature the R-squared is still only 0.282 which is dismal at best and indicates that we can not make any predictions based on the data that we have.
-(I had to remove some of the intermediary steps to make it fit on the page.)
+
+Looking at the residual plots, there appears to be one outlier that could be affecting the output of the model, so I am going to remove that data point and re-run the model.
+##
+## Calls:
+## m1: lm(formula = quality ~ alcohol, data = wqw.new)
+## m2: lm(formula = quality ~ alcohol + density, data = wqw.new)
+## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates, data = wqw.new)
+## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity, data = wqw.new)
+## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity + citric.acid +
+## free.sulfur.dioxide, data = wqw.new)
+## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
+## sulphates + pH + fixed.acidity + volatile.acidity + citric.acid +
+## free.sulfur.dioxide + total.sulfur.dioxide, data = wqw.new)
+##
+## ============================================================================================================
+##                       m1            m2            m5            m7            m9            m10
+## ------------------------------------------------------------------------------------------------------------
+## (Intercept)           2.582***      -27.042***    121.165***    212.207***    211.030***    211.424***
+##                       (0.098)       (6.608)       (13.972)      (21.975)      (21.978)      (22.737)
+## alcohol               0.314***      0.369***      0.196***      0.112***      0.119***      0.119***
+##                       (0.009)       (0.015)       (0.020)       (0.029)       (0.029)       (0.029)
+## density                             29.213***     -118.763***   -212.821***   -211.794***   -212.194***
+##                                     (6.516)       (13.921)      (22.268)      (22.271)      (23.040)
+## residual.sugar                                    0.064***      0.105***      0.101***      0.101***
+##                                                   (0.005)       (0.008)       (0.008)       (0.009)
+## chlorides                                         -1.736**      0.073         -0.080        -0.080
+##                                                   (0.552)       (0.544)       (0.546)       (0.546)
+## sulphates                                         0.762***      0.734***      0.702***      0.702***
+##                                                   (0.102)       (0.101)       (0.101)       (0.101)
+## pH                                                              0.882***      0.869***      0.869***
+##                                                                 (0.111)       (0.112)       (0.112)
+## fixed.acidity                                                   0.107***      0.111***      0.111***
+##                                                                 (0.023)       (0.023)       (0.023)
+## volatile.acidity                                                -1.939***     -1.874***     -1.875***
+##                                                                 (0.109)       (0.111)       (0.114)
+## citric.acid                                                                   0.025         0.025
+##                                                                               (0.095)       (0.096)
+## free.sulfur.dioxide                                                           0.003***      0.003***
+##                                                                               (0.001)       (0.001)
+## total.sulfur.dioxide                                                                        0.000
+##                                                                                             (0.000)
+## ------------------------------------------------------------------------------------------------------------
+## R-squared             0.190         0.193         0.221         0.281         0.285         0.285
+## adj. R-squared        0.190         0.193         0.220         0.280         0.284         0.284
+## sigma                 0.797         0.796         0.782         0.752         0.750         0.750
+## F                     1146.259      585.416       277.220       239.085       194.942       177.184
+## p                     0.000         0.000         0.000         0.000         0.000         0.000
+## Log-likelihood        -5838.650     -5828.614     -5742.882     -5545.220     -5531.742     -5531.740
+## Deviance              3112.194      3099.464      2992.817      2760.709      2745.554      2745.551
+## AIC                   11683.300     11665.228     11499.764     11110.440     11087.484     11089.480
+## BIC                   11702.789     11691.214     11545.238     11175.404     11165.441     11173.933
+## N                     4897          4897          4897          4897          4897          4897
+## ============================================================================================================
+
+The fit improved only very slightly (R-squared rose from 0.282 to 0.285), and it looks like we got rid of all the major outliers.
+As we can see, even when taking into account every feature and removing the outlier, the R-squared is still only 0.285, which is dismal at best and indicates that we cannot make useful predictions based on the data that we have.
+(I had to remove some of the intermediate steps to make the model fit on the page.)
Multivariate Analysis
@@ -373,7 +437,7 @@ strengths and limitations of your model.
Final Plots and Summary
Plot One
-
+
Description One
@@ -381,19 +445,19 @@ strengths and limitations of your model.
Plot Two
-
+
Description Two
-Alcohol content is the closest that any of the features came to corelating with the quality and here you can see that even that corelation is very weak. The only thing we can tell is that more higher quality wines had a higher alcohol content than lower quality, but the spread on the data makes this a very weak corelation at 0.436.
+The only distinction I was able to discover was based on alcohol content, and it is very slight at best. It does appear that a higher alcohol content increases the chance of a higher quality product, but there is no clear separation. While the high quality products mostly have a higher alcohol content and the low quality products a lower one, the mid-range products span the whole spectrum. Based on this, it would be hard to tell the difference between a 6, 7, 8, or 9 quality from the data provided, but you could probably tell the difference between a 4 and an 8.
Plot Three
-
+
Description Three
-I include this plot just to show how there is no clear distinction in the quality when compared to the features of the data. This is representative of all of the plots I made in the multivariate section.
+After some research, it appears that the pattern shown in the Residuals vs. Fitted plot is most likely due to the fact that our dependent variable has only a few possible values. The patterns in the Scale-Location plot could indicate that a linear model is not the best fit for our data.
diff --git a/EDA_Project/EDA_Project.rmd b/EDA_Project/EDA_Project.rmd
index 5182925..9aace5e 100644
--- a/EDA_Project/EDA_Project.rmd
+++ b/EDA_Project/EDA_Project.rmd
@@ -6,7 +6,8 @@ output: html_document
---
```{r echo=FALSE, message=FALSE, warning=FALSE, setup}
-knitr::opts_knit$set(root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project"))
+knitr::opts_knit$set(
+ root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project"))
# load the ggplot graphics package and the others
library(ggplot2)
@@ -18,7 +19,8 @@ library(RColorBrewer)
library(bitops)
library(RCurl)
-cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
+cuberoot_trans = function() trans_new('cuberoot',
+ transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
@@ -66,7 +68,7 @@ The Alcohol seems to be slightly long tailed, I want to see what it is like with
```{r echo=FALSE, warning=FALSE, alcohol_histogram_log}
ggplot(aes(x = alcohol), data = wqw) +
geom_histogram(binwidth = .005) +
- scale_x_log10()
+ scale_x_log10(breaks = c(8, 9, 10, 11, 12, 13, 14))
```
```{r echo=FALSE, warning=FALSE, fixed.acidity_histogram}
@@ -101,7 +103,7 @@ We have another long tailed distribution. I am going to plot again with a log_10
```{r echo=FALSE, warning=FALSE, volatile.acidity_histogram_log}
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_histogram(binwidth = .04) +
- scale_x_log10()
+ scale_x_log10(breaks = seq(0.1, 1.0, 0.1))
```
```{r echo=FALSE, warning=FALSE, citric.acid_histogram}
@@ -123,10 +125,10 @@ Even with the top and bottom 1% removed the plot is still very long tailed
```{r echo=FALSE, warning=FALSE, residual.sugar_histogram_log}
p1 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .05) +
- scale_x_log10()
+ scale_x_log10(breaks = c(1, 2, 4, 6, 8, 12, 16, 20, 40, 65))
p2 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .01) +
- scale_x_log10(breaks = seq(0, 20, 2))
+ scale_x_log10(breaks = c(1, 2, 4, 6, 8, 12, 16, 20, 40, 65))
grid.arrange(p1, p2)
```
@@ -150,7 +152,8 @@ p1 <- ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
xlim(0, quantile(wqw$free.sulfur.dioxide, 0.99))
p2 <- ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
geom_histogram(binwidth = 1) +
- xlim(quantile(wqw$total.sulfur.dioxide, 0.01), quantile(wqw$total.sulfur.dioxide, 0.99))
+ xlim(quantile(wqw$total.sulfur.dioxide, 0.01),
+ quantile(wqw$total.sulfur.dioxide, 0.99))
grid.arrange(p1, p2)
```
@@ -207,8 +210,10 @@ I either log transformed or removed the outliers on most of the datapoints to be
# Bivariate Plots Section
-```{r echo=FALSE, warning=FALSE, Bivariate_Plots}
-ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)), lower = list(continuous = wrap("smooth", alpha=0.2, color = "orange"))) +
+```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Bivariate_Plots}
+ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)),
+ lower = list(continuous = wrap("smooth", alpha=0.2,
+ color = "orange"))) +
theme_grey(base_size = 6)
```
@@ -239,8 +244,10 @@ ggplot(aes(x = quality, y = density), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2)
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
+ color = 'red', linetype = 2) +
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
+ color = 'red', linetype = 2)
```
```{r echo=FALSE, warning=FALSE, quality_vs_residual.sugar}
@@ -262,8 +269,10 @@ ggplot(aes(x = quality, y = alcohol), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2)
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
+ color = 'red', linetype = 2) +
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
+ color = 'red', linetype = 2)
```
Adding jitter to the alcohol plot reveals that there could be a correlation with quality, but it is very weak.
@@ -279,8 +288,10 @@ ggplot(aes(x = quality, y = chlorides), data = wqw) +
ylim(0, 0.1) +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2)
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
+ color = 'red', linetype = 2) +
+ geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
+ color = 'red', linetype = 2)
```
```{r echo=FALSE, warning=FALSE, quality_vs_tsd}
@@ -358,62 +369,118 @@ By far the strongest relationship I found was between density and residual sugar
# Multivariate Plots Section
+Since there seems to be a relationship between alcohol and chlorides as well as between chlorides and quality, let's take a look at that relationship first.
+
```{r echo=FALSE, warning=FALSE, alcohol_chlorides_quality}
ggplot(aes(x = alcohol, y = chlorides), data = wqw) +
- geom_point(aes(color = quality))
+ geom_point(aes(color = factor(quality))) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
+I find this surprising. I expected at least a mild distinction in this plot, but it only shows a general trend that the higher the alcohol, the more likely the wine is to have a higher quality. There isn't anything here we can use to make accurate predictions.
+
+Let's take a look at some other relationships we identified earlier.
+
```{r echo=FALSE, warning=FALSE, alcohol_residual.sugar_quality}
ggplot(aes(x = alcohol, y = residual.sugar), data = wqw) +
- geom_point(aes(color = quality)) +
- ylim(0, 30)
+ geom_point(aes(color = factor(quality))) +
+ ylim(0, 30) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
+Again, there is just a higher chance of a higher quality as the alcohol increases. It doesn't look like the residual sugar plays into it much at all.
+
```{r echo=FALSE, warning=FALSE, density_pH_quality}
ggplot(aes(x = density, y = pH), data = wqw) +
- geom_point(aes(color = quality)) +
- xlim(0.985, 1.005)
+ geom_point(aes(color = factor(quality))) +
+ xlim(0.985, 1.005) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
-```{r echo=FALSE, warning=FALSE, free.sulfur.dioxide_pH_quality}
-ggplot(aes(x = free.sulfur.dioxide, y = pH), data = wqw) +
- geom_point(aes(color = quality)) +
- xlim(0, 100)
+There is no real distinction here, though possibly a slightly higher chance of high quality at a lower density. pH apparently doesn't matter at all.
+
+```{r echo=FALSE, warning=FALSE, free.sulfur.dioxide_fixed.acidity_quality}
+ggplot(aes(x = free.sulfur.dioxide, y = fixed.acidity), data = wqw) +
+ geom_point(aes(color = factor(quality))) +
+ xlim(0, 100) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
-```{r echo=FALSE, warning=FALSE, alcohol_pH_quality}
-ggplot(aes(x = alcohol, y = pH), data = wqw) +
- geom_point(aes(color = quality))
+It looks like there might be a trend towards lower fixed acidity. I wonder about a combination of fixed and volatile acidity when combined with alcohol.
+
+```{r echo=FALSE, warning=FALSE, alcohol_fixed.volatile.acidity_quality}
+ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) +
+ geom_point(aes(color = factor(quality))) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
+This doesn't really appear to be any different from alcohol content alone. There might be a slight trend towards lower acidity.
+
```{r echo=FALSE, warning=FALSE, alcohol_density_quality}
ggplot(aes(x = alcohol, y = density), data = wqw) +
- geom_point(aes(color = quality), position = position_jitter(h = 0)) +
- ylim(0.985, 1.005)
+ geom_point(aes(color = factor(quality)), position = position_jitter(h = 0)) +
+ ylim(0.985, 1.005) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark()
```
-This is really the only plot that I have tried that seems to indicate any sort of corelation between any of the variables and the quality and it is very weak. The quality is only slightly squewed towards higher alcohol and lower density (which we discovered an inverse corelation between alcohol and density earlier so that should make sense).
+These last two plots are really the only ones I have tried that seem to indicate any sort of correlation between any of the variables and the quality, and it is very weak. The quality is only slightly skewed towards higher alcohol and lower density or lower acidity.
Let's see if a linear model can make any predictions.
```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model}
-m1 <- lm(I(quality) ~ I(alcohol), data = wqw)
+m1 <- lm(quality ~ alcohol, data = wqw)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + sulphates)
m6 <- update(m5, ~ . + pH)
-m7 <- update(m6, ~ . + fixed.acidity)
-m8 <- update(m7, ~ . + volatile.acidity)
-m9 <- update(m8, ~ . + citric.acid)
-m10 <- update(m9, ~ . + free.sulfur.dioxide)
-m11 <- update(m10, ~ . + total.sulfur.dioxide)
-mtable(m1, m2, m5, m6, m9, m11, sdigits = 3)
+m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity)
+m8 <- update(m7, ~ . + citric.acid)
+m9 <- update(m8, ~ . + free.sulfur.dioxide)
+m10 <- update(m9, ~ . + total.sulfur.dioxide)
+mtable(m1, m2, m5, m7, m9, m10, sdigits = 3)
```
-As we can see even when taking into account every feature the R-squared is still only 0.282 which is dismal at best and indicates that we can not make any predictions based on the data that we have.
+```{r echo=FALSE, warning=FALSE, Plotting_Residuals}
+par(mfrow=c(2,2))
+plot(m10)
+par(mfrow=c(1,1))
+```
-(I had to remove some of the intermediary steps to make it fit on the page.)
+Looking at the residual plots, there appears to be one outlier that could be affecting the output of the model, so I am going to remove that data point and re-run the model.
+
+```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model_2}
+# Drop row 2782, the outlier flagged in the residual plots
+wqw.new <- wqw[-2782, ]
+m1 <- lm(quality ~ alcohol, data = wqw.new)
+m2 <- update(m1, ~ . + density)
+m3 <- update(m2, ~ . + residual.sugar)
+m4 <- update(m3, ~ . + chlorides)
+m5 <- update(m4, ~ . + sulphates)
+m6 <- update(m5, ~ . + pH)
+m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity)
+m8 <- update(m7, ~ . + citric.acid)
+m9 <- update(m8, ~ . + free.sulfur.dioxide)
+m10 <- update(m9, ~ . + total.sulfur.dioxide)
+mtable(m1, m2, m5, m7, m9, m10, sdigits = 3)
+```
+
+```{r echo=FALSE, warning=FALSE, Plotting_Residuals_2}
+par(mfrow=c(2,2))
+plot(m10)
+par(mfrow=c(1,1))
+```
+
+The fit improved only very slightly (R-squared rose from 0.282 to 0.285), and it looks like we got rid of all the major outliers.
+
+As we can see, even when taking into account every feature and removing the outlier, the R-squared is still only 0.285, which is dismal at best and indicates that we cannot make useful predictions based on the data that we have.
+
+(I had to remove some of the intermediate steps to make the model fit on the page.)
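
The outlier above is removed by its hard-coded row number. As a hedged alternative sketch, an influential observation can be located programmatically with Cook's distance via base R's `cooks.distance()`. The data frame `toy`, its coefficients, and the planted outlier in row 50 are all made up for illustration; this is not the method used in the analysis above.

```r
# Simulated data (NOT the wine data): a weak linear relationship plus noise.
set.seed(1)
n <- 200
toy <- data.frame(alcohol = runif(n, 8, 14))
toy$quality <- 2.5 + 0.3 * toy$alcohol + rnorm(n, sd = 0.8)

# Plant one extreme, high-leverage observation (hypothetical, for illustration).
toy$alcohol[50] <- 30
toy$quality[50] <- 3

fit_all <- lm(quality ~ alcohol, data = toy)

# Row with the largest Cook's distance, i.e. the most influential observation.
worst <- unname(which.max(cooks.distance(fit_all)))

# Refit without that row and compare the fit.
fit_trim <- lm(quality ~ alcohol, data = toy[-worst, ])
print(worst)
print(c(all = summary(fit_all)$r.squared, trimmed = summary(fit_trim)$r.squared))
```

Looking the row up this way keeps the removal tied to a stated criterion, so it still points at the right observation if the data are reordered or filtered earlier in the script.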
# Multivariate Analysis
@@ -439,8 +506,10 @@ I did create a basic model and it was not able to predict anything. The main lim
# Final Plots and Summary
### Plot One
-```{r echo=FALSE, Plot_One}
-ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)), lower = list(continuous = wrap("smooth", alpha=0.2, color = "orange"))) +
+```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Plot_One}
+ggpairs(wqw, upper = list(continuous = wrap("cor", size = 3)),
+ lower = list(continuous = wrap("smooth", alpha=0.2,
+ color = "orange"))) +
theme_grey(base_size = 6) +
ggtitle("Scatterplot Matrix") +
theme(plot.title = element_text(size=22, hjust = 0.5))
@@ -452,34 +521,28 @@ This is a good summary of the data that we have and it shows how there is no dir
### Plot Two
```{r echo=FALSE, warning=FALSE, Plot_Two}
-ggplot(aes(x = quality, y = alcohol), data = wqw) +
- geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
- geom_line(stat = 'summary', fun.y = mean, color = "blue") +
- geom_line(stat = 'summary', fun.y = median, color = "black") +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1), color = 'red', linetype = 2) +
- geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9), color = 'red', linetype = 2) +
- ggtitle("Alcohol vs Quality") +
- xlab("Quality") +
- ylab("Alcohol") +
+ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) +
+ geom_point(aes(color = factor(quality))) +
+ scale_color_brewer(palette = "RdYlGn") +
+ theme_dark() +
+ labs(x = "Alcohol (%)", y = "Total Acidity (g/dm^3)", title = "Acidity vs Alcohol by Quality", color = "Quality") +
theme(plot.title = element_text(size=22, hjust = 0.5))
```
### Description Two
-Alcohol content is the closest that any of the features came to corelating with the quality and here you can see that even that corelation is very weak. The only thing we can tell is that more higher quality wines had a higher alcohol content than lower quality, but the spread on the data makes this a very weak corelation at 0.436.
+The only distinction I was able to discover was based on alcohol content, and it is very slight at best. It does appear that a higher alcohol content increases the chance of a higher quality product, but there is no clear separation. While the high quality products mostly have a higher alcohol content and the low quality products a lower one, the mid-range products span the whole spectrum. Based on this, it would be hard to tell the difference between a 6, 7, 8, or 9 quality from the data provided, but you could probably tell the difference between a 4 and an 8.
### Plot Three
```{r echo=FALSE, warning=FALSE, Plot_Three}
-ggplot(aes(x = density, y = alcohol), data = wqw) +
- geom_point(aes(color = quality)) +
- xlim(0.985, 1.005) +
- labs(x = "Density", y = "Alcohol", title = "Alcohol vs Density by Quality", color = "Quality") +
- theme(plot.title = element_text(size=22, hjust = 0.5))
+par(mfrow=c(2,2))
+plot(m10)
+par(mfrow=c(1,1))
```
### Description Three
-I include this plot just to show how there is no clear distinction in the quality when compared to the features of the data. This is representative of all of the plots I made in the multivariate section.
+After some research, it appears that the pattern shown in the Residuals vs. Fitted plot is most likely due to the fact that our dependent variable has only a few possible values. The patterns in the Scale-Location plot could indicate that a linear model is not the best fit for our data.
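
The banding attributed above to a dependent variable with only a few possible values can be reproduced on simulated data. This sketch assumes a made-up integer `quality` score and a hypothetical predictor `x` (not the wine data). It shows that each residual plus its fitted value is exactly one of the score levels, so the points in a Residuals vs. Fitted plot fall on parallel lines, one per level.

```r
# Simulated data (NOT the wine data): an integer score driven by one predictor.
set.seed(7)
n <- 500
x <- runif(n, 8, 14)                          # hypothetical predictor
latent <- 2.5 + 0.3 * x + rnorm(n, sd = 0.8)
quality <- pmin(pmax(round(latent), 3), 9)    # integer score clipped to 3..9

fit <- lm(quality ~ x)

# residual = observed - fitted, so residual + fitted recovers the observed
# integer level; each level therefore traces a line of slope -1 in the plot.
levels_seen <- sort(unique(round(resid(fit) + fitted(fit), 8)))
print(levels_seen)

# Uncomment to see the bands:
# plot(fitted(fit), resid(fit))
```

The stripes are a consequence of the discrete response rather than evidence of a coding mistake, although they do suggest that a model designed for ordered categories might suit such data better than ordinary least squares.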
------
diff --git a/EDA_Project/EDA_Project.zip b/EDA_Project/EDA_Project.zip
new file mode 100644
index 0000000..e65ba27
Binary files /dev/null and b/EDA_Project/EDA_Project.zip differ