Compare commits

..

10 Commits

Author SHA1 Message Date
Dusty.P
cc617850fe Final Project Second Submission 2018-07-20 12:29:34 -08:00
Dusty.P
a9bec51e55 Final Project Submitted 2018-07-19 22:54:51 -08:00
Dusty.P
e253ac1999 Final Project without summary 2018-07-18 22:57:25 -08:00
Dusty.P
f3de84d741 Final Project Part 2018-06-26 23:17:28 -08:00
Dusty.P
860c600a9d Final Project initial Univariate plots done 2018-06-02 00:50:59 -08:00
Dusty.P
4894d728bb Start Final Project 2018-05-31 23:05:13 -08:00
Dusty.P
de414a2e37 Lesson 6 complete 2018-05-31 23:04:57 -08:00
Dusty.P
9945ad26ef Lesson 6 start 2018-05-31 00:35:10 -08:00
Dusty.P
791d543af7 Lesson 5 done 2018-05-23 23:23:20 -08:00
Dusty.P
e1f08e77f1 Lesson 5 part 2018-05-22 23:18:00 -08:00
11 changed files with 105331 additions and 108 deletions

View File

@ -1,6 +1,7 @@
<component name="ProjectDictionaryState">
<dictionary name="Dusty">
<words>
<w />
<w>.</w>
<w>...elt</w>
<w>...length</w>
@ -1114,7 +1115,6 @@
<w>`.__C__C++Function`</w>
<w>`.__C__C++Object`</w>
<w>`.__C__C++OverloadedMethods`</w>
<w />
<w>`.__C__{`</w>
<w>`.__T__!:base`</w>
<w>`.__T__$:base`</w>
@ -1463,7 +1463,6 @@
<w>`mday&lt;-`</w>
<w>`minute&lt;-`</w>
<w>`month&lt;-`</w>
<w />
<w>`on_failure&lt;-`</w>
<w>`packageSlot&lt;-`</w>
<w>`polygons&lt;-`</w>

File diff suppressed because one or more lines are too long

555
EDA_Project/EDA_Project.rmd Normal file
View File

@ -0,0 +1,555 @@
---
title: "EDA_Project"
author: "Dusty P"
date: "July 19, 2018"
output: html_document
---
```{r echo=FALSE, message=FALSE, warning=FALSE, setup}
knitr::opts_knit$set(
root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project"))
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)
cuberoot_trans = function() trans_new('cuberoot',
transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
# Exploration of White Wines by Dustin Pianalto
This report explores a dataset containing chemical information and ratings on almost 4900 white wine tastings.
```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data}
# Load the Data
wqw <- read.csv('wineQualityWhites.csv')
# because the first column is just row numbers I am going to remove it
wqw <- subset(wqw, select = -X)
```
# Univariate Plots Section
```{r echo=FALSE, warning=FALSE, Data_Dimensions}
dim(wqw)
```
```{r echo=FALSE, warning=FALSE, Data_Structure}
str(wqw)
```
```{r echo=FALSE, warning=FALSE, Data_Summary}
summary(wqw)
```
Our Data consists of 11 numerical variables and one Integer attribute which is the output with almost 4900 observations
```{r echo=FALSE, warning=FALSE, quality_histogram}
ggplot(aes(x = quality), data = wqw) +
geom_histogram(binwidth = 1)
```
The distribution of the quality seems fairly normal with a peak at 6
```{r echo=FALSE, warning=FALSE, alcohol_histogram}
ggplot(aes(x = alcohol), data = wqw) +
geom_histogram(binwidth = .1)
```
The Alcohol seems to be slightly long tailed, I want to see what it is like with a log transformation.
```{r echo=FALSE, warning=FALSE, alcohol_histogram_log}
ggplot(aes(x = alcohol), data = wqw) +
geom_histogram(binwidth = .005) +
scale_x_log10(breaks = c(8, 9, 10, 11, 12, 13, 14))
```
```{r echo=FALSE, warning=FALSE, fixed.acidity_histogram}
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_histogram(binwidth = .1)
```
The fixed.acidity definately has some outliers but besides that has a pretty normal distribution.
```{r echo=FALSE, warning=FALSE, fixed.acidity_summary}
summary(wqw$fixed.acidity)
```
Most Wines have a acidity between 6.3 and 7.3
I am going to plot the data again removing both the high and low 1% of values to remove the outliers.
```{r echo=FALSE, warning=FALSE, fixed.acidity_histogram_quantile}
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_histogram(binwidth = .1) +
xlim(quantile(wqw$fixed.acidity, 0.01), quantile(wqw$fixed.acidity, 0.99))
```
And we see a fairly normal distribution with a peak around 6.8 which matches both the median (6.8) and mean (6.855) from the summary above.
```{r echo=FALSE, warning=FALSE, volatile.acidity_histogram}
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_histogram(binwidth = .01)
```
We have another long tailed distribution. I am going to plot again with a log_10 transformation this time.
```{r echo=FALSE, warning=FALSE, volatile.acidity_histogram_log}
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_histogram(binwidth = .04) +
scale_x_log10(breaks = seq(0.1, 1.0, 0.1))
```
```{r echo=FALSE, warning=FALSE, citric.acid_histogram}
ggplot(aes(x = citric.acid), data = wqw) +
geom_histogram(binwidth = .01) +
xlim(quantile(wqw$citric.acid, 0.01), quantile(wqw$citric.acid, 0.99))
```
There is an odd spike at about 0.49 I might want to look into that more later.
```{r echo=FALSE, warning=FALSE, residual.sugar_histogram}
ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .1) +
xlim(quantile(wqw$residual.sugar, 0.01), quantile(wqw$residual.sugar, 0.99))
```
Even with the top and bottom 1% removed the plot is still very long tailed
```{r echo=FALSE, warning=FALSE, residual.sugar_histogram_log}
p1 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .05) +
scale_x_log10(breaks = c(0, 1, 2, 4, 6, 8, 12, 16, 20, 40, 65))
p2 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .01) +
scale_x_log10(breaks = c(0, 1, 2, 4, 6, 8, 12, 16, 20, 40, 65))
grid.arrange(p1, p2)
```
```{r echo=FALSE, warning=FALSE, residual.sugar_summary}
summary(wqw$residual.sugar)
```
Using a log_10 transform with a bin width of .05 indicates a bimodal distribution but if you decrease the binwidth to 0.01 it shows that while there are a lot of observations between ~4 and ~20 they are a lot more spread out and there are more individual of each value from ~0.5 to ~2. This is probably is because it is harder to measure the residual sugar as a continuous scale and so the steps are more apparent at the lower, more spread out, values. And we can see in the summary of the data that the median is 5.2 and the mean is 6.4 which puts both of them inbetween the two peaks.
```{r echo=FALSE, warning=FALSE, chlorides_histogram}
ggplot(aes(x = chlorides), data = wqw) +
geom_histogram(binwidth = .001) +
xlim(0, quantile(wqw$chlorides, 0.97))
```
Here I just removed the top 3% of values to remove the long tail.
```{r echo=FALSE, warning=FALSE, sulfur.dioxide_histograms}
p1 <- ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
geom_histogram(binwidth = 1) +
xlim(0, quantile(wqw$free.sulfur.dioxide, 0.99))
p2 <- ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
geom_histogram(binwidth = 1) +
xlim(quantile(wqw$total.sulfur.dioxide, 0.01),
quantile(wqw$total.sulfur.dioxide, 0.99))
grid.arrange(p1, p2)
```
I plotted the Free Sulphur Dioxide and Total Sulphur Dioxide together to save room and because they are related. Note the difference in scales on both axies.
```{r echo=FALSE, warning=FALSE, density_histogram}
ggplot(aes(x = density), data = wqw) +
geom_histogram(binwidth = .0001) +
xlim(quantile(wqw$density, 0.01), quantile(wqw$density, 0.99))
```
```{r echo=FALSE, warning=FALSE, pH_histogram}
ggplot(aes(x = pH), data = wqw) +
geom_histogram(binwidth = .01)
```
The pH plot doesn't need any modification.
```{r echo=FALSE, warning=FALSE, sulphates_histogram}
ggplot(aes(x = sulphates), data = wqw) +
geom_histogram(binwidth = .01)
```
# Univariate Analysis
### What is the structure of your dataset?
There are 4898 samples in the dataset with 11 different variables and a resulting quality assesment. All of the variables are continuous number variables and the quality is an integer scale from 1 to 10 with max value of 9 and min of 3.
Observations:
* The most common quality is 6 and it is a fairly normal distribution slightly skewed towards the low end.
* Most of the variables are similar in distribution, most of them are long tailed but besides that have a fairly normal distribution.
* There are a couple interesting features though, the Citric Acid has an odd spike around 4.9 and the Residual Sugar appears to be more of a bimodal distribution.
### What is/are the main feature(s) of interest in your dataset?
My main interest in this dataset is trying to determine which features have the greatest effect on the quality.
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think that the Alcohol, Acidity, Density, and Ph will have the greatest impact on the quality.
### Did you create any new variables from existing variables in the dataset?
I did not create any new variables.
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
I either log transformed or removed the outliers on most of the datapoints to better view the data as most of them were longtailed.
# Bivariate Plots Section
```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Bivariate_Plots}
ggpairs(wqw, upper = list(continuous = wrap("cor", size = 1.8)),
lower = list(continuous = wrap("smooth", alpha=0.2,
color = "orange"))) +
theme_grey(base_size = 6)
```
Looking at this grid there are not very many variables that appear to correlate with each other which I find suprising as some of them seem like they should. I am going to explore some of them in more detail.
One pair of variables that look like they have some correlation are density and residual sugar so I am going to start with them.
```{r echo=FALSE, warning=FALSE, residual.sugar_vs_density}
ggplot(aes(x = residual.sugar, y = density), data = wqw) +
geom_point(color = "orange")
```
Narrowing in on the main section and adding a smoothing line.
```{r echo=FALSE, warning=FALSE, residual.sugar_vs_density_mod}
ggplot(aes(x = residual.sugar, y = density), data = wqw) +
geom_point(alpha=0.3, color = "orange") +
xlim(0, 30) +
ylim(0.987, 1.0025) +
geom_smooth(method = "gam")
```
We can see a general trend as residual sugar increases the density also increases. Lets see both of these plotted against our output variable.
```{r echo=FALSE, warning=FALSE, quality_vs_density}
ggplot(aes(x = quality, y = density), data = wqw) +
ylim(0.985, 1.005) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
color = 'red', linetype = 2) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
color = 'red', linetype = 2)
```
```{r echo=FALSE, warning=FALSE, quality_vs_residual.sugar}
ggplot(aes(y = residual.sugar, x = quality), data = wqw) +
geom_point(alpha=0.1, color = "blue") +
xlim(0, 10) +
ylim(0, 30)
```
There doesn't seem to be any direct corelation between these variables and the quality. Lets look at some others.
```{r echo=FALSE, warning=FALSE, quality_vs_alcohol}
ggplot(aes(x = quality, y = alcohol), data = wqw) +
geom_point(alpha=0.1, color = "blue")
```
```{r echo=FALSE, warning=FALSE, quality_vs_alcohol_w_summaries}
ggplot(aes(x = quality, y = alcohol), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
color = 'red', linetype = 2) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
color = 'red', linetype = 2)
```
Adding jitter to the alcohol plot reveals that there could possibly be a corelation to quality but it is very weak.
```{r echo=FALSE, warning=FALSE, quality_vs_fixed.acidity}
ggplot(aes(x = quality, y = fixed.acidity), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue")
```
```{r echo=FALSE, warning=FALSE, quality_vs_chlorides}
ggplot(aes(x = quality, y = chlorides), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue") +
ylim(0, 0.1) +
geom_line(stat = 'summary', fun.y = mean, color = "blue") +
geom_line(stat = 'summary', fun.y = median) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.1),
color = 'red', linetype = 2) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = 0.9),
color = 'red', linetype = 2)
```
```{r echo=FALSE, warning=FALSE, quality_vs_tsd}
ggplot(aes(x = quality, y = total.sulfur.dioxide), data = wqw) +
geom_point(alpha=0.1, position = position_jitter(h = 0), color = "blue")
```
Looking at these other variables shows that there is little to no relationship to the quality individually I think this will change when we start combining variables in the Multivariate plots.
One other interesting corelation that I want to look at is density vs alcohol.
```{r echo=FALSE, warning=FALSE, alcohol_vs_density}
ggplot(aes(x = alcohol, y = density), data = wqw) +
geom_point(alpha=0.1, color = "blue") +
ylim(0.985, 1.005) +
geom_smooth(method = "gam")
```
Interestingly it appears that as the aocohol content increases the density decreases, this is the inverse of the residual sugar vs density that we plotted earlier. This probably has something to do with the fact that sugar is what the alcohol is created from so it would follow that as the alcohol increases the sugar and thence the density would decrease.
We can see this more directly by plotting residual sugar against alcohol.
```{r echo=FALSE, warning=FALSE, alcohol_vs_residual_sugar}
ggplot(aes(x = alcohol, y = residual.sugar), data = wqw) +
geom_point(alpha=0.1, color = "blue") +
ylim(0, 30) +
geom_smooth(method = "gam")
```
We can see that there is a seemingly exponential relationship between alcohol and residual sugar.
```{r echo=FALSE, warning=FALSE, alcohol_vs_chlorides}
ggplot(aes(x = alcohol, y = chlorides), data = wqw) +
geom_point(alpha=0.1, color = "blue") +
ylim(0, 0.1) +
geom_smooth(method = "gam")
```
There does seem to be a slight corelation between alcohol and chlorides.
```{r echo=FALSE, warning=FALSE, alcohol_vs_ph}
ggplot(aes(x = pH, y = alcohol), data = wqw) +
geom_point(alpha=0.1, color = "blue") +
geom_smooth(method = "gam")
```
```{r echo=FALSE, warning=FALSE, alcohol_vs_fixed_acidity}
ggplot(aes(x = fixed.acidity, y = alcohol), data = wqw) +
geom_point(alpha=0.1, color = "blue")
```
```{r echo=FALSE, warning=FALSE, alcohol_vs_volatile_acidity}
ggplot(aes(x = volatile.acidity, y = alcohol), data = wqw) +
geom_point(alpha=0.1, color = "blue")
```
There does not seem to be any correlation between our other features of interest.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I discovered some interesting relationships between density, residual sugar and alcohol. The other features appear to have very little corelation to each other or to the quality. The other relationships that I noted are the ones that were expected. For instance the pH has a mild corelation to the fixed acidity although I expected a higher corelation. Same with total sulfur dioxide and free sulfur dioxide.
It does seem like there is a mild corelation between the quality and alcohol as well as quality and density which are 2 of the features I noted in the previous section. There also might be a slight relationship between quality and chlorides.
### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
One relationship that I found interesting is between alcohol and chlorides as well as between chlorides and quality. I wonder if this will show itself more in the multivariate exploration.
### What was the strongest relationship you found?
By far the strongest relationship I found was between density and residual sugar.
# Multivariate Plots Section
Since there seems to be a relationship bewteen alcohol and chlorides as well as chlorides and quality lets take a look at that relationship first.
```{r echo=FALSE, warning=FALSE, alcohol_chlorides_quality}
ggplot(aes(x = alcohol, y = chlorides), data = wqw) +
geom_point(aes(color = factor(quality))) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
I find this to be suprising. I expected at least a mild distiction in this plot but it only shows a general trend that the higher the alcohol the more likely to have a higher quality but there isn't anything here we can use to make accurate predictions.
Lets take a look at some other relationships we identified earlier.
```{r echo=FALSE, warning=FALSE, alcohol_residual.sugar_quality}
ggplot(aes(x = alcohol, y = residual.sugar), data = wqw) +
geom_point(aes(color = factor(quality))) +
ylim(0, 30) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
Again just a higher chance for a higher quality as the alcohol increases. It doesn't look like the residual sugar plays into it much at all.
```{r echo=FALSE, warning=FALSE, density_pH_quality}
ggplot(aes(x = density, y = pH), data = wqw) +
geom_point(aes(color = factor(quality))) +
xlim(0.985, 1.005) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
There is no real discinction here, possibly a slightly higher chance for high quality at a lower density. But apparently pH doesn't matter at all.
```{r echo=FALSE, warning=FALSE, free.sulfur.dioxide_fixed.acidity_quality}
ggplot(aes(x = free.sulfur.dioxide, y = fixed.acidity), data = wqw) +
geom_point(aes(color = factor(quality))) +
xlim(0, 100) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
It looks like there might be a trend towards lower fixed acidity. I wonder about a combination of fixed and volatile acidity when combined with alcohol.
```{r echo=FALSE, warning=FALSE, alcohol_fixed.volatile.acidity_quality}
ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) +
geom_point(aes(color = factor(quality))) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
Doesn't really appear to be any different than just alcohol content. There might be a slight trend towards lower acidity.
```{r echo=FALSE, warning=FALSE, alcohol_density_quality}
ggplot(aes(x = alcohol, y = density), data = wqw) +
geom_point(aes(color = factor(quality)), position = position_jitter(h = 0)) +
ylim(0.985, 1.005) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark()
```
These last two plots are really the only ones that I have tried that seems to indicate any sort of corelation between any of the variables and the quality and it is very weak. The quality is only slightly squewed towards higher alcohol and lower density or lower acidity.
Lets see if a linear model can make any predictions.
```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model}
m1 <- lm(quality ~ alcohol, data = wqw)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + sulphates)
m6 <- update(m5, ~ . + pH)
m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity)
m8 <- update(m7, ~ . + citric.acid)
m9 <- update(m8, ~ . + free.sulfur.dioxide)
m10 <- update(m9, ~ . + total.sulfur.dioxide)
mtable(m1, m2, m5, m7, m9, m10, sdigits = 3)
```
```{r echo=FALSE, warning=FALSE, Plotting_Residuals}
par(mfrow=c(2,2))
plot(m10)
par(mfrow=c(1,1))
```
Looking at the residuals plots there appears to be one outlier that could be effecting the output of the model so I am going to remove that datapoint and re-run the model.
```{r echo=FALSE, warning=FALSE, Building_the_Linear_Model_2}
wqw.new = wqw[-2782,]
m1 <- lm(quality ~ alcohol, data = wqw.new)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + chlorides)
m5 <- update(m4, ~ . + sulphates)
m6 <- update(m5, ~ . + pH)
m7 <- update(m6, ~ . + fixed.acidity + volatile.acidity)
m8 <- update(m7, ~ . + citric.acid)
m9 <- update(m8, ~ . + free.sulfur.dioxide)
m10 <- update(m9, ~ . + total.sulfur.dioxide)
mtable(m1, m2, m5, m7, m9, m10, sdigits = 3)
```
```{r echo=FALSE, warning=FALSE, Plotting_Residuals_2}
par(mfrow=c(2,2))
plot(m10)
par(mfrow=c(1,1))
```
We got a very slight increase to the model but not very much and it looks like we got rid of all the major outliers.
As we can see even when taking into account every feature and removing the outlier the R-squared is still only 0.285 which is dismal at best and indicates that we can not make any predictions based on the data that we have.
(I had to remove some of the intermediary steps to make the model fit on the page.)
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
All of the features that I investigated in this section show a dramatic lack of corelation. Even when combining features in different ways there was little to no interaction.
There were a few things that I discovered earlier that were confirmed but there wasn't really anything new to explore.
### Were there any interesting or surprising interactions between features?
The only interesting thing was the complete lack of interesting interactions between features.
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
I did create a basic model and it was not able to predict anything. The main limitation of the model is that none of the features are corelated to the quality in any meaningful way.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, warning=FALSE, fig.width=10, fig.height=10, Plot_One}
ggpairs(wqw, upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("smooth", alpha=0.2,
color = "orange"))) +
theme_grey(base_size = 6) +
ggtitle("Scatterplot Matrix") +
theme(plot.title = element_text(size=22, hjust = 0.5))
```
### Description One
This is a good summary of the data that we have and it shows how there is no direct corelation between any of the variables and the quality. You can see some moderate corelation between some of the features such as residual sugar and density. Some of these corelations are something I focused on.
### Plot Two
```{r echo=FALSE, warning=FALSE, Plot_Two}
ggplot(aes(x = alcohol, y = fixed.acidity + volatile.acidity), data = wqw) +
geom_point(aes(color = factor(quality))) +
scale_color_brewer(palette = "RdYlGn") +
theme_dark() +
labs(x = "Alcohol (%)", y = "Total Acidity (g/dm^3)", title = "Acidity vs Alcohol by Quality", color = "Quality") +
theme(plot.title = element_text(size=22, hjust = 0.5))
```
### Description Two
The only distiction I was able to discover was based on alcohol content and it is very slight at best. It does appear that a higher alcohol content increases the chance of a higher quality product but there is no clear distinction that can be seen. While the high quality products mostly have a higher alcohol content and low quality products have lower alcohol content the mid range products span the whole spectrum. Based on this it would be hard to determine the difference between a 6, 7, 8, or 9 quality based on the data provided. But you could probably tell the difference between a 4 and an 8.
### Plot Three
```{r echo=FALSE, warning=FALSE, Plot_Three}
par(mfrow=c(2,2))
plot(m10)
par(mfrow=c(1,1))
```
### Description Three
After some research it appears that the pattern shown in the Residuals vs. Fitted plot is most likely due to the fact that our dependent variable has only a few possible values. The patterns in the Scale-Location could indicate that a linear model is not the best for our data.
------
# Reflection
The Wine dataset that I used contained information from almost 5,000 wine tastings with their quality rating included. Initially I examined the data to see the shape of each of the features and then started exploring how they interact with each other. Then I compared the features against the quality to see if any of the features could help to predict the quality of the product. Finally I created a linear model to see if there was anything I missed in the data that could create predictions.
In the beginning I thought that the quality would have something to do with the alcohol, density, pH, and acidity. As I examined the data it became more and more clear that there was little to no correlation between any of the features and the quality. I found this suprising and really wanted to find any little thing that would point towards a corelation but nothing showed up. Finally when I created the linear model it was clear that you could not predict the quality of the wine from the data that we have in this dataset. We do see some small corelation between the alcohol content and the quality, it appears that the higher the alcohol content the more likely the wine will have a higher quality but there is definately not enough destinction to make any predictions.
I don't know if more datapoints could make a difference but it seems at this point that the quality of wine is subjective and is difficult if not impossible to predict. I might be able to improve the models with more manipulation of the data but other models that I have seen max out at ~70% accuracy such as PennState's STAT 897D Analysis of Wine Quality Data (https://onlinecourses.science.psu.edu/stat857/node/223/), and R-bloggers Predicting wine quality using Random Forests (https://www.r-bloggers.com/predicting-wine-quality-using-random-forests/) which use a lot more complex modeling than a basic linear model. R-bloggers use a random forest and acheived a 71.5% accuracy which is still very low for making predictions. They also had to modify the quality variable into groups where 3, 4, and 5 were considered low quality, 6 and 7 were medium, and 8 and 9 were high quality. I personally find this unacceptable as it degrades the quality of the output and artifically pushes the prediction rate higher. On a scale of 1-10 the difference between 8 and 9 can be quite substantial.

BIN
EDA_Project/EDA_Project.zip Normal file

Binary file not shown.

View File

@ -0,0 +1,173 @@
TITLE by YOUR_NAME_HERE
========================================================
> **Tip**: You will see quoted sections like this throughout the template to
help you construct your report. Make sure that you remove these notes before
you finish and submit your project!
> **Tip**: One of the requirements of this project is that your code follows
good formatting techniques, including limiting your lines to 80 characters or
less. If you're using RStudio, go into Preferences \> Code \> Display to set up
a margin line to help you keep track of this guideline!
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
```
```{r echo=FALSE, Load_the_Data}
# Load the Data
```
> **Tip**: Before you create any plots, it is a good idea to provide a short
introduction into the dataset that you are planning to explore. Replace this
quoted text with that general information!
# Univariate Plots Section
> **Tip**: In this section, you should perform some preliminary exploration of
your dataset. Run some summaries of the data and create univariate plots to
understand the structure of the individual variables in your dataset. Don't
forget to add a comment after each plot or closely-related group of plots!
There should be multiple code chunks and text sections; the first one below is
just to help you get started.
```{r echo=FALSE, Univariate_Plots}
```
> **Tip**: Make sure that you leave a blank line between the start / end of
each code block and the end / start of your Markdown text so that it is
formatted nicely in the knitted text. Note as well that text on consecutive
lines is treated as a single space. Make sure you have a blank line between
your paragraphs so that they too are formatted for easy readability.
# Univariate Analysis
> **Tip**: Now that you've completed your univariate explorations, it's time to
reflect on and summarize what you've found. Use the questions below to help you
gather your observations and add your own if you have other thoughts!
### What is the structure of your dataset?
### What is/are the main feature(s) of interest in your dataset?
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
### Did you create any new variables from existing variables in the dataset?
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
# Bivariate Plots Section
> **Tip**: Based on what you saw in the univariate plots, what relationships
between variables might be interesting to look at in this section? Don't limit
yourself to relationships between a main output feature and one of the
supporting variables. Try to look at relationships between supporting variables
as well.
```{r echo=FALSE, Bivariate_Plots}
```
# Bivariate Analysis
> **Tip**: As before, summarize what you found in your bivariate explorations
here. Use the questions below to guide your discussion.
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
### What was the strongest relationship you found?
# Multivariate Plots Section
> **Tip**: Now it's time to put everything together. Based on what you found in
the bivariate plots section, create a few multivariate plots to investigate
more complex interactions between variables. Make sure that the plots that you
create here are justified by the plots you explored in the previous section. If
you plan on creating any mathematical models, this is the section where you
will do that.
```{r echo=FALSE, Multivariate_Plots}
```
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
### Were there any interesting or surprising interactions between features?
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
------
# Final Plots and Summary
> **Tip**: You've done a lot of exploration and have built up an understanding
of the structure of and relationships between the variables in your dataset.
Here, you will select three plots from all of your previous exploration to
present here as a summary of some of your most interesting findings. Make sure
that you have refined your selected plots for good titling, axis labels (with
units), and good aesthetic choices (e.g. color, transparency). After each plot,
make sure you justify why you chose each plot by describing what it shows.
### Plot One
```{r echo=FALSE, Plot_One}
```
### Description One
### Plot Two
```{r echo=FALSE, Plot_Two}
```
### Description Two
### Plot Three
```{r echo=FALSE, Plot_Three}
```
### Description Three
------
# Reflection
> **Tip**: Here's the final step! Reflect on the exploration you performed and
the insights you found. What were some of the struggles that you went through?
What went well? What was surprising? Make sure you include an insight into
future work that could be done with the dataset.
> **Tip**: Don't forget to remove this, and the other **Tip** sections before
saving your final work and knitting the final report!

File diff suppressed because it is too large Load Diff

BIN
lesson5/lesson5_student.pdf Normal file

Binary file not shown.

View File

@ -1,3 +1,17 @@
---
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/lesson5"))
library(ggplot2)
library(dplyr)
library(reshape2)
```
Lesson 5
========================================================
@ -15,8 +29,17 @@ Notes:
Notes:
```{r Third Qualitative Variable}
ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) + geom_histogram()
pf = read.csv('pseudo_facebook.tsv', sep = '\t')
pf.fc_by_age_gender <- pf %>%
filter(!is.na(gender)) %>%
group_by(age, gender) %>%
summarize(mean_friend_count = mean(friend_count),
median_friend_count = median(friend_count),
n = n()) %>%
ungroup() %>%
arrange(age)
head(pf.fc_by_age_gender)
```
***
@ -25,7 +48,8 @@ ggplot(aes(x = gender, y = age),
Notes:
```{r Plotting Conditional Summaries}
ggplot(aes(x = age, y = median_friend_count), data = pf.fc_by_age_gender) +
geom_line(aes(color = gender))
```
***
@ -33,7 +57,7 @@ Notes:
### Thinking in Ratios
Notes:
***
What is the ratio of friends for males vs females
### Wide and Long Format
Notes:
@ -44,8 +68,12 @@ Notes:
Notes:
```{r}
install.packages('reshape2')
library(reshape2)
#install.packages('reshape2')
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender,
value.var = 'median_friend_count')
head(pf.fc_by_age_gender.wide)
```
@ -55,7 +83,9 @@ library(reshape2)
Notes:
```{r Ratio Plot}
ggplot(aes(x = age, y = female/male), data = pf.fc_by_age_gender.wide) +
geom_line() +
geom_hline(aes(yintercept = 1), alpha=0.3, linetype = 2)
```
***
@ -64,7 +94,8 @@ Notes:
Notes:
```{r Third Quantitative Variable}
pf$year_joined <- floor(2014 - pf$tenure/365)
head(pf)
```
***
@ -73,7 +104,8 @@ Notes:
Notes:
```{r Cut a Variable}
pf$year_joined.bucket = cut(pf$year_joined, c(2004, 2009, 2011, 2012, 2014))
table(pf$year_joined.bucket)
```
***
@ -82,7 +114,8 @@ Notes:
Notes:
```{r Plotting it All Together}
ggplot(aes(x = age, y = friend_count), data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color = year_joined.bucket), stat='summary', fun.y = median)
```
***
@ -91,6 +124,9 @@ Notes:
Notes:
```{r Plot the Grand Mean}
ggplot(aes(x = age, y = friend_count), data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color = year_joined.bucket), stat='summary', fun.y = mean) +
geom_line(stat = 'summary', fun.y = mean, linetype = 2)
```
@ -100,7 +136,7 @@ Notes:
Notes:
```{r Friending Rate}
with(subset(pf, tenure >= 1), summary(friend_count / tenure))
```
***
@ -109,11 +145,14 @@ Notes:
Notes:
What is the median friend rate?
0.2205
What is the maximum friend rate?
417.0
```{r Friendships Initiated}
ggplot(aes(y = friendships_initiated/tenure, x = tenure), data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean)
```
***
@ -123,29 +162,33 @@ Notes:
```{r Bias-Variance Tradeoff Revisited}
#ggplot(aes(x = tenure, y = friendships_initiated / tenure),
# data = subset(pf, tenure >= 1)) +
# geom_line(aes(color = year_joined.bucket),
# stat = 'summary',
# fun.y = mean)
#
#ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
# data = subset(pf, tenure > 0)) +
# geom_line(aes(color = year_joined.bucket),
# stat = "summary",
# fun.y = mean)
#
#ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
# data = subset(pf, tenure > 0)) +
# geom_line(aes(color = year_joined.bucket),
# stat = "summary",
# fun.y = mean)
#
#ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
# data = subset(pf, tenure > 0)) +
# geom_line(aes(color = year_joined.bucket),
# stat = "summary",
# fun.y = mean)
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
geom_smooth(aes(color = year_joined.bucket))
```
@ -165,7 +208,13 @@ Notes:
Notes:
```{r Histograms Revisited}
yo <- read.csv('yogurt.csv')
yo$id <- factor(yo$id)
head(yo)
ggplot(aes(x = price), data = yo) +
geom_histogram()
```
***
@ -174,7 +223,8 @@ Notes:
Notes:
```{r Number of Purchases}
yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)
head(yo)
```
***
@ -183,7 +233,11 @@ Notes:
Notes:
```{r Prices over Time}
ggplot(aes(x = time, y = price), data = yo) +
geom_point(alpha = 1/10)
ggplot(aes(x = time, y = price), data = yo) +
geom_point(alpha = 1/10, aes(color = all.purchases))
```
***
@ -196,7 +250,13 @@ Notes:
### Looking at Samples of Households
```{r Looking at Sample of Households}
set.seed(1)
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x = time, y = price), data = subset(yo, id %in% sample.ids)) +
facet_wrap( ~ id) +
geom_line() +
geom_point(aes(size = all.purchases), pch = 1)
```
***
@ -214,7 +274,16 @@ Notes:
### Scatterplot Matrix
Notes:
***
```{r}
#install.packages('GGally')
library(GGally)
theme_set(theme_minimal(20))
set.seed(1836)
pf_subset <- pf[, c(2:15)]
names(pf_subset)
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ])
```
### Even More Variables
Notes:
@ -230,7 +299,7 @@ colnames(nci) <- c(1:64)
```
```{r}
nci.long.samp <- melt(as.matrix(nci[1:200,]))
nci.long.samp <- melt(as.matrix(nci[1:2000,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)

99004
lesson5/pseudo_facebook.tsv Normal file

File diff suppressed because it is too large Load Diff

BIN
lesson6/BigDiamonds.Rda Normal file

Binary file not shown.

View File

@ -4,12 +4,43 @@ Lesson 6
### Welcome
Notes:
***
```{r Setup}
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
#install.packages('RColorBrewer', dependencies = TRUE)
#install.packages('bitops')
#install.packages('RCurl')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)
data(diamonds)
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
### Scatterplot Review
```{r Scatterplot Review}
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
xlim(0, quantile(diamonds$carat, 0.99)) +
ylim(0, quantile(diamonds$price, 0.99))
```
***
@ -17,15 +48,13 @@ Notes:
### Price and Carat Relationship
Response:
***
The price increases as the carat increases but it also gains more variability
### Frances Gerety
Notes:
#### A diamonds is
***
Forever
### The Rise of Diamonds
Notes:
@ -36,38 +65,33 @@ Notes:
Notes:
```{r ggpairs Function}
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
```
What are some things you notice in the ggpairs output?
Response:
***
There seems to be some Clarity and Colors that draw a higher price but besides that the size seems to have the largest correlation.
### The Demand of Diamonds
Notes:
```{r The Demand of Diamonds}
plot1 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram() +
ggtitle('Price')
plot2 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram() +
scale_x_log10() +
ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol = 2)
```
***
@ -75,7 +99,7 @@ Notes:
### Connecting Demand and Price Distributions
Notes:
***
There are 2 categories of diamond buyers that are looking for different types
### Scatterplot Transformation
@ -87,8 +111,6 @@ Notes:
### Create a new function to transform the carat variable
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
#### Use the cuberoot_trans function
@ -107,13 +129,14 @@ ggplot(aes(carat, price), data = diamonds) +
### Overplotting Revisited
```{r Sort and Head Tables}
head(sort(table(diamonds$carat), decreasing = T))
head(sort(table(diamonds$price), decreasing = T))
```
```{r Overplotting Revisited}
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
geom_point(alpha = 1/2, size = 3/4, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
@ -132,12 +155,8 @@ Notes:
Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
geom_point(alpha = 0.5, size = 1, position = 'jitter', aes(color = diamonds$clarity)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
@ -153,39 +172,18 @@ ggplot(aes(x = carat, y = price), data = diamonds) +
### Clarity and Price
Response:
***
No, I don't think it explains the difference because the clarity differential seems to parallel the price/carat relationship.
This question is worded poorly. For the answer they wanted the question should be "Do you think clarity explains some of the variance in price for a given carat?"
### Price vs. Carat and Cut
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### Cut and Price
Response:
***
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = Cut, reverse = T,
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
@ -196,10 +194,33 @@ ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
***
### Cut and Price
Response:
Some, but not much. The better cuts tend to have a higher price but some of the worse cuts have a higher price than the better cuts.
### Price vs. Carat and Color
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
```
***
### Color and Price
Response:
***
Yes, same reason as Clarity
### Linear Models in R
Notes:
@ -217,7 +238,7 @@ m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
Notice how adding cut to our model does not help explain much of the variance
@ -234,19 +255,21 @@ Research:
Response:
***
The main problem I see is the problem of supply and demand over time. As demand goes up and supply goes down the price goes up, as supply goes up or demand goes down the price goes down. Another problem is where on the supply chain the diamond is. Is it a rough cut, or polished? Is it wholesale or go through a distributor?
### A Bigger, Better Data Set
Notes:
```{r A Bigger, Better Data Set}
install.package('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
```
The code used to obtain the data is available here:
@ -273,11 +296,12 @@ thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
***
The 95% CI is very large but does encompass the correct value. The fit is relatively far off from the actual value at over a 10% difference between the actual value and predicted value.
## Final Thoughts
Notes: