---
title: "EDA_Project"
author: "Dusty P"
date: "May 31, 2018"
output: html_document
---

```{r echo=FALSE, message=FALSE, warning=FALSE, setup}
knitr::opts_knit$set(root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project"))

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)
```

# Exploration of White Wines by Dustin Pianalto

This report explores a dataset containing chemical information and ratings on almost 4900 white wine tastings.

```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data}
# Load the Data
wqw <- read.csv('wineQualityWhites.csv')
# because the first column is just row numbers I am going to remove it
wqw <- subset(wqw, select = -X)
```

# Univariate Plots Section

```{r echo=FALSE, Data_Dimensions}
dim(wqw)
```

```{r echo=FALSE, Data_Structure}
str(wqw)
```

```{r echo=FALSE, Data_Summary}
summary(wqw)
```

The data consists of 11 numeric input variables and one integer output variable (quality), with almost 4900 observations.

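The column types reported by `str()` can also be tallied directly. The chunk below is only a small sketch: `count_column_classes` is a hypothetical helper introduced here for illustration, and the last line assumes the `wqw` data frame from the Load_the_Data chunk is available.

```{r column_classes_sketch}
# Hypothetical helper (not part of the original analysis): count how many
# columns of a data frame have each (primary) class.
count_column_classes <- function(df) {
  table(vapply(df, function(col) class(col)[1], character(1)))
}

# Only runs if the data frame from the Load_the_Data chunk exists.
if (exists("wqw")) {
  print(count_column_classes(wqw))
}
```
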
```{r echo=FALSE, quality_histogram}
ggplot(aes(x = quality), data = wqw) +
  geom_histogram(binwidth = 1)
```

The distribution of quality seems fairly normal, with a peak at 6.

```{r echo=FALSE, alcohol_histogram}
ggplot(aes(x = alcohol), data = wqw) +
  geom_histogram(binwidth = .1)
```

The alcohol distribution seems to be slightly long-tailed, so I want to see what it looks like with a log transformation.

```{r echo=FALSE, alcohol_log10_histogram}
ggplot(aes(x = alcohol), data = wqw) +
  geom_histogram(binwidth = .005) +
  scale_x_log10()
```

```{r echo=FALSE, fixed.acidity_histogram}
ggplot(aes(x = fixed.acidity), data = wqw) +
  geom_histogram(binwidth = .1)
```

The fixed.acidity definitely has some outliers, but apart from those it has a fairly normal distribution.

```{r echo=FALSE, fixed.acidity_summary}
summary(wqw$fixed.acidity)
```

Most wines have a fixed acidity between 6.3 and 7.3.
I am going to plot the data again, dropping both the lowest and highest 1% of values to remove the outliers.

```{r echo=FALSE, fixed.acidity_trimmed_histogram}
ggplot(aes(x = fixed.acidity), data = wqw) +
  geom_histogram(binwidth = .1) +
  xlim(quantile(wqw$fixed.acidity, 0.01), quantile(wqw$fixed.acidity, 0.99))
```

This gives a fairly normal distribution with a peak around 6.8, which matches both the median (6.8) and the mean (6.855) from the summary above.

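Note that `xlim()` does not only zoom the plot: ggplot2 drops the rows that fall outside the limits before binning, and warns about the removed rows. As a minimal sketch of the same 1% / 99% trimming done explicitly on the data, the chunk below defines `trim_to_quantiles`, a hypothetical helper that is not part of the original analysis, and uses it to count how many observations the cut excludes. It assumes the `wqw` data frame loaded earlier.

```{r trim_to_quantiles_sketch}
# Hypothetical helper (not part of the original analysis): keep only the
# values of x that lie between the `lower` and `upper` quantiles, inclusive.
trim_to_quantiles <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= limits[1] & x <= limits[2]]
}

# Only runs if the data frame from the Load_the_Data chunk exists.
if (exists("wqw")) {
  trimmed <- trim_to_quantiles(wqw$fixed.acidity)
  # Number of observations excluded by the 1% / 99% cut
  print(length(wqw$fixed.acidity) - length(trimmed))
}
```

Filtering the data this way keeps the excluded rows easy to inspect, whereas `xlim()` only reports how many rows were removed.
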
```{r echo=FALSE, volatile.acidity_histogram}
ggplot(aes(x = volatile.acidity), data = wqw) +
  geom_histogram(binwidth = .01)
```

We have another long-tailed distribution, so I am going to plot it again, this time with a log10 transformation.

```{r echo=FALSE, volatile.acidity_log10_histogram}
ggplot(aes(x = volatile.acidity), data = wqw) +
  geom_histogram(binwidth = .04) +
  scale_x_log10()
```

```{r echo=FALSE, citric.acid_histogram}
ggplot(aes(x = citric.acid), data = wqw) +
  geom_histogram(binwidth = .01) +
  xlim(quantile(wqw$citric.acid, 0.01), quantile(wqw$citric.acid, 0.99))
```

There is an odd spike at about 0.49. I might want to look into that more later.

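One way to look into a spike like this is to count how often each exact value occurs. The chunk below is only an illustration: `most_common_values` is a hypothetical helper, and the call at the end assumes the `wqw` data frame loaded earlier.

```{r citric_acid_common_values_sketch}
# Hypothetical helper (not part of the original analysis): return the n most
# frequent values of x together with their counts.
most_common_values <- function(x, n = 5) {
  counts <- sort(table(x), decreasing = TRUE)
  head(counts, n)
}

# Only runs if the data frame from the Load_the_Data chunk exists.
if (exists("wqw")) {
  print(most_common_values(wqw$citric.acid))
}
```
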
```{r echo=FALSE, residual.sugar_histogram}
ggplot(aes(x = residual.sugar), data = wqw) +
  geom_histogram(binwidth = .1) +
  xlim(quantile(wqw$residual.sugar, 0.01), quantile(wqw$residual.sugar, 0.99))
```

Even with the top and bottom 1% removed, the plot is still very long-tailed.

```{r echo=FALSE, residual.sugar_log10_histograms}
p1 <- ggplot(aes(x = residual.sugar), data = wqw) +
  geom_histogram(binwidth = .05) +
  scale_x_log10()
p2 <- ggplot(aes(x = residual.sugar), data = wqw) +
  geom_histogram(binwidth = .01) +
  scale_x_log10(breaks = seq(0, 20, 2))
grid.arrange(p1, p2)
```

```{r echo=FALSE, residual.sugar_summary}
summary(wqw$residual.sugar)
```

With a log10 transform and a bin width of 0.05, the distribution looks bimodal. Decreasing the bin width to 0.01 shows that, although there are a lot of observations between ~4 and ~20, they are much more spread out, while each individual value from ~0.5 to ~2 has more observations. The summary of the data shows a median of 5.2 and a mean of 6.4, which puts both of them between the two peaks.

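It may help to keep in mind that once `scale_x_log10()` is applied, `binwidth` is measured in log10 units, so each bin covers a constant ratio rather than a constant range of sugar values. The chunk below works through that arithmetic; the function name `log10_bin_ratio` is an assumption introduced for illustration.

```{r log10_binwidth_sketch}
# Hypothetical helper (not part of the original analysis): ratio between the
# upper and lower edge of a bin whose width is given in log10 units.
log10_bin_ratio <- function(binwidth) {
  10^binwidth
}

# With a bin width of 0.05, the upper edge of each bin is roughly 12% above
# its lower edge ...
log10_bin_ratio(0.05)

# ... while with a bin width of 0.01 it is only about 2.3% above, so nearby
# values are more likely to land in separate bars.
log10_bin_ratio(0.01)
```
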
```{r echo=FALSE, chlorides_histogram}
ggplot(aes(x = chlorides), data = wqw) +
  geom_histogram(binwidth = .001) +
  xlim(0, quantile(wqw$chlorides, 0.97))
```

Here I just removed the top 3% of values to remove the long tail.

```{r echo=FALSE, sulfur.dioxide_histograms}
p1 <- ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
  geom_histogram(binwidth = 1) +
  xlim(0, quantile(wqw$free.sulfur.dioxide, 0.99))
p2 <- ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
  geom_histogram(binwidth = 1) +
  xlim(quantile(wqw$total.sulfur.dioxide, 0.01), quantile(wqw$total.sulfur.dioxide, 0.99))
grid.arrange(p1, p2)
```

I plotted free sulfur dioxide and total sulfur dioxide together to save room and because they are related. Note the difference in scales on both axes.

```{r echo=FALSE, density_histogram}
ggplot(aes(x = density), data = wqw) +
  geom_histogram(binwidth = .0001) +
  xlim(quantile(wqw$density, 0.01), quantile(wqw$density, 0.99))
```

```{r echo=FALSE, pH_histogram}
ggplot(aes(x = pH), data = wqw) +
  geom_histogram(binwidth = .01)
```

The pH plot doesn't need any modification.

```{r echo=FALSE, sulphates_histogram}
ggplot(aes(x = sulphates), data = wqw) +
  geom_histogram(binwidth = .01)
```

density pH sulphates

> **Tip**: Make sure that you leave a blank line between the start / end of
each code block and the end / start of your Markdown text so that it is
formatted nicely in the knitted text. Note as well that text on consecutive
lines is treated as a single space. Make sure you have a blank line between
your paragraphs so that they too are formatted for easy readability.

# Univariate Analysis

> **Tip**: Now that you've completed your univariate explorations, it's time to
reflect on and summarize what you've found. Use the questions below to help you
gather your observations and add your own if you have other thoughts!

### What is the structure of your dataset?

### What is/are the main feature(s) of interest in your dataset?

### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?

### Did you create any new variables from existing variables in the dataset?

### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?

# Bivariate Plots Section

> **Tip**: Based on what you saw in the univariate plots, what relationships
between variables might be interesting to look at in this section? Don't limit
yourself to relationships between a main output feature and one of the
supporting variables. Try to look at relationships between supporting variables
as well.

```{r echo=FALSE, Bivariate_Plots}

```

# Bivariate Analysis

> **Tip**: As before, summarize what you found in your bivariate explorations
here. Use the questions below to guide your discussion.

### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?

### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?

### What was the strongest relationship you found?

# Multivariate Plots Section

> **Tip**: Now it's time to put everything together. Based on what you found in
the bivariate plots section, create a few multivariate plots to investigate
more complex interactions between variables. Make sure that the plots that you
create here are justified by the plots you explored in the previous section. If
you plan on creating any mathematical models, this is the section where you
will do that.

```{r echo=FALSE, Multivariate_Plots}

```

# Multivariate Analysis

### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?

### Were there any interesting or surprising interactions between features?

### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.

------

# Final Plots and Summary

> **Tip**: You've done a lot of exploration and have built up an understanding
of the structure of and relationships between the variables in your dataset.
Here, you will select three plots from all of your previous exploration to
present here as a summary of some of your most interesting findings. Make sure
that you have refined your selected plots for good titling, axis labels (with
units), and good aesthetic choices (e.g. color, transparency). After each plot,
make sure you justify why you chose each plot by describing what it shows.

### Plot One
```{r echo=FALSE, Plot_One}

```

### Description One

### Plot Two
```{r echo=FALSE, Plot_Two}

```

### Description Two

### Plot Three
```{r echo=FALSE, Plot_Three}

```

### Description Three

------

# Reflection

> **Tip**: Here's the final step! Reflect on the exploration you performed and
the insights you found. What were some of the struggles that you went through?
What went well? What was surprising? Make sure you include an insight into
future work that could be done with the dataset.

> **Tip**: Don't forget to remove this, and the other **Tip** sections before
saving your final work and knitting the final report!