udacity_eda/EDA_Project/EDA_Project.rmd


---
title: "EDA_Project"
author: "Dusty P"
date: "May 31, 2018"
output: html_document
---
```{r echo=FALSE, message=FALSE, warning=FALSE, setup}
knitr::opts_knit$set(root.dir = normalizePath("C:/Users/Dusty/Documents/coding/projects/Udacity/Data Analysis/eda/EDA_Project"))
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
library(bitops)
library(RCurl)
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
```
# Exploration of White Wines by Dustin Pianalto
This report explores a dataset containing chemical information and ratings on almost 4900 white wine tastings.
```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data}
# Load the Data
wqw <- read.csv('wineQualityWhites.csv')
# because the first column is just row numbers I am going to remove it
wqw <- subset(wqw, select = -X)
```
# Univariate Plots Section
```{r echo=FALSE, Data_Dimensions}
dim(wqw)
```
```{r echo=FALSE, Data_Structure}
str(wqw)
```
```{r echo=FALSE, Data_Summary}
summary(wqw)
```
Our data consists of almost 4900 observations of 12 variables: 11 numeric variables and one integer variable, quality, which is the output.
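As a quick check on that count, the chunk below tallies the columns by storage class. It is only a sketch: `count_column_classes` is a helper name introduced here for illustration, and the comparison runs only when `wqw` has already been loaded.

```{r echo=FALSE, Column_Class_Counts}
# Hypothetical helper (not part of the original analysis): tally the columns
# of a data frame by their first class.
count_column_classes <- function(df) {
  table(vapply(df, function(col) class(col)[1], character(1)))
}
# Only run on the wine data when the loading chunk has created `wqw`.
if (exists("wqw")) {
  print(count_column_classes(wqw))
}
```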
```{r echo=FALSE, quality_histogram}
ggplot(aes(x = quality), data = wqw) +
geom_histogram(binwidth = 1)
```
The distribution of quality seems fairly normal, with a peak at 6.
```{r echo=FALSE, alcohol_histogram}
ggplot(aes(x = alcohol), data = wqw) +
geom_histogram(binwidth = .1)
```
The alcohol distribution seems to be slightly long-tailed, so I want to see what it looks like with a log transformation.
```{r echo=FALSE, alcohol_histogram_log10}
ggplot(aes(x = alcohol), data = wqw) +
geom_histogram(binwidth = .005) +
scale_x_log10()
```
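One way to check whether the log transformation actually shortens the tail is to compare the sample skewness before and after it. The chunk below is a minimal sketch under that assumption; `sample_skewness` is a helper introduced here for illustration, not something from the original analysis.

```{r echo=FALSE, alcohol_skewness}
# Hypothetical helper: moment-based sample skewness. Values near 0 indicate a
# roughly symmetric distribution; positive values indicate a longer right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}
# Only run the comparison when the loading chunk has created `wqw`.
if (exists("wqw")) {
  print(c(raw = sample_skewness(wqw$alcohol),
          log10 = sample_skewness(log10(wqw$alcohol))))
}
```

If the log10 value is not noticeably closer to 0 than the raw value, the transformation is probably not buying much for this variable.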
```{r echo=FALSE, fixed.acidity_histogram}
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_histogram(binwidth = .1)
```
The fixed.acidity definitely has some outliers, but apart from those it has a fairly normal distribution.
```{r echo=FALSE, fixed.acidity_summary}
summary(wqw$fixed.acidity)
```
Half of the wines have a fixed acidity between 6.3 and 7.3, the first and third quartiles in the summary.
I am going to plot the data again with both the lowest and highest 1% of values removed to drop the outliers.
```{r echo=FALSE, fixed.acidity_histogram_trimmed}
ggplot(aes(x = fixed.acidity), data = wqw) +
geom_histogram(binwidth = .1) +
xlim(quantile(wqw$fixed.acidity, 0.01), quantile(wqw$fixed.acidity, 0.99))
```
And we see a fairly normal distribution with a peak around 6.8, which matches both the median (6.8) and mean (6.855) from the summary above.
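Because `xlim()` drops every row outside the limits, it can be useful to confirm how much of the data the trimmed plot still shows. The chunk below is a sketch of that check; `central_limits` is a helper name introduced here for illustration, and its 1% and 99% defaults simply mirror the quantiles used in the plot.

```{r echo=FALSE, fixed.acidity_trim_check}
# Hypothetical helper: the lower and upper quantile cut-offs used as x limits.
central_limits <- function(x, lower = 0.01, upper = 0.99) {
  unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
}
# Only run on the wine data when the loading chunk has created `wqw`.
if (exists("wqw")) {
  limits <- central_limits(wqw$fixed.acidity)
  # Share of observations that fall inside the plotted range.
  print(mean(wqw$fixed.acidity >= limits[1] & wqw$fixed.acidity <= limits[2]))
}
```

Ties at the cut-off values mean the share kept will not necessarily be exactly 98%.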
```{r echo=FALSE, volatile.acidity_histogram}
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_histogram(binwidth = .01)
```
We have another long-tailed distribution. I am going to plot it again, this time with a log10 transformation.
```{r echo=FALSE, volatile.acidity_histogram_log10}
ggplot(aes(x = volatile.acidity), data = wqw) +
geom_histogram(binwidth = .04) +
scale_x_log10()
```
```{r echo=FALSE, citric.acid_histogram}
ggplot(aes(x = citric.acid), data = wqw) +
geom_histogram(binwidth = .01) +
xlim(quantile(wqw$citric.acid, 0.01), quantile(wqw$citric.acid, 0.99))
```
There is an odd spike at about 0.49. I might want to look into that more later.
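A simple first step in looking into the spike is to list the most frequent citric.acid values and see whether a single recorded value accounts for it. The chunk below is a sketch only; `most_common_values` is a helper name introduced here for illustration.

```{r echo=FALSE, citric.acid_common_values}
# Hypothetical helper: the n most frequent values of a vector, with counts.
most_common_values <- function(x, n = 5) {
  head(sort(table(x), decreasing = TRUE), n)
}
# Only run on the wine data when the loading chunk has created `wqw`.
if (exists("wqw")) {
  print(most_common_values(wqw$citric.acid))
}
```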
```{r echo=FALSE, residual.sugar_histogram}
ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .1) +
xlim(quantile(wqw$residual.sugar, 0.01), quantile(wqw$residual.sugar, 0.99))
```
Even with the top and bottom 1% removed, the plot is still very long-tailed.
```{r echo=FALSE, residual.sugar_histogram_log10}
p1 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .05) +
scale_x_log10()
p2 <- ggplot(aes(x = residual.sugar), data = wqw) +
geom_histogram(binwidth = .01) +
scale_x_log10(breaks = seq(0, 20, 2))
grid.arrange(p1, p2)
```
```{r echo=FALSE, residual.sugar_summary}
summary(wqw$residual.sugar)
```
Using a log10 transform with a bin width of 0.05 suggests a bimodal distribution. Decreasing the bin width to 0.01 shows that, while there are a lot of observations between ~4 and ~20, they are much more spread out, and there are more observations at each individual value from ~0.5 to ~2. The summary also shows that the median is 5.2 and the mean is 6.4, which puts both of them in between the two peaks.
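The "more observations at each individual value" reading can be put into numbers by dividing the count of observations in a range by the count of distinct values in that range. The chunk below is a rough sketch of that idea; `obs_per_distinct_value` is a helper name introduced here for illustration, and the range boundaries are the approximate ones quoted above.

```{r echo=FALSE, residual.sugar_value_density}
# Hypothetical helper: average number of observations per distinct value
# within the closed interval [low, high].
obs_per_distinct_value <- function(x, low, high) {
  in_range <- x[!is.na(x) & x >= low & x <= high]
  length(in_range) / length(unique(in_range))
}
# Only run on the wine data when the loading chunk has created `wqw`.
if (exists("wqw")) {
  print(c(low_range = obs_per_distinct_value(wqw$residual.sugar, 0.5, 2),
          high_range = obs_per_distinct_value(wqw$residual.sugar, 4, 20)))
}
```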
```{r echo=FALSE, chlorides_histogram}
ggplot(aes(x = chlorides), data = wqw) +
geom_histogram(binwidth = .001) +
xlim(0, quantile(wqw$chlorides, 0.97))
```
Here I just removed the top 3% of values to remove the long tail.
```{r echo=FALSE, sulfur.dioxide_histograms}
p1 <- ggplot(aes(x = free.sulfur.dioxide), data = wqw) +
geom_histogram(binwidth = 1) +
xlim(0, quantile(wqw$free.sulfur.dioxide, 0.99))
p2 <- ggplot(aes(x = total.sulfur.dioxide), data = wqw) +
geom_histogram(binwidth = 1) +
xlim(quantile(wqw$total.sulfur.dioxide, 0.01), quantile(wqw$total.sulfur.dioxide, 0.99))
grid.arrange(p1, p2)
```
I plotted the free sulfur dioxide and total sulfur dioxide together to save room and because they are related. Note the difference in scales on both axes.
```{r echo=FALSE, density_histogram}
ggplot(aes(x = density), data = wqw) +
geom_histogram(binwidth = .0001) +
xlim(quantile(wqw$density, 0.01), quantile(wqw$density, 0.99))
```
```{r echo=FALSE, pH_histogram}
ggplot(aes(x = pH), data = wqw) +
geom_histogram(binwidth = .01)
```
The pH plot doesn't need any modification.
```{r echo=FALSE, sulphates_histogram}
ggplot(aes(x = sulphates), data = wqw) +
geom_histogram(binwidth = .01)
```
> **Tip**: Make sure that you leave a blank line between the start / end of
each code block and the end / start of your Markdown text so that it is
formatted nicely in the knitted text. Note as well that text on consecutive
lines is treated as a single space. Make sure you have a blank line between
your paragraphs so that they too are formatted for easy readability.
# Univariate Analysis
> **Tip**: Now that you've completed your univariate explorations, it's time to
reflect on and summarize what you've found. Use the questions below to help you
gather your observations and add your own if you have other thoughts!
### What is the structure of your dataset?
### What is/are the main feature(s) of interest in your dataset?
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
### Did you create any new variables from existing variables in the dataset?
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
# Bivariate Plots Section
> **Tip**: Based on what you saw in the univariate plots, what relationships
between variables might be interesting to look at in this section? Don't limit
yourself to relationships between a main output feature and one of the
supporting variables. Try to look at relationships between supporting variables
as well.
```{r echo=FALSE, Bivariate_Plots}
```
# Bivariate Analysis
> **Tip**: As before, summarize what you found in your bivariate explorations
here. Use the questions below to guide your discussion.
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
### What was the strongest relationship you found?
# Multivariate Plots Section
> **Tip**: Now it's time to put everything together. Based on what you found in
the bivariate plots section, create a few multivariate plots to investigate
more complex interactions between variables. Make sure that the plots that you
create here are justified by the plots you explored in the previous section. If
you plan on creating any mathematical models, this is the section where you
will do that.
```{r echo=FALSE, Multivariate_Plots}
```
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
### Were there any interesting or surprising interactions between features?
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
------
# Final Plots and Summary
> **Tip**: You've done a lot of exploration and have built up an understanding
of the structure of and relationships between the variables in your dataset.
Here, you will select three plots from all of your previous exploration to
present here as a summary of some of your most interesting findings. Make sure
that you have refined your selected plots for good titling, axis labels (with
units), and good aesthetic choices (e.g. color, transparency). After each plot,
make sure you justify why you chose each plot by describing what it shows.
### Plot One
```{r echo=FALSE, Plot_One}
```
### Description One
### Plot Two
```{r echo=FALSE, Plot_Two}
```
### Description Two
### Plot Three
```{r echo=FALSE, Plot_Three}
```
### Description Three
------
# Reflection
> **Tip**: Here's the final step! Reflect on the exploration you performed and
the insights you found. What were some of the struggles that you went through?
What went well? What was surprising? Make sure you include an insight into
future work that could be done with the dataset.
> **Tip**: Don't forget to remove this, and the other **Tip** sections before
saving your final work and knitting the final report!