A Data Exploration of White Wine Quality

by Kai Wang

1. Introduction

This is a public dataset of white variants of the Portuguese “Vinho Verde” wine. The details are described in [Cortez et al., 2009].

(P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.)

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016

[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf

[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

2. Data

The data are read from the file ‘wineQualityWhites.csv’. The first six rows look like this:

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

As this quick look shows, the dataset contains 12 features (plus a row index, X), and quality is the one I am most interested in. All features are listed below:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

A check for missing values returns FALSE, so there is no missing data.

## [1] FALSE
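
The exact call behind the FALSE output is not shown in this report. A minimal sketch of one common way to run such a check, assuming the data frame is tested with `any(is.na(...))`, and using only the first three rows and a few columns from the preview above:

```r
# First rows of the dataset, copied from the preview above (subset of columns).
csv_text <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.0,0.27,8.8,6
2,6.3,0.30,9.5,6
3,8.1,0.28,10.1,6"
wine_head <- read.csv(text = csv_text)

# TRUE if at least one cell is NA, FALSE otherwise.
has_missing <- any(is.na(wine_head))
print(has_missing)

# Per-column count of missing values, useful when the answer is TRUE.
missing_per_column <- colSums(is.na(wine_head))
print(missing_per_column)
```

On the full data the same two calls would be applied to the complete data frame instead of `wine_head`, which is a name introduced here for illustration.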

3. Exploring

3.1 Univariate Plots Section

First, I want to understand the distribution of each variable on its own.

3.1.1 Fixed Acidity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Half of the wines have a fixed acidity between 6.3 and 7.3 g/L. The median is 6.8 g/L.
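
The "half of the wines" statements in this section come from the first and third quartiles in the summary output. A minimal sketch of how these numbers are obtained in base R, using only the six fixed acidity values from the data preview (so the results differ from the full-data summary above):

```r
# Six fixed acidity values copied from the data preview (not the full dataset).
fixed_acidity <- c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1)

# Min, quartiles, median, mean and max in one call.
print(summary(fixed_acidity))

# The quartiles on their own; the middle 50% of values lie between 25% and 75%.
q <- quantile(fixed_acidity, probs = c(0.25, 0.5, 0.75))
print(q)

# Width of the middle 50% (interquartile range).
iqr_width <- IQR(fixed_acidity)
print(iqr_width)
```

A mean that sits above the median, as in most of the summaries in this section, is a quick hint that the distribution has a longer right tail.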

3.1.2 Volatile Acidity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution of volatile acidity is positively skewed. The median is 0.26 g/L, and 50% of the wines have a volatile acidity between 0.21 and 0.32 g/L. Since too much volatile acidity can lead to an unpleasant, vinegar-like taste, I would expect a negative correlation between volatile acidity and quality.

3.1.3 Citric Acid
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

50% of the wines have citric acid between 0.27 and 0.39 g/L. The median is 0.32 g/L. Citric acid can add freshness and flavor to wines, so it may have a positive effect on quality.

3.1.4 Residual Sugar
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Most of the wines don’t have much sugar, and values above 20 g/L are rare. After log-transforming the long-tailed data to see the distribution better, two separate peaks appear, so the transformed distribution looks bimodal.
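
A minimal sketch of the log transformation, using synthetic data as a stand-in for residual sugar (the mixture parameters below are invented for illustration and are not estimates from the wine data). It uses base R's `hist()` with `plot = FALSE` so the bin counts can be inspected directly:

```r
set.seed(42)  # reproducible toy data

# Synthetic stand-in: a low-sugar group and a high-sugar group.
sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(200, meanlog = log(10),  sdlog = 0.4))

# On the raw scale the data are right-skewed; log10 spreads out the
# small values and compresses the long tail.
log_sugar <- log10(sugar)

# Bin the transformed values without drawing anything.
h <- hist(log_sugar, breaks = 30, plot = FALSE)
print(h$counts)
```

With ggplot2, a similar effect is usually obtained by adding `scale_x_log10()` to a histogram of the raw values.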

3.1.5 Chlorides
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

50% of the wines have chlorides between 0.036 and 0.05 g/L, and the median is 0.043 g/L.

3.1.6 Free Sulfur Dioxide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Sulfur dioxide plays an important role in winemaking as an antimicrobial agent and antioxidant, so it would be interesting to find out whether sulfur dioxide affects the quality of wines.

3.1.7 Total Sulfur Dioxide
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide is likely to be correlated with free sulfur dioxide.

3.1.8 Density
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density depends on the alcohol percentage and the sugar content. The distribution of density shows a few outliers; apart from them, it is approximately normal.

3.1.9 pH
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH describes how acidic a wine is. The distribution shows that 50% of the wines have a pH between 3.09 and 3.28.

3.1.10 Sulphates
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates act as an antimicrobial and antioxidant. The distribution is nearly normal, and 50% of the wines have sulphates between 0.41 and 0.55 g/dm^3. The median is 0.47 g/dm^3 and the mean is 0.4898 g/dm^3.

3.1.11 Alcohol
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The plot and summary statistics above show the alcohol distribution. Although it is not strictly bimodal, it does have two peaks.

3.1.12 Quality
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The distribution of quality shows that no wine has a quality of 0, 1, 2 or 10. The counts for 3, 4, 8 and 9 are also quite small, so later I create quality buckets to increase the number of wines in each group. I hope that makes the visualizations clearer.
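
A minimal sketch of how uneven the quality scores are, rebuilding the quality column from the per-quality counts reported in the summary table of the multivariate section (column n). The variable names are introduced here for illustration:

```r
# Counts per quality score, taken from the per-quality summary table
# later in this report (column n).
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild one quality value per wine from the counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Frequency table and percentage share of each score.
counts <- table(quality)
share <- round(100 * prop.table(counts), 1)
print(share)

# Combined share of the rarely used scores 3, 4, 8 and 9.
rare_share <- sum(prop.table(counts)[c("3", "4", "8", "9")])
print(rare_share)
```

The counts add up to the 4898 wines in the dataset, and the rare scores together make up only a small fraction of them, which is the motivation for bucketing.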

3.2 Univariate Analysis

What is the structure of your dataset?

The dataset contains objective (physicochemical) and subjective (sensory quality) data for 4898 white wines, with 12 features. Apart from quality, the other 11 features can be grouped as follows:

  • Acid: fixed acidity, volatile acidity, citric acid

  • Sugar: residual sugar

  • Salt: chlorides

  • Alcohol: alcohol

  • Chemicals: free sulfur dioxide, total sulfur dioxide, sulphates, pH

  • Physical: density

What is/are the main feature(s) of interest in your dataset?

After plotting the individual features and reading about winemaking, I found sulfur dioxide, acidity and alcohol the most interesting. I wonder whether fixed acidity, volatile acidity and free sulfur dioxide affect wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Sugar, pH and density may also influence the result.

Did you create any new variables from existing variables in the dataset?

I transformed quality into a factor: wine$quality = as.factor(wine$quality)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the right-skewed sugar distribution. The transformed distribution appears bimodal, with two peaks.

3.3 Bivariate Plots Section

##                      fixed.acidity volatile.acidity   citric.acid
## fixed.acidity         0.7461047654     2.099464e-03  3.182328e-02
## volatile.acidity      0.0020994637     1.003697e-02 -1.619317e-03
## citric.acid           0.0318232787    -1.619317e-03  1.510034e-02
## residual.sugar        0.2597647632     4.547870e-02  5.263599e-02
## chlorides             0.0003641005     9.572893e-05  3.440716e-04
## free.sulfur.dioxide  -1.0400747419    -1.656490e-01  1.092229e-01
## total.sulfur.dioxide  3.7759423470     6.379205e-01  3.936402e-01
## density               0.0006420988     2.103030e-05  4.692457e-05
## pH                   -0.0609620074    -1.255798e-03 -3.379584e-03
## sulphates            -0.0083034844    -1.660694e-04  3.214771e-05
## alcohol              -0.1199684099     2.494327e-03 -9.047625e-03
## quality              -0.1230124684    -2.043598e-02  1.757464e-03
##                      residual.sugar     chlorides free.sulfur.dioxide
## fixed.acidity            0.25976476  3.641005e-04         -1.04007474
## volatile.acidity         0.04547870  9.572893e-05         -0.16564903
## citric.acid              0.05263599  3.440716e-04          0.10922289
## residual.sugar          27.10330832  1.031135e-02         26.38488456
## chlorides                0.01031135  3.991821e-04          0.02270944
## free.sulfur.dioxide     26.38488456  2.270944e-02        284.07454732
## total.sulfur.dioxide    89.54116360  1.702355e-01        434.69109518
## density                  0.01337119  1.656674e-05          0.01504912
## pH                      -0.11945507 -1.724823e-04         -0.02772824
## sulphates               -0.01156195 -3.737575e-05          0.11708837
## alcohol                 -3.06052528 -9.210864e-03         -5.65643143
## quality                 -0.64517452 -3.304398e-03         -0.25170183
##                      total.sulfur.dioxide       density            pH
## fixed.acidity                  3.77594235  6.420988e-04 -6.096201e-02
## volatile.acidity               0.63792050  2.103030e-05 -1.255798e-03
## citric.acid                    0.39364020  4.692457e-05 -3.379584e-03
## residual.sugar                89.54116360  1.337119e-02 -1.194551e-01
## chlorides                      0.17023552  1.656674e-05 -1.724823e-04
## free.sulfur.dioxide          434.69109518  1.504912e-02 -2.772824e-02
## total.sulfur.dioxide        1797.88201842  6.960388e-02  1.348139e-02
## density                        0.06960388  9.105398e-06 -3.627898e-05
## pH                             0.01348139 -3.627898e-05  2.281550e-02
## sulphates                      0.61720623  1.325062e-05  2.811087e-03
## alcohol                      -23.58842976 -2.952014e-03  1.801804e-02
## quality                       -8.35562013 -9.193057e-04  1.211501e-02
##                          sulphates       alcohol       quality
## fixed.acidity        -8.303484e-03  -0.119968410 -0.1230124684
## volatile.acidity     -1.660694e-04   0.002494327 -0.0204359850
## citric.acid           3.214771e-05  -0.009047625  0.0017574643
## residual.sugar       -1.156195e-02  -3.060525276 -0.6451745152
## chlorides            -3.737575e-05  -0.009210864 -0.0033043981
## free.sulfur.dioxide   1.170884e-01  -5.656431426 -0.2517018350
## total.sulfur.dioxide  6.172062e-01 -23.588429756 -8.3556201337
## density               1.325062e-05  -0.002952014 -0.0009193057
## pH                    2.811087e-03   0.018018043  0.0121150149
## sulphates             1.264227e-02   0.004524218  0.0121514435
## alcohol               4.524218e-03   1.516080608  0.4775429243
## quality               1.215144e-02   0.477542924  0.7869068929

Note that the table above appears to be a covariance matrix rather than a correlation matrix: its diagonal holds each variable’s variance instead of 1, so the entries depend on the variables’ units. The correlation values quoted in this section are computed separately for each pair.

Quality does not correlate strongly with any single feature, partly because quality is discrete and takes only 7 integer values. Density vs. sugar and density vs. alcohol are quite strongly correlated. Free sulfur dioxide and total sulfur dioxide are also well correlated, which is reasonable since free sulfur dioxide is part of the total.

The variables (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) have been renamed (“FixA”, “VolA”, “Citric”, “Sugar”, “CI”, “FreeSO2”, “SO2”, “Dens”, “pH”, “SO4”, “Alc”, “qual”) so that they display properly on the graph.
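
Covariances and correlations differ only by a rescaling: a correlation is a covariance divided by the two standard deviations, which puts every pair on the same scale from -1 to 1. A minimal sketch of that relationship, using three columns of R's built-in `mtcars` data as a stand-in for the wine features (the choice of dataset and columns is an assumption made only for illustration):

```r
# Stand-in data: three numeric columns from the built-in mtcars dataset.
x <- mtcars[, c("mpg", "wt", "hp")]

# Covariance matrix: the diagonal holds variances, and the values
# depend on the units of each column.
cov_matrix <- cov(x)
print(round(cov_matrix, 3))

# Rescale the covariance matrix into a correlation matrix.
cor_from_cov <- cov2cor(cov_matrix)

# The same matrix computed directly; its diagonal is exactly 1.
cor_matrix <- cor(x)
print(round(cor_matrix, 3))
```

Applied to a covariance table like the one above, `cov2cor()` would give values that can be compared across feature pairs regardless of units.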

density vs. sugar

## relevant correlation values: 0.8389665

Sugar and density show a clear positive correlation. The relationship is looser at low sugar levels, where density covers a wide range.

density vs. alcohol
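library(ggplot2)
# Density against alcohol; the y-axis stops at the 99th percentile of
# density so that a few extreme outliers do not compress the plot
ggplot(data = wine, aes(x = alcohol, y = density)) +
  geom_point(alpha = 0.5, position = position_jitter(height = 0)) +
  ylim(min(wine$density), quantile(wine$density, 0.99))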
## Warning: Removed 49 rows containing missing values (geom_point).

# Pearson correlation between alcohol and density
r <- cor(wine$alcohol, wine$density)
cat("relevant correlation values:", r)
## relevant correlation values: -0.7801376

There is also a clear trend that density tends to decrease as alcohol increases. This is reasonable, since ethanol is less dense than water.

Histogram with acidity and quality

It is not clear how the distributions vary across quality levels.

polygon with color by quality

The plots show the distribution of each feature by quality. Qualities 5 and 6 have the most variability in all plots.

boxplot with color by quality

It seems that high-quality wines (quality 6, 7 and 8) tend to have high levels of alcohol and free sulfur dioxide, as well as a low density. pH also shows a clearer relationship with quality than fixed acidity and volatile acidity do.

boxplot with color by quality_bucket

I cut quality into 3 buckets and re-plotted the boxplot for each feature.

This time, chlorides and pH show a clear relationship with quality.
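
A minimal sketch of the bucketing step with base R's `cut()`. The break points 2, 5, 7 and 10 are inferred from the bucket labels (2,5], (5,7] and (7,10] that appear in the output tables of this report; the exact call used is not shown, so treat it as an assumption:

```r
# One example value for each quality score present in the data.
# If quality is stored as a factor, convert it first with
# as.numeric(as.character(quality)).
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Intervals are open on the left and closed on the right by default:
# (2,5] holds 3-5, (5,7] holds 6-7, (7,10] holds 8-9.
quality_bucket <- cut(quality, breaks = c(2, 5, 7, 10))

print(data.frame(quality, quality_bucket))
print(table(quality_bucket))
```

Merging the scores this way puts the rare scores 3, 4, 8 and 9 into groups that are large enough to summarize.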

## relevant correlation values: 0.4355747

A scatter plot showing the distribution of alcohol at each quality level.

Free sulfur dioxide vs. total sulfur dioxide

## relevant correlation values: 0.615501

Free sulfur dioxide and total sulfur dioxide have a correlation of 0.616, which is reasonable, as free sulfur dioxide is part of the total.

3.4 Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Density vs. sugar and density vs. alcohol are quite strongly correlated. Free sulfur dioxide and total sulfur dioxide are also well correlated, which is reasonable since free sulfur dioxide is part of the total.

Alcohol and density were found to be related to wine quality. High-quality wines tend to have high alcohol and low density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Free sulfur dioxide and total sulfur dioxide have a correlation of 0.616, which is reasonable, as free sulfur dioxide is part of the total. Sugar and density are positively correlated.

What was the strongest relationship you found?

The alcohol vs. quality and density vs. quality relationships are clear and match my expectations. pH and chlorides also seem to be useful features.

3.4 Multivariate Plots Section

## Source: local data frame [3 x 2]
## 
##   quality_bucket coefficient
##           <fctr>       <dbl>
## 1          (2,5]  -0.6801808
## 2          (5,7]  -0.7814067
## 3         (7,10]  -0.8765548

This scatter plot clearly shows that high-quality wines lie in the bottom right corner, which represents low density and high alcohol.
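
A minimal sketch of how a per-bucket coefficient table like the one above can be computed. The "local data frame" header suggests dplyr was used originally; the version below uses only base R and synthetic data, so the column values and the resulting coefficients are illustrative and do not reproduce the numbers in the report:

```r
set.seed(1)  # reproducible toy data
n <- 300

# Synthetic stand-in for the wine data: density falls as alcohol rises.
toy <- data.frame(
  alcohol = runif(n, min = 8, max = 14),
  quality = sample(3:9, n, replace = TRUE)
)
toy$density <- 1.005 - 0.001 * toy$alcohol + rnorm(n, sd = 0.001)

# Same three buckets as in the report.
toy$quality_bucket <- cut(toy$quality, breaks = c(2, 5, 7, 10))

# Split the rows by bucket and compute the alcohol-density correlation
# within each group.
coefficient <- sapply(split(toy, toy$quality_bucket),
                      function(d) cor(d$alcohol, d$density))
print(round(coefficient, 3))
```

The result is a named vector with one Pearson coefficient per bucket, which corresponds to the coefficient column of the table.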

## Source: local data frame [7 x 4]
## 
##   quality fixed_volatile_mean fixed_volatile_median     n
##    <fctr>               <dbl>                 <dbl> <int>
## 1       3            26.17604              23.81410    20
## 2       4            22.44670              21.61290   163
## 3       5            25.32756              23.87097  1457
## 4       6            29.06846              27.72727  2198
## 5       7            28.97991              27.29021   880
## 6       8            27.73294              26.66667   175
## 7       9            25.86895              27.30769     5

## Source: local data frame [7 x 2]
## 
##   quality  coefficient
##    <fctr>        <dbl>
## 1       3 -0.091465468
## 2       4 -0.007696289
## 3       5  0.130496004
## 4       6 -0.130009906
## 5       7 -0.501418319
## 6       8 -0.605793292
## 7       9 -0.589626550

These plots show how the ratio of fixed acidity to volatile acidity relates to quality. There is no clear relationship, but the boxplot and the summary table show that the median ratio is above 25 for higher-quality wines (quality > 5) and below 25 for the others. The coefficient tables for the ratio and alcohol suggest that, in the higher-quality groups, the ratio is also more strongly correlated with alcohol.
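
A minimal sketch of how the ratio and its per-quality summary can be computed in base R. The six rows below are toy values (the acidity figures resemble those in the data preview, but the quality scores are invented so that there are several groups), and the column name `fixed_volatile` is an assumption based on the table headers above:

```r
# Toy data: acidity values with invented quality scores.
toy <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 6.0, 7.5),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.40, 0.25),
  quality          = c(5, 5, 6, 6, 7, 7)
)

# New variable: ratio of fixed acidity to volatile acidity.
toy$fixed_volatile <- toy$fixed.acidity / toy$volatile.acidity

# Mean, median and group size of the ratio for each quality score.
ratio_mean   <- tapply(toy$fixed_volatile, toy$quality, mean)
ratio_median <- tapply(toy$fixed_volatile, toy$quality, median)
group_size   <- tapply(toy$fixed_volatile, toy$quality, length)

ratio_summary <- data.frame(
  quality               = names(group_size),
  fixed_volatile_mean   = as.vector(ratio_mean),
  fixed_volatile_median = as.vector(ratio_median),
  n                     = as.vector(group_size)
)
print(ratio_summary)
```

Comparing the median column with a threshold such as 25 is then a one-line check per quality group.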

## Source: local data frame [3 x 2]
## 
##   quality_bucket coefficient
##           <fctr>       <dbl>
## 1          (2,5] -0.03334959
## 2          (5,7]  0.13101660
## 3         (7,10]  0.50122027

I cut quality into 3 buckets and re-plotted the ratio of fixed acidity to volatile acidity against quality. High-quality wines tend to be located at the top right of the plot, where alcohol is high.

The sugar-alcohol-quality plot shows that great wines don’t have a lot of sugar. They have some sugar, and in most cases their alcohol is high. However, sugar and alcohol are not strongly correlated.

3.5 Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol is further shown to be a useful feature for judging wine quality. I had hoped that the ratio of fixed acidity to volatile acidity would be correlated with quality, but the trend is not clear.

Were there any interesting or surprising interactions between features?

The wines in the (5,7] bucket have a wide range of sugar values. This makes the plot hard to interpret and leaves the relationship between sugar and quality unclear.


4. Final Plots and Summary

Plot One

Description One

These plots show the relationship between fixed acidity, volatile acidity and quality. Although there is no clear correlation, wines with a high ratio of fixed acidity to volatile acidity tend to have high quality. For example, the points for high-quality wines lie in the left corner of the plot, and the median ratio of high-quality wines in the boxplot is above 25.

Plot Two

Description Two

The boxplot shows that alcohol varies across quality levels. The median alcohol clearly increases from quality 5 to quality 9. When I cut quality into three groups, the trend is even clearer.

Plot Three

Description Three

The plot shows the negative correlation between density and alcohol. When the points are colored by quality, the high-quality wines lie in the bottom right corner, with high alcohol and low density.

5. Reflection

This dataset is different from the diamond data and the Facebook data that were used in the course. First, as far as quality is concerned, it is a multi-class classification problem. Second, apart from quality, this data has no nominal or ordinal columns.

When I first dealt with this data, I felt clueless, as no feature seemed to have a strong correlation with quality. But I dug into the data. Of course, some plots may seem useless, and different plots may tell similar stories, but no matter what, we have to let the data speak. The multivariate analysis can tell a more complete story since it involves more of the data. I really enjoyed the journey to this end.

There are plenty of ways to improve my analysis of the white wine data. Some people combined the red and white wine data, which I think is interesting. Many of my key decisions in this analysis were based on investigating the relationships between correlated variables, but there are certain non-correlated variables that still warrant additional investigation.

I’m looking forward to the next step: building classification and regression models.