This is a public dataset of white variants of the Portuguese “Vinho Verde” wine. The details are described in [Cortez et al., 2009].
(P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.)
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
The data is read from the file ‘wineQualityWhites.csv’.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
As a quick look at the first rows shows, the data contains 12 variables (plus the row index X), and quality is the one I am most interested in. All variables are listed below:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
And there’s no missing data.
## [1] FALSE
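A check of this kind can be sketched as follows. This is only a minimal sketch, not necessarily the call used for the output above: `wine_head` is a hypothetical stand-in built from a few columns of the first six rows printed earlier, and `any(is.na(...))` is one of several ways to test for missing values.

```r
# Hypothetical stand-in: a few columns from the first six rows shown earlier.
wine_head <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality          = c(6, 6, 6, 6, 6, 6)
)

# TRUE if any cell in the data frame is NA, FALSE otherwise.
has_missing <- any(is.na(wine_head))
print(has_missing)

# Per-column count of missing values, useful when the check above is TRUE.
missing_per_column <- colSums(is.na(wine_head))
print(missing_per_column)
```

On the full data set the same two calls would be applied to the whole data frame instead of `wine_head`.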
First, I want to understand the distribution of each single variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The middle 50% of the wines have a fixed acidity between 6.3 and 7.3 g/L. The median is 6.8 g/L.
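The "middle 50%" statements in this section are read off the 1st and 3rd quartiles of each summary. A minimal sketch of that computation is shown below. The helper name `middle_half` is hypothetical, and the input is a toy vector rather than wine data.

```r
# Return the first quartile, median and third quartile of a numeric vector.
middle_half <- function(x) {
  stats::quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
}

# Toy input (not wine data): the integers 1..9.
q <- middle_half(1:9)
print(q)

# Width of the interval that holds the middle 50% of the values.
iqr_width <- q[3] - q[1]
print(iqr_width)
```

Applied to a column such as `wine$fixed.acidity`, the first and third values would correspond to the "1st Qu." and "3rd Qu." entries of the summary output.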
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The distribution of volatile acidity is positively skewed. The median is 0.26 g/L, and the middle 50% of wines have a volatile acidity between 0.21 and 0.32 g/L. Since volatile acidity at too high a level can lead to an unpleasant, vinegar taste, I would expect to find a negative correlation between volatile acidity and quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The middle 50% of the wines have citric acid between 0.27 and 0.39 g/L. The median is 0.32 g/L. Citric acid can add freshness and flavor to wines, so it may have a positive effect on quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Most of the wines don’t have much sugar, and it’s rare for residual sugar to exceed 20 g/L. I transformed the long-tailed data to better understand the distribution of sugar. The transformed distribution shows two clearly separated peaks, so it looks bimodal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The middle 50% of wines have chlorides between 0.036 and 0.05 g/L, and the median is 0.043 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Sulfur dioxide plays an important role in winemaking as it’s an anti-microbial agent and antioxidant. So it would be interesting to find out whether sulfur dioxide affects the quality of wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide may be correlated with free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density depends on the alcohol and sugar content. The distribution of density shows that there are outliers. Apart from the outliers, it is approximately normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH describes how acidic a wine is. The distribution shows that the middle 50% of wines have a pH between 3.09 and 3.28.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates act as an antimicrobial and antioxidant. The distribution is nearly normal, and the middle 50% of the wines have sulphates between 0.41 and 0.55 g/dm^3. The median is 0.47 g/dm^3 (the mean is 0.4898 g/dm^3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The above plot shows the alcohol distribution and the summary stats. Although it is not clearly bimodal, it does seem to have two peaks.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
From the distribution of quality I found that there are no wines of quality 0, 1, 2 or 10. Besides, the counts for 3, 4, 8 and 9 are quite small. So later I create quality buckets to increase the data count in each group. I hope that makes the visualizations clearer.
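The observation about absent and rare scores can be checked with a few lines of R. The sketch below does not read the data file. Instead it uses the per-score counts reported in column `n` of the grouped summary table later in this report. The variable names are hypothetical.

```r
# Counts per quality score, copied from column n of the grouped table in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Scores on the 0-10 scale that never occur in the data.
absent_scores <- setdiff(0:10, as.integer(names(quality_counts)))
print(absent_scores)

# Share of wines per score, in percent.
share <- round(100 * quality_counts / sum(quality_counts), 1)
print(share)
```

With the data frame loaded, `table(wine$quality)` would give the same counts directly.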
The dataset contains objective measurements and subjective quality scores for 4898 white wines, with 12 features in total. Apart from quality, the other 11 features are as follows:
Acid: fixed acidity, volatile acidity, citric acid
Sugar: residual sugar
Salt: chlorides
Alcohol: alcohol
Chemicals: free sulfur dioxide, total sulfur dioxide, sulphates, pH
Physical: density
After plotting the single features and reading about winemaking, I found that sulfur dioxide, acidity and alcohol are the most interesting. I wonder whether fixed acidity, volatile acidity and free sulfur dioxide affect wine quality.
Sugar, pH and density may also influence the result.
I transformed quality to a factor: wine$quality = as.factor(wine$quality)
I log-transformed the right-skewed sugar distribution. The transformed distribution for sugar appears bimodal, with two peaks.
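To see why a log transform helps with such a long right tail, the sketch below applies it to the five summary values of residual sugar reported earlier instead of the full data. The base of the logarithm is not stated in this report, so base 10 is an assumption here, and the variable names are hypothetical.

```r
# Five-number summary of residual sugar, copied from the summary output in this report.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Assumption: a base-10 logarithm.
log_sugar <- log10(sugar)
print(round(log_sugar, 3))

# Distance from the median to each extreme, before and after the transform.
raw_spread <- c(lower = sugar[["median"]] - sugar[["min"]],
                upper = sugar[["max"]] - sugar[["median"]])
log_spread <- c(lower = log_sugar[["median"]] - log_sugar[["min"]],
                upper = log_sugar[["max"]] - log_sugar[["median"]])
print(raw_spread)
print(round(log_spread, 3))
```

On the raw scale the upper tail is more than ten times longer than the lower one. After the transform the two sides are of similar length, which is what makes the two peaks visible in a histogram.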
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 0.7461047654 2.099464e-03 3.182328e-02
## volatile.acidity 0.0020994637 1.003697e-02 -1.619317e-03
## citric.acid 0.0318232787 -1.619317e-03 1.510034e-02
## residual.sugar 0.2597647632 4.547870e-02 5.263599e-02
## chlorides 0.0003641005 9.572893e-05 3.440716e-04
## free.sulfur.dioxide -1.0400747419 -1.656490e-01 1.092229e-01
## total.sulfur.dioxide 3.7759423470 6.379205e-01 3.936402e-01
## density 0.0006420988 2.103030e-05 4.692457e-05
## pH -0.0609620074 -1.255798e-03 -3.379584e-03
## sulphates -0.0083034844 -1.660694e-04 3.214771e-05
## alcohol -0.1199684099 2.494327e-03 -9.047625e-03
## quality -0.1230124684 -2.043598e-02 1.757464e-03
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.25976476 3.641005e-04 -1.04007474
## volatile.acidity 0.04547870 9.572893e-05 -0.16564903
## citric.acid 0.05263599 3.440716e-04 0.10922289
## residual.sugar 27.10330832 1.031135e-02 26.38488456
## chlorides 0.01031135 3.991821e-04 0.02270944
## free.sulfur.dioxide 26.38488456 2.270944e-02 284.07454732
## total.sulfur.dioxide 89.54116360 1.702355e-01 434.69109518
## density 0.01337119 1.656674e-05 0.01504912
## pH -0.11945507 -1.724823e-04 -0.02772824
## sulphates -0.01156195 -3.737575e-05 0.11708837
## alcohol -3.06052528 -9.210864e-03 -5.65643143
## quality -0.64517452 -3.304398e-03 -0.25170183
## total.sulfur.dioxide density pH
## fixed.acidity 3.77594235 6.420988e-04 -6.096201e-02
## volatile.acidity 0.63792050 2.103030e-05 -1.255798e-03
## citric.acid 0.39364020 4.692457e-05 -3.379584e-03
## residual.sugar 89.54116360 1.337119e-02 -1.194551e-01
## chlorides 0.17023552 1.656674e-05 -1.724823e-04
## free.sulfur.dioxide 434.69109518 1.504912e-02 -2.772824e-02
## total.sulfur.dioxide 1797.88201842 6.960388e-02 1.348139e-02
## density 0.06960388 9.105398e-06 -3.627898e-05
## pH 0.01348139 -3.627898e-05 2.281550e-02
## sulphates 0.61720623 1.325062e-05 2.811087e-03
## alcohol -23.58842976 -2.952014e-03 1.801804e-02
## quality -8.35562013 -9.193057e-04 1.211501e-02
## sulphates alcohol quality
## fixed.acidity -8.303484e-03 -0.119968410 -0.1230124684
## volatile.acidity -1.660694e-04 0.002494327 -0.0204359850
## citric.acid 3.214771e-05 -0.009047625 0.0017574643
## residual.sugar -1.156195e-02 -3.060525276 -0.6451745152
## chlorides -3.737575e-05 -0.009210864 -0.0033043981
## free.sulfur.dioxide 1.170884e-01 -5.656431426 -0.2517018350
## total.sulfur.dioxide 6.172062e-01 -23.588429756 -8.3556201337
## density 1.325062e-05 -0.002952014 -0.0009193057
## pH 2.811087e-03 0.018018043 0.0121150149
## sulphates 1.264227e-02 0.004524218 0.0121514435
## alcohol 4.524218e-03 1.516080608 0.4775429243
## quality 1.215144e-02 0.477542924 0.7869068929
Quality doesn’t have a very strong correlation with any of the other features. This may be partly because quality is discrete and only takes 7 integer values. Density vs. sugar and density vs. alcohol are quite strongly correlated. Free sulfur dioxide and total sulfur dioxide are also well correlated, which is reasonable since free sulfur dioxide is part of total sulfur dioxide. The variables (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) have been renamed so that they display properly on the graph (“FixA”, “VolA”, “Citric”, “Sugar”, “CI”, “FreeSO2”, “SO2”, “Dens”, “pH”, “SO4”, “Alc”, “qual”).
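The renaming described above could be done with a named lookup vector, as in the sketch below. This is an assumption about the approach, not the original code. `short_names` and `wine_head` are hypothetical names, and `wine_head` only holds the first two rows printed at the top of this report.

```r
# Mapping from original column names to the short labels listed above.
short_names <- c(
  fixed.acidity = "FixA", volatile.acidity = "VolA", citric.acid = "Citric",
  residual.sugar = "Sugar", chlorides = "CI", free.sulfur.dioxide = "FreeSO2",
  total.sulfur.dioxide = "SO2", density = "Dens", pH = "pH",
  sulphates = "SO4", alcohol = "Alc", quality = "qual"
)

# Hypothetical stand-in for the data: the first two rows printed earlier.
wine_head <- data.frame(
  fixed.acidity = c(7.0, 6.3), volatile.acidity = c(0.27, 0.30),
  citric.acid = c(0.36, 0.34), residual.sugar = c(20.7, 1.6),
  chlorides = c(0.045, 0.049), free.sulfur.dioxide = c(45, 14),
  total.sulfur.dioxide = c(170, 132), density = c(1.0010, 0.9940),
  pH = c(3.00, 3.30), sulphates = c(0.45, 0.49),
  alcohol = c(8.8, 9.5), quality = c(6, 6)
)

# Rename by lookup so the result does not depend on column order.
names(wine_head) <- unname(short_names[names(wine_head)])
print(names(wine_head))
```

Looking names up instead of assigning them by position avoids mislabeling a column if the column order ever changes.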
## relevant correlation values: 0.8389665
There’s a clear trend: sugar and density have a positive correlation. But it’s not that strong, because when sugar is low the density has a wide range.
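library(ggplot2)
# Alcohol vs. density: no vertical jitter, y axis trimmed at the 99th percentile of density
ggplot(wine, aes(x = alcohol, y = density)) +
  geom_point(alpha = 0.5, position = position_jitter(height = 0)) +
  ylim(min(wine$density), quantile(wine$density, 0.99))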
## Warning: Removed 49 rows containing missing values (geom_point).
# Pearson correlation between alcohol and density (named r to avoid masking base::c)
r <- cor(wine$alcohol, wine$density)
cat("relevant correlation values:", r)
## relevant correlation values: -0.7801376
There’s also a clear trend: as alcohol increases, the density tends to decrease. That’s reasonable, since alcohol is less dense than water.
It’s not quite clear how the distributions vary across quality levels.
The plots show the distribution of each feature against quality; qualities 5 and 6 have the most variability in all plots.
It seems that high quality wines (quality 6, 7, 8) tend to have a high level of alcohol and free sulfur dioxide, as well as a low density. pH also has a clearer relationship with quality than fixed acidity and volatile acidity.
I cut the quality into 3 buckets and re-plotted the boxplot for each feature.
This time, chlorides and pH show a clear relationship with quality.
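The bucket labels that appear in the tables of this report are (2,5], (5,7] and (7,10]. Labels of that form are what `cut()` produces for break points 2, 5, 7 and 10, so the bucketing may have looked roughly like the sketch below. The exact original call is not shown in this report, so treat the break points as an inference, and the example scores as hypothetical.

```r
# Hypothetical example scores, one per observed quality level.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Right-closed intervals: (2,5], (5,7], (7,10].
quality_bucket <- cut(quality, breaks = c(2, 5, 7, 10))
print(levels(quality_bucket))

# Which score falls into which bucket.
print(table(quality, quality_bucket))
```

If quality has already been converted to a factor, it would need to be turned back into numbers first, for example with `as.numeric(as.character(wine$quality))`.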
## relevant correlation values: 0.4355747
Scatter plot showing the distribution of alcohol against quality.
## relevant correlation values: 0.615501
Free sulfur dioxide and total sulfur dioxide have a correlation of 0.608, which is reasonable, as free sulfur dioxide is part of the total.
Density vs. sugar and density vs. alcohol are quite strongly correlated. Free sulfur dioxide and total sulfur dioxide are also well correlated, which is reasonable since free sulfur dioxide is part of the total.
Alcohol and density were found to be related to wine quality. High quality wines tend to have high alcohol and low density.
Free sulfur dioxide and total sulfur dioxide have a correlation of 0.608, which is reasonable, as free sulfur dioxide is part of the total. Sugar and density have a positive correlation.
The alcohol vs. quality and density vs. quality relationships are clear and match my expectations. pH and chlorides also seem to be useful features.
## Source: local data frame [3 x 2]
##
## quality_bucket coefficient
## <fctr> <dbl>
## 1 (2,5] -0.6801808
## 2 (5,7] -0.7814067
## 3 (7,10] -0.8765548
From this scatter plot, it’s clear that high quality wines lie in the bottom right corner, which represents low density and high alcohol.
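A per-bucket coefficient table like the one above can be produced by splitting the data by bucket and computing a correlation inside each part. The table header suggests dplyr was used originally. The sketch below uses base R instead, with a hypothetical helper `cor_by_group` and made-up toy data that is exactly linear within each group, so the expected coefficients are known.

```r
# Pearson correlation of two columns within each level of a grouping column.
cor_by_group <- function(df, group, x, y) {
  sapply(split(df, df[[group]]), function(part) cor(part[[x]], part[[y]]))
}

# Made-up toy data (not wine measurements), exactly linear within each group.
toy <- data.frame(
  bucket  = rep(c("low", "high"), each = 4),
  alcohol = c(9, 10, 11, 12, 9, 10, 11, 12),
  density = c(4, 3, 2, 1, 2, 4, 6, 8)
)

coefficients <- cor_by_group(toy, "bucket", "alcohol", "density")
print(coefficients)
```

For the wine data the call would take the bucket column as `group`, with `"alcohol"` and `"density"` as `x` and `y`.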
## Source: local data frame [7 x 4]
##
## quality fixed_volatile_mean fixed_volatile_median n
## <fctr> <dbl> <dbl> <int>
## 1 3 26.17604 23.81410 20
## 2 4 22.44670 21.61290 163
## 3 5 25.32756 23.87097 1457
## 4 6 29.06846 27.72727 2198
## 5 7 28.97991 27.29021 880
## 6 8 27.73294 26.66667 175
## 7 9 25.86895 27.30769 5
## Source: local data frame [7 x 2]
##
## quality coefficient
## <fctr> <dbl>
## 1 3 -0.091465468
## 2 4 -0.007696289
## 3 5 0.130496004
## 4 6 -0.130009906
## 5 7 -0.501418319
## 6 8 -0.605793292
## 7 9 -0.589626550
These plots show how the ratio of fixed acidity to volatile acidity relates to quality. There’s no clear relationship, but the boxplot tells me that the median ratio is above 25 for high quality wines (quality > 5). The table of correlation coefficients between this ratio and alcohol in each group suggests that the correlation is also stronger (larger in magnitude) for high quality wines.
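The ratio and its per-quality mean, median and count can be computed along the lines of the sketch below. It is written in base R and is not the original code. `wine_head` is a hypothetical stand-in holding three columns of the first six rows printed at the top of this report, all of which have quality 6, so the result has a single row.

```r
# Hypothetical stand-in: three columns from the first six rows printed earlier.
wine_head <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  quality          = factor(c(6, 6, 6, 6, 6, 6))
)

# Ratio of fixed to volatile acidity for every wine.
wine_head$fixed_volatile <- wine_head$fixed.acidity / wine_head$volatile.acidity

# Mean, median and count of the ratio within each quality level.
ratio_summary <- data.frame(
  quality = levels(wine_head$quality),
  fixed_volatile_mean   = as.numeric(tapply(wine_head$fixed_volatile, wine_head$quality, mean)),
  fixed_volatile_median = as.numeric(tapply(wine_head$fixed_volatile, wine_head$quality, median)),
  n = as.integer(table(wine_head$quality))
)
print(ratio_summary)
```

Run on the full data frame, the same steps would yield one row per quality level, in the shape of the summary table shown earlier.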
## Source: local data frame [3 x 2]
##
## quality_bucket coefficient
## <fctr> <dbl>
## 1 (2,5] -0.03334959
## 2 (5,7] 0.13101660
## 3 (7,10] 0.50122027
I cut the quality into 3 buckets and re-plotted the ratio of fixed acidity to volatile acidity vs. quality. High quality wines tend to be located at the top right of the plot, with high alcohol.
Here is a sugar-alcohol-quality plot. It shows that great wines don’t have a lot of sugar: they have some sugar, and alcohol is high in most cases. But sugar and alcohol are not strongly correlated.
Alcohol is further shown to be a useful feature for judging wine quality. I had hoped the ratio of fixed acidity to volatile acidity would be correlated with quality, but the trend is not clear.
The (5,7] bucket of wines has a wide range of sugar values. This makes the plot hard to interpret and the relationship between sugar and quality unclear.
These plots show the relationship between fixed acidity, volatile acidity and quality. Although there’s no clear correlation, they show that wines with a high ratio of fixed acidity to volatile acidity tend to have high quality. For example, the points for high quality wines lie at the left corner of the plot, and the medians for high quality wines in the boxplot are above 25.
The box plot shows that alcohol varies across quality levels. It’s clear that the median alcohol increases as quality goes from 5 to 9. When I cut the quality into three groups, the trend is even clearer.
The negative correlation between density and alcohol is shown on the plot. When colored by quality, the high quality wines lie in the bottom right corner, with high alcohol and low density.
This dataset is different from the diamond data and Facebook data used in the course. First, as far as quality is concerned, it’s a multi-class classification problem. Second, this data doesn’t have nominal or ordinal columns except quality.
When I first dealt with this data, I felt clueless, as no feature seemed to have a strong correlation with quality. But I dug into the data. Of course, some plots may seem useless, or different plots may tell similar stories, but no matter what, we have to let the data speak. The multivariate analysis can tell a more complete story since it involves more of the data at once. And I really enjoyed the journey to this end.
There are plenty of ways to improve my analysis of the white wine data. Some people combined the red and white wine data, which I think is interesting. Many of my key decisions in this analysis were based on investigating the relationships between correlated variables, but there are certain uncorrelated variables that still warrant additional investigation.
I’m looking forward to the next step: building classification and regression models.