Wine dataset demonstrates importance of feature scaling

The UCI hosts a dataset of wine measurements that is fantastic for demonstrating the importance of feature scaling in unsupervised learning . There are a bunch of real-valued measurements (of e.g. chemical composition) for three varieties of wine: Barolo, Grignolino and Barbera. I am using this dataset to introduce feature scaling my course on DataCamp.

The wine measurements have very different scales, and performing a k-means clustering (with 3 clusters) on the unscaled measurements yields clusters that don’t correspond to the varieties:

varieties  Barbera  Barolo  Grignolino
0               29      13          20
1                0      46           1
2               19       0          50

However, if the measurements are first standardised, the clusters correspond almost perfectly to the three wine varieties:

varieties  Barbera  Barolo  Grignolino
0                0      59           3
1               48       0           3
2                0       0          65

Of course, k-means clustering is not meant to be a classifier! But when the clusters do correspond so well to the classes, then it is apparent that the scaling is pretty good.

Which wine varieties?

I had to search to find the names of the wine varieties. According to page 9 of “Chemometrics with R” (Ron Wehrens), the three varieties are: Barolo (58 samples), Grignolino (71 samples) and Barbera (48 samples). I was unable to follow Wehren’s citation (it’s his [6]) on Google books.

Original source

According to the UCI page, this dataset originated from the paper V-PARVUS. An Extendible Pachage of programs for esplorative data analysis, classification and regression analysis by Forina, M., Lanteri, S. Armanino, C., Casolino, C., Casale, M., Oliveri, P.

Olive oil dataset

There is a dataset of olive oil measurements associated to the same paper. I haven’t tried using it, but am sure I’ll use it in an example one day.

Leave a Reply

Your email address will not be published. Required fields are marked *