The UCI Machine Learning Repository hosts a dataset of wine measurements that is fantastic for demonstrating the importance of feature scaling in unsupervised learning. It contains a number of real-valued measurements (e.g. of chemical composition) for three varieties of wine: Barolo, Grignolino and Barbera. I am using this dataset to introduce feature scaling in my course on DataCamp.
The wine measurements have very different scales, and performing a k-means clustering (with 3 clusters) on the unscaled measurements yields clusters that don’t correspond to the varieties:
varieties  Barbera  Barolo  Grignolino
labels
0               29      13          20
1                0      46           1
2               19       0          50
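One way to see why the unscaled clustering goes astray: k-means groups samples by Euclidean distance, so a feature measured in the hundreds can drown out one measured in fractions. The sketch below uses made-up values for two features (they are illustrative assumptions, not rows from the dataset) and reports how much of the squared distance between two samples comes from the large-scale feature, before and after standardisation.

```python
import statistics

# Hypothetical measurements for four wines: (large-scale feature, small-scale feature).
# These numbers are invented for illustration; they are not taken from the dataset.
samples = [(1065.0, 1.04), (1050.0, 0.60), (520.0, 1.05), (630.0, 0.58)]


def standardise(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]


def share_of_first_feature(a, b):
    """Fraction of the squared Euclidean distance contributed by the first feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[0] / sum(squared)


scaled = standardise(samples)
raw_share = share_of_first_feature(samples[0], samples[3])
scaled_share = share_of_first_feature(scaled[0], scaled[3])

print(f"first feature's share of squared distance, raw:    {raw_share:.4f}")
print(f"first feature's share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw values the first feature accounts for practically all of the distance. After standardisation both features contribute on a comparable footing.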
However, if the measurements are first standardised, the clusters correspond almost perfectly to the three wine varieties:
varieties  Barbera  Barolo  Grignolino
labels
0                0      59           3
1               48       0           3
2                0       0          65
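A comparison along these lines can be sketched with scikit-learn and pandas. This is a minimal sketch under a few assumptions, not necessarily the exact code behind the tables above: it uses scikit-learn's bundled copy of the wine data (`load_wine`), it assumes that classes 0, 1 and 2 are Barolo, Grignolino and Barbera (inferred from the sample counts 59, 71 and 48), and the `random_state` and `n_init` values are arbitrary choices. Cluster numbering, and possibly the exact counts, may differ from the tables shown here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the UCI download.
wine = load_wine()
samples = wine.data

# Assumption: class order 0, 1, 2 = Barolo, Grignolino, Barbera,
# inferred from the per-class sample counts (59, 71, 48).
variety_names = ["Barolo", "Grignolino", "Barbera"]
varieties = pd.Series([variety_names[t] for t in wine.target], name="varieties")


def cluster_crosstab(model):
    """Fit the model, then cross-tabulate cluster labels against wine varieties."""
    labels = pd.Series(model.fit_predict(samples), name="labels")
    return pd.crosstab(labels, varieties)


# k-means on the raw measurements.
unscaled = cluster_crosstab(KMeans(n_clusters=3, n_init=10, random_state=0))

# k-means on standardised measurements (zero mean, unit variance per feature).
scaled = cluster_crosstab(
    make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
)

print(unscaled)
print(scaled)
```

Putting the scaler and the clusterer in one pipeline keeps the two steps together, so the same standardisation is applied whenever the model is fitted or used to assign new samples to clusters.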
Of course, k-means clustering is not meant to be a classifier! But when the clusters correspond this well to the classes, it is a good sign that the scaling has done its job.
Which wine varieties?
I had to search to find the names of the wine varieties. According to page 9 of “Chemometrics with R” (Ron Wehrens), the three varieties are: Barolo (58 samples), Grignolino (71 samples) and Barbera (48 samples). I was unable to follow Wehrens’s citation (it’s his ) on Google Books.
According to the UCI page, this dataset originated from the paper “V-PARVUS. An Extendible Package of programs for explorative data analysis, classification and regression analysis” by Forina, M., Lanteri, S., Armanino, C., Casolino, C., Casale, M., Oliveri, P.
Olive oil dataset
There is a dataset of olive oil measurements associated with the same paper. I haven’t tried using it, but I’m sure I’ll use it in an example one day.