Hierarchical Clustering#
We want to find clusters in a set of celadons (early porcelains). A data set with material properties of a number of celadons found in China is available from the UCI Machine Learning Repository. The data set originates from research published in "Data-driven research on chemical features of Jingdezhen and Longquan celadon by energy dispersive X-ray fluorescence" by Ziyang He, Maolin Zhang, and Haozhe Zhang. A free preprint PDF file is available, too.
Task: Read (at least) the last paragraph of section 1 and section 2 of the aforementioned preprint. How many celadon samples do you expect in the data set after reading?
Solution:
# your answer
Understanding the data#
The data comes as a CSV file.
Task: Load the data into a data frame. Why are there so many samples?
Solution:
# your solution
Task: Create a NumPy array holding the data with one row per sample and one column per feature.
Solution:
# your solution
Task: Create a list of sample names.
Solution:
# your solution
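As a rough illustration of the three steps above (data frame, NumPy array, list of names), here is a minimal sketch on a tiny in-memory CSV. The column names `name`, `x1`, `x2` and all values are made up for this example and do not reflect the layout of the actual celadon file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; names and values are invented.
csv_text = """name,x1,x2
A-01,0.62,69.9
A-02,0.57,70.2
B-01,0.41,64.8
"""

# pd.read_csv accepts a file path as well as a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Numeric columns only: one row per sample, one column per feature.
X = df.select_dtypes(include="number").to_numpy()

# Sample names as a plain Python list.
names = df["name"].tolist()

print(X.shape)
print(names)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to the CSV, and the name column would have to be looked up in the file itself.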
Preprocessing#
Units of measurement are weight per cent for body features and parts per million for glaze features. Since all features are equally important for finding similar celadons, we should standardize each feature independently. The assumption of equal importance may not be correct, but without further domain knowledge we cannot do better.
Task: Standardize all features.
Solution:
# your solution
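To make the effect of standardization concrete, the following sketch rescales two synthetic features that live on very different scales. The data is randomly generated and only mimics a per cent-like and a ppm-like column; it is not taken from the celadon data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: column 0 on a per cent-like scale, column 1 on a ppm-like scale.
X = np.column_stack([
    rng.normal(70, 2, size=50),
    rng.normal(900, 150, size=50),
])

# Standardize each feature (column) independently: zero mean, unit standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```

Scikit-learn's `StandardScaler` performs the same column-wise transformation and may be more convenient when the scaling has to be reapplied to new data.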
Hierarchical Clustering#
Task: Plot a dendrogram and determine a sensible number of clusters.
Solution:
# your solution
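One possible way to obtain a dendrogram is SciPy's `linkage` and `dendrogram` functions. The sketch below uses three synthetic point clouds instead of the celadon data, and Ward linkage is an arbitrary choice made for this example; other linkage methods may suit the real data better.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in: three well separated blobs with 10 points each.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Each row of Z describes one merge: the two merged clusters, the merge
# distance, and the size of the resulting cluster.
Z = linkage(X, method="ward")

fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, ax=ax, no_labels=True)
ax.set_ylabel("merge distance")
```

In a notebook the figure is rendered automatically; in a script, `plt.show()` is needed. A large jump in merge distance between consecutive merges hints at a sensible place to cut the tree.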
Task: Cluster data into the chosen number of clusters.
Solution:
# your solution
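Once a number of clusters is chosen, the tree can be cut accordingly. A minimal sketch with SciPy's `fcluster`, again on synthetic blobs and with Ward linkage as an assumption of this example:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in: three well separated blobs with 10 points each.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
labels = fcluster(Z, t=3, criterion="maxclust")

# Cluster sizes (index 0 is unused because labels start at 1).
print(np.bincount(labels)[1:])
```

Scikit-learn's `AgglomerativeClustering` offers a similar result through the usual `fit_predict` interface if no dendrogram is needed.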
\(k\)-Means Clustering#
Task: Use \(k\)-means for clustering. Determine a good \(k\).
Solution:
# your solution
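There are several heuristics for choosing \(k\). The sketch below uses the silhouette score on synthetic blobs; this is only one option (the elbow of `inertia_` over \(k\) is another), and the range of candidate values is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in: three well separated blobs with 10 points each.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score (higher is better).
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```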
Task: Compare results from hierarchical and \(k\)-means clustering.
Solution:
# your solution
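Cluster labels from two methods are arbitrary, so they cannot be compared entry by entry. One way around this, sketched here on synthetic blobs, is a contingency table together with the adjusted Rand index, which is 1 for identical partitions and close to 0 for unrelated ones. The choice of three clusters and Ward linkage is an assumption of this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

rng = np.random.default_rng(0)

# Synthetic stand-in: three well separated blobs with 10 points each.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

hier_labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows: hierarchical clusters, columns: k-means clusters, entries: shared samples.
table = contingency_matrix(hier_labels, km_labels)

# Agreement of the two partitions, invariant under relabeling of clusters.
ari = adjusted_rand_score(hier_labels, km_labels)

print(table)
print(round(ari, 3))
```

If each row of the table has a single nonzero entry, both methods found the same groups up to renaming.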