Hierarchical Clustering#

We want to find clusters in a set of celadons (early porcelains). A data set with material properties of a number of celadons found in China is available from the UCI Machine Learning Repository. The data set originates from research published in "Data-driven research on chemical features of Jingdezhen and Longquan celadon by energy dispersive X-ray fluorescence" by Ziyang He, Maolin Zhang, and Haozhe Zhang. A free preprint PDF is available, too.

Task: Read (at least) the last paragraph of section 1 and section 2 of the aforementioned preprint. How many celadon samples do you expect in the data set after reading?

Solution:

# your answer

Understanding the data#

The data comes as a CSV file.

Task: Load the data into a data frame. Why are there so many samples?

Solution:

# your solution

Task: Create a NumPy array holding the data with one row per sample and one column per feature.

Solution:

# your solution

Task: Create a list of sample names.

Solution:

# your solution
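
The three steps above (data frame, NumPy array, list of names) can be sketched on a tiny made-up CSV. This is only an illustration of the mechanics: the column names `name`, `feat_a`, `feat_b` and all values are hypothetical placeholders, not the actual columns of the celadon file, so adapt them after inspecting the real header.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; names and values are made up.
csv_text = """name,feat_a,feat_b
s1,1.0,10.0
s2,2.0,20.0
s3,3.0,30.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Numeric columns only: one row per sample, one column per feature.
X = df.drop(columns=["name"]).to_numpy(dtype=float)

# Sample names as a plain Python list.
names = df["name"].tolist()

print(X.shape)
print(names)
```

For the real data, `pd.read_csv` would take the path of the downloaded file instead of the `io.StringIO` object.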

Preprocessing#

Units of measurement are weight per cent for body features and parts per million for glaze features. Since we treat all features as equally important for finding similar celadons, we should standardize each feature independently, so that differences in scale do not let some features dominate the distance computation. The assumption of equal importance may not be correct, but without further domain knowledge we cannot do better.
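
To make concrete what standardizing each feature independently means, here is a minimal sketch on synthetic data. The two columns are random numbers on a per-cent-like and a ppm-like scale, chosen only for illustration; they are not taken from the celadon data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: column 0 on a per-cent-like scale, column 1 on a ppm-like scale.
X = np.column_stack([
    rng.normal(loc=70.0, scale=2.0, size=50),
    rng.normal(loc=500.0, scale=150.0, size=50),
])

# Standardize each column separately:
# subtract the column mean, then divide by the column standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```

scikit-learn's `StandardScaler` performs the same column-wise transformation and may be more convenient in a larger pipeline.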

Task: Standardize all features.

Solution:

# your solution

Hierarchical Clustering#

Task: Plot a dendrogram and determine a sensible number of clusters.

Solution:

# your solution

Task: Cluster data into the chosen number of clusters.

Solution:

# your solution
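
One possible way to approach the two tasks above is SciPy's `scipy.cluster.hierarchy` module. The sketch below runs on synthetic, well-separated blobs rather than the celadon data. Ward linkage and the choice of three clusters are assumptions made for this toy example; neither is prescribed by the exercise.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic, well-separated blobs standing in for standardized features.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Agglomerative clustering with Ward linkage (an assumed choice).
Z = linkage(X, method="ward")

# Compute the dendrogram layout without drawing it;
# drop no_plot=True to draw the tree with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into a fixed number of flat clusters (labels start at 1).
n_clusters = 3
labels = fcluster(Z, t=n_clusters, criterion="maxclust")

print(np.bincount(labels)[1:])  # cluster sizes
```

In a plotted dendrogram, long vertical gaps between successive merges suggest where to cut the tree, and hence how many clusters to ask for.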

\(k\)-Means Clustering#

Task: Use \(k\)-means for clustering. Determine a good \(k\).

Solution:

# your solution
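
A common heuristic for choosing \(k\) is to run \(k\)-means for several values and look for an "elbow" in the within-cluster sum of squares. The following sketch uses scikit-learn's `KMeans` on synthetic blobs. The data, the range of \(k\), and the settings `n_init=10` and `random_state=0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic, well-separated blobs standing in for standardized features.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Within-cluster sum of squares (inertia) for a range of k.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))

# Final clustering with the k suggested by the elbow (3 for this toy data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

On real data the elbow is often less pronounced than in this toy example, so it is worth cross-checking the choice against the dendrogram.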

Task: Compare results from hierarchical and \(k\)-means clustering.

Solution:

# your solution
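
Cluster labels are arbitrary identifiers, so two clusterings cannot be compared by checking whether the label vectors are equal. One simple option is a contingency table counting how often each pair of labels occurs together. The label vectors below are hypothetical and serve only to show the mechanics.

```python
import numpy as np

# Hypothetical label vectors from two clusterings of the same six samples.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([1, 1, 0, 0, 2, 1])

# table[i, j] counts the samples with label i in the first
# clustering and label j in the second clustering.
table = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
np.add.at(table, (labels_a, labels_b), 1)

print(table)
```

If every row and every column of the table has exactly one nonzero entry, the two clusterings agree up to a renaming of the labels. For a single-number summary, scikit-learn's `adjusted_rand_score` is one possible choice.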