Supermarket Customers

To understand general customer behavior and to enable targeted advertising, supermarket companies have to identify groups of customers with similar behavior. Sending all customers identical advertisements fails to meet most customers' needs, while producing individual advertisements for each potential customer would be too expensive. Thus, clustering customers into a handful of groups is a sensible middle ground.

There are several supermarket data sets available. We use the one provided on Michele Coscia’s website. It is sufficiently rich and has a simple structure. The data set is free to use (private communication with M. Coscia). The data comes from Italian Coop supermarkets and is described in more detail in Explaining the Product Range Effect in Purchase Data.

Understanding the Data Set

Task: Get the data set. Read section III of the accompanying paper up to the end of the left column on page 3. Then answer the following questions:

  • How many shops?

  • How many customers?

  • Are there customers missing in the data?

  • Which time interval?

  • What’s the detail level of products?

  • How many products?

Solution:

# your answer

Task: Load prices and purchases data.

Solution:

# your solution

Task: Collect the following product information:

  • number of shops selling the product,

  • total quantity sold,

  • number of customers who bought the product,

  • maximum quantity bought by one customer.

Compute the minimum, average, and maximum of each of these values.

Solution:

# your solution

Task: Collect the following customer information:

  • number of shops visited,

  • total number of items bought,

  • number of different products bought.

Compute the minimum, average, and maximum of each of these values.

Solution:

# your solution

Cleaning the Data Set

The aim of this project is to cluster the set of customers into a handful of groups for targeted advertising. Outliers are not of interest; keeping them would make the number of groups too large and advertising too expensive.

We are only interested in average customers and products. For example, products bought by only very few customers or not available in all shops should be removed from the data set. Customers who shop only occasionally should be removed, too.
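
This kind of filtering can be sketched with pandas boolean masks. The toy table and the column names (`customer`, `product`, `quantity`) below are hypothetical and will likely differ from the actual data files; the threshold is arbitrary and chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy purchases table; real column names and thresholds differ.
purchases = pd.DataFrame({
    "customer": [1, 1, 2, 2, 3],
    "product": [10, 11, 10, 12, 10],
    "quantity": [5, 1, 3, 2, 4],
})

# Keep only products bought by at least 2 different customers.
buyers = purchases.groupby("product")["customer"].nunique()
keep = buyers[buyers >= 2].index
filtered = purchases[purchases["product"].isin(keep)]
```

Note that removing products changes the per-customer statistics and vice versa, so in practice the filters may have to be applied repeatedly until no further rows are removed.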

Task: Remove products, customers, and purchases until the remaining data satisfies all of the following conditions:

  • each product has been sold in all shops,

  • each product has been sold at least 1000 times,

  • each product has been bought by at least 100 different customers,

  • each product has been bought at least 4 times by at least one customer,

  • each customer bought at least 10 items per month (on average),

  • each customer bought at least 20 different products.

How many products, customers, purchases do we have now?

Solution:

# your solution

Preparing Data for Clustering

We want to use Scikit-Learn’s \(k\)-means implementation for clustering. Thus, we have to bring our data into the right shape.
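
For orientation, Scikit-Learn's `KMeans` expects a 2-D array of shape `(n_samples, n_features)`; here, one row per customer and one column per product. A minimal sketch with made-up numbers (the matrix below is purely illustrative, not taken from the data set):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy customer-product matrix: 6 customers (rows) x 3 products (columns).
X = np.array([[5, 0, 1],
              [4, 1, 0],
              [0, 5, 4],
              [1, 4, 5],
              [5, 1, 1],
              [0, 4, 5]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.labels_ assigns each row (customer) to a cluster;
# km.cluster_centers_ has shape (2, 3): one center per cluster.
```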

Task: Create a NumPy array with one row per customer and one column per product. Store the quantities of products bought by each customer in the array. Sort customers in descending order of the total number of items they bought, and products in descending order of the total quantity sold (sorting may simplify visualization later on).

Solution:

# your solution

Scaling

Scaling the data will influence the clustering process. We have several options:

  • Without scaling, products sold in large quantities dominate the Euclidean distance between the product vectors of two customers. Thus, two customers have a small distance if the products they bought most often coincide.

  • Standardization per product ensures that the total quantity sold of a product does not matter. The squared distance between two product vectors is then the sum over products of the squared differences of the standardized quantities bought by the two customers. Customers buying similar quantities of each product will have a small distance.

  • If we are more interested in the selection of products of each customer than in the quantities bought, we should normalize the product vectors. Then the total quantity bought by each customer is identical and the data only contains information on the composition of each customer’s shopping cart. Here the \(\ell^1\)-norm should be used: all product quantities then sum to 1 and can be interpreted as the probability that a given product is bought by the customer.
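
The two scaling variants can be written in a few lines of NumPy; the matrix below is a made-up example, not data from the set:

```python
import numpy as np

# Toy customer-product matrix (rows: customers, columns: products).
X = np.array([[5, 0, 1],
              [4, 1, 0],
              [0, 5, 4],
              [1, 4, 5]], dtype=float)

# Per-product standardization: each column gets mean 0 and std 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Per-customer l1 normalization: each row sums to 1.
X_l1 = X / X.sum(axis=1, keepdims=True)
```

Scikit-Learn offers the same operations as `sklearn.preprocessing.StandardScaler` and `sklearn.preprocessing.normalize(X, norm="l1")`; the `StandardScaler` route has the advantage that the scaling can be inverted later.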

Task: Prepare product-wise standardized data and customer-wise normalized data.

Solution:

# your solution

Clustering

Task: Cluster the data set with \(k\)-means for the unscaled, standardized, and normalized data. Choose a good \(k\) for each variant and keep the three KMeans objects with the best \(k\) for further analysis.

Solution:

# your solution

Analyzing the Clusters

Now that we have identified groups of customers with similar behavior, it’s time to understand those groups. Remember that we want to adapt our advertising campaign to customer behavior.

Task: Visualize the cluster centers in a quantity versus product index plot. Don’t forget to transform the centers back to the original scale.

Solution:

# your solution

Each cluster center can be regarded as a prototype customer of the cluster.
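
When the data was standardized with a `StandardScaler`, the cluster centers live in standardized coordinates and have to be mapped back before they can be read as prototype customers. A minimal sketch with a made-up matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy customer-product matrix (rows: customers, columns: products).
X = np.array([[5, 0, 1],
              [4, 1, 0],
              [0, 5, 4],
              [1, 4, 5]], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)

# Map the centers back to original quantities before interpreting
# them as prototype customers.
centers = scaler.inverse_transform(km.cluster_centers_)
```

Each row of `centers` is then the (approximate) mean shopping cart of one customer group, in original quantity units.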

Task: Characterize the prototype customers for the unscaled and standardized data in words that a person designing advertising campaigns can understand.

Solution:

# your answer

The first two clusterings are more or less trivial and useless for targeted advertising. The third clustering deserves further investigation.

Task: Get the 100 most popular products (highest average per-customer quantity) per cluster for the third clustering. By how many products do the top-100 lists differ between clusters? Do the same for the first clustering and for a random clustering.

Solution:

# your solution

Task: Consider the third clustering only. Does one of the customer groups buy higher quantities than the others? Visualize the answer to this question for different price ranges.

Solution:

# your solution