Clustering Wholesale Customer
Hello everyone, this is team Kihakhak. For our final project, among the diverse machine learning algorithms, we decided to focus on unsupervised algorithms, especially clustering.
These are our team members.
While searching for problems to solve with machine learning, we found that the distribution industry has a lot to consider, such as customer sales and sales items. Running a business with thousands of customers requires a deeper understanding of those customers, but it is hard to identify the features that most strongly influence customers' decisions because there are so many factors. We therefore decided to apply an unsupervised machine learning algorithm to this problem and help owners build efficient strategies for their customers. To sum up, our project goal is to cluster customers based on their spending and analyze whether specific characteristics commonly appear within each cluster.
We used the Wholesale customers dataset from Kaggle. There are eight columns in the dataset. The first two columns are Channel and Region.
In the Channel column, each number specifies a distribution channel: HoReCa (hotel, restaurant, cafe) or retail. In the Region column, each number specifies a location. The other six columns represent annual spending on six product categories: Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen (delicatessen).
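For reference, a minimal sketch of loading and inspecting the dataset with pandas (the CSV file name follows Kaggle's copy of the UCI dataset; adjust the path to your own download):

```python
import pandas as pd

# Kaggle's copy of the UCI "Wholesale customers" data;
# adjust the file name/path to wherever your download lives.
df = pd.read_csv("Wholesale customers data.csv")

print(df.columns.tolist())
# ['Channel', 'Region', 'Fresh', 'Milk', 'Grocery',
#  'Frozen', 'Detergents_Paper', 'Delicassen']

# Per the UCI documentation: Channel 1 = HoReCa, 2 = Retail;
# Region 1 = Lisbon, 2 = Oporto, 3 = other regions.
print(df.describe())
```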
We came up with three candidate models: DBSCAN, K-means, and Hierarchical Clustering.
Our first candidate, DBSCAN, was eliminated because it failed to return accurate results even though we preprocessed the data as much as we could. As shown above, the data points are packed together at similarly high density, which makes it difficult to separate them into different clusters. With a smaller epsilon value, too many points are flagged as outliers; with a larger value, all points are merged into a single cluster.
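A sketch of the kind of epsilon sweep that exposes this behavior (the eps values here are illustrative, not the exact grid we searched; the behavior at the extremes is what matters):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

cols = ["Fresh", "Milk", "Grocery", "Frozen",
        "Detergents_Paper", "Delicassen"]
X = StandardScaler().fit_transform(np.log1p(df[cols]))

# Sweep epsilon: small values flag most points as noise (-1),
# large values merge everything into one cluster.
for eps in (0.3, 0.5, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```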
That left two candidates, and we decided to train both of them because they showed similar performance.
For training, we preprocessed the data for each model, searched for appropriate hyperparameter values, and ran the models. We then selected the model with the best evaluation metric, analyzed the results, and tried to overcome the limitations of unsupervised learning.
For the K-means model, we preprocessed the data through log normalization, as shown on the left, and replaced outliers, as shown on the right.
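A sketch of that preprocessing, assuming a log transform plus IQR-based capping as the outlier-replacement rule (the post does not spell out the exact rule we used):

```python
import numpy as np

spend_cols = ["Fresh", "Milk", "Grocery", "Frozen",
              "Detergents_Paper", "Delicassen"]
X = np.log1p(df[spend_cols])  # log transform tames the heavy right skew

# Replace outliers by capping at the Tukey fences (1.5 * IQR);
# this is an assumed replacement rule, shown for illustration.
for c in spend_cols:
    q1, q3 = X[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    X[c] = X[c].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```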
To find the optimal hyperparameter k, we used the elbow method and chose k = 6. To double-check, we computed the silhouette coefficient, which also indicates that 6 is the optimal number.
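A sketch of both checks on the preprocessed matrix X from above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method: watch where inertia stops dropping sharply,
# and cross-check with the average silhouette coefficient.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```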
Next, for Hierarchical Clustering, the data preprocessing was done in a similar way to K-means.
To compute the distance between clusters, we tried each linkage method and concluded that the ward method was the most efficient.
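A sketch of one way to compare linkage methods, scoring each by the cophenetic correlation coefficient; this metric is an assumption on our part, as the post does not state the exact comparison criterion:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, cophenet, dendrogram
from scipy.spatial.distance import pdist

dists = pdist(X)  # condensed pairwise Euclidean distances

# Cophenetic correlation: how faithfully each hierarchy
# preserves the original pairwise distances.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    print(f"{method}: cophenetic corr = {cophenet(Z, dists)[0]:.3f}")

# Dendrogram for the chosen (ward) linkage, used to eyeball
# inter-cluster distances when picking k.
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="lastp", p=12)
plt.show()
```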
We then ran the hierarchical model and found that the optimal k is 5 or 6, judging by the inter-cluster distances in the dendrogram shown.
Using the silhouette coefficient, we concluded that the optimal k is 6, since all clusters reach the average silhouette score, unlike when k is 5.
When k is 5, although the two clusters in the boxes are at similar distances, the left box is split into a different cluster while the right box remains in the same cluster. This supports 6 as the optimal k.
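A sketch of that per-cluster silhouette check, cutting the ward dendrogram from above at k = 5 and k = 6 (the criterion being that, at the optimal k, every cluster's silhouette profile should reach the overall average):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_samples, silhouette_score

for k in (5, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    avg = silhouette_score(X, labels)
    print(f"k={k}: average silhouette = {avg:.3f}")
    scores = silhouette_samples(X, labels)
    for c in np.unique(labels):
        status = "reaches" if scores[labels == c].max() >= avg else "misses"
        print(f"  cluster {c} {status} the average")
```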
Finally, for the clustering result, we concluded that there are six clusters for both K-means and hierarchical clustering.
From the results, we noticed a remarkable point: the clusters line up with particular combinations of Region and Channel.
However, to explain the result better, we used a supervised method to open up the black box of the unsupervised model. First, we converted the cluster labels into one-vs-all binary labels and trained a classifier to discriminate each cluster from the others. Then we extracted feature importances from the model using a Random Forest classifier.
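A sketch of that one-vs-all probing step, assuming the hierarchical labels from above as the final clustering (the same idea applies to the K-means labels):

```python
from sklearn.ensemble import RandomForestClassifier

labels = fcluster(Z, t=6, criterion="maxclust")  # final 6 clusters

for c in np.unique(labels):
    y = (labels == c).astype(int)  # one-vs-all binary target
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(X, y)
    ranked = sorted(zip(spend_cols, rf.feature_importances_),
                    key=lambda t: t[1], reverse=True)
    print(f"cluster {c}: top features = {ranked[:3]}")
```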
git : https://github.com/ml-clustering-proj/wholesale-clustering