Analysis of Distribution-based Clustering Methods

Created by

- ERP 25367

Published on December 31, 2022

Clustering is a data mining and machine learning technique that partitions a dataset into clusters, each consisting of data points that are similar to one another. Clustering algorithms come in a wide range of variations. In this blog article, we will examine several clustering techniques on various types of datasets, along with some advantages and disadvantages of applying them.

Introduction

As I previously stated, more than 100 different clustering algorithms exist, and the one best suited for a given dataset depends on the properties of the data and the goals of the study. In this blog we will focus on several clustering algorithms, including Mini-Batch KMeans, Agglomerative Clustering, and Gaussian Mixture Models (GMM).

Within these clustering methods, probabilistic models are important because they provide a framework for understanding the uncertainty and variability in the data. GMM, a model-based method, is a probabilistic model that assumes the data points in a cluster are drawn from a mixture of several different normal distributions. The purpose of this blog is to compare GMM with other clustering approaches and analyse in which cases GMM performs well and in which it does not.
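To make this concrete, here is a minimal sketch of GMM's generative assumption, using toy data invented for illustration: points sampled from two normal distributions, which a two-component GaussianMixture then recovers.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 1-D data drawn from a mixture of two normals,
# mirroring GMM's generative assumption.
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())        # recovered component means, near -2 and 3
print(gmm.weights_)              # mixing proportions, near 0.6 and 0.4
print(gmm.predict_proba(X[:3]))  # soft (probabilistic) cluster assignments
```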

Dataset Description

We will be using 10 different datasets, all collected from the UCI Machine Learning Repository. Here is the description:

Dataset description


Results and Findings

Before doing hyperparameter tuning for Mini-Batch KMeans and Agglomerative Clustering, we plotted an elbow curve and a dendrogram to obtain the ideal number of clusters for each dataset.

For example, based on the elbow curve the number of clusters for dataset 1 is 7, while based on the dendrogram it is 2, and so on. A total of 10 figures were generated.

Elbow and Dendrogram graph
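As a rough sketch of how such plots can be produced (here on a stand-in dataset; the actual 10 UCI datasets would be loaded in its place):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

X = load_iris().data  # stand-in for one of the 10 UCI datasets

# Elbow curve: inertia against k; the "bend" suggests the cluster count.
ks = range(1, 11)
inertias = [MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
plt.figure()
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")

# Dendrogram: the merge tree behind agglomerative clustering;
# cutting across the largest vertical gap suggests the cluster count.
plt.figure()
dendrogram(linkage(X, method="ward"))
plt.show()
```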

After computing the elbow curve and dendrogram, we must now perform hyperparameter tuning for all the models other than GMM.

Let's look at the hyperparameter tuning for the first dataset:

Single-parameter-tuning-graph

To find the best parameter combination for each of the listed clustering models, we choose the highest point on each line, which gives us the best parameters for that model. (Example: for Mini-Batch KMeans, combination 4 is the best as it gives the highest score value, and so on for the other models.) This graph was likewise plotted for each dataset; a total of 10 figures were generated:

combined-parameter-tuning-graph
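The exact score plotted in these figures is not reproduced here; assuming it is the silhouette score, a sketch of the combination search for one model (with a hypothetical parameter grid) might look like:

```python
from itertools import product
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data  # stand-in dataset

# Hypothetical parameter combinations; the real grids may differ.
combos = list(product([2, 3, 5, 7], ["k-means++", "random"]))

scores = {
    combo: silhouette_score(
        X,
        MiniBatchKMeans(n_clusters=combo[0], init=combo[1],
                        n_init=10, random_state=0).fit_predict(X),
    )
    for combo in combos
}
best = max(scores, key=scores.get)
print("best (n_clusters, init):", best, "score:", round(scores[best], 3))
```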

Now, after computing the best parameters of each model for each dataset, it's time to perform hyperparameter tuning for GMM. This is done using GridSearchCV and a user-defined score function that returns the negative BIC, since GridSearchCV is designed to maximize a score (maximizing the negative BIC is equivalent to minimizing the BIC).
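A minimal sketch of this setup (the parameter grid here is illustrative; the actual search may cover different ranges):

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# GridSearchCV maximizes its score, so the scorer returns the negative BIC:
# maximizing -BIC is equivalent to minimizing BIC.
def gmm_bic_score(estimator, X):
    return -estimator.bic(X)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}

X = load_iris().data  # stand-in for one of the 10 datasets
grid_search = GridSearchCV(GaussianMixture(random_state=0),
                           param_grid=param_grid, scoring=gmm_bic_score)
grid_search.fit(X)
print(grid_search.best_params_)
```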

Let's look at the hyperparameter tuning of GMM for the first dataset:

single-parameter-tuning-graph-gmm

Keeping the number of components constant at 1, we can clearly observe that changing the covariance type yields fairly similar BIC values, without much variation, and so on. Moreover, we can see that for 6 components, the "diag" covariance type yields the minimum BIC value compared to all other combinations. This graph was likewise plotted for each dataset; a total of 10 figures were generated:

combined-parameter-tuning-graph-gmm
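The BIC values behind such a figure can also be tabulated directly; a small sketch (again on a stand-in dataset):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data  # stand-in dataset

# BIC for every (covariance_type, n_components) pair;
# the lowest value marks the preferred combination.
for cov in ["spherical", "tied", "diag", "full"]:
    bics = [GaussianMixture(n_components=n, covariance_type=cov,
                            random_state=0).fit(X).bic(X)
            for n in range(1, 7)]
    print(f"{cov:>9}:", np.round(bics, 1))
```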

As we have now performed hyperparameter tuning for each model, we have to compare the GMM score against the scores of the rest of the models for each dataset.

model-comparison-graph
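For reference, a sketch of how such a comparison can be computed, assuming the common score is the silhouette and using hypothetical tuned parameters:

```python
from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X = load_iris().data  # stand-in dataset

# Hypothetical tuned models; the real parameters come from the tuning above.
models = {
    "Mini-Batch KMeans": MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "GMM": GaussianMixture(n_components=3, covariance_type="diag", random_state=0),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
```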

Conclusion

As you can see, for half of the datasets the GMM score is much lower than that of any other model, while for the other half GMM did perform well in comparison. There are several possible explanations for why a Gaussian mixture model (GMM) might not function effectively in a specific circumstance.

Some reasons are:

- The data may not actually follow a mixture of Gaussians; GMM assumes each cluster is roughly normally distributed, so heavily skewed or irregularly shaped clusters violate this assumption.
- The EM algorithm used to fit a GMM is sensitive to initialization and can converge to a poor local optimum.
- A mis-specified number of components or covariance type can cause the model to underfit or overfit the data.

It's also possible that the GMM is not the most appropriate model for the task at hand, and a different type of model would perform better.

For the other half of the datasets, there can be multiple reasons why GMM gives a high score. Some reasons are:

- The clusters are well approximated by Gaussian components, e.g., elliptical clusters of differing size and orientation, which covariance types such as "full" or "diag" can capture.
- GMM's soft (probabilistic) assignments handle overlapping clusters better than hard-assignment methods.
- The BIC-based tuning selected a number of components and covariance type well matched to the data.