- Overview of Clustering
- Types of Clustering
- Types of Clustering Algorithms with Description
- Applications of Clustering
- Methods of Clustering in a Summary Table
Overview of Clustering
Many things around us can be categorized as “this and that” or to be less vague and more specific, we have groupings that could be binary or groups that can be more than two, like a type of pizza base or type of car that you might want to purchase. The choices are always clear – or, how the technical lingo wants to put it – predefined groups and the process predicting that is an important process in the Data Science stack called Classification.
But what if we bring into play a quest where we don’t have pre-defined choices initially, rather, we derive those choices! Choices that are based out of hidden patterns, underlying similarities between the constituent variables, salient features from the data etc. This process is known as Clustering in Machine Learning or Cluster Analysis, where we group the data together into an unknown number of groups and later use that information for further business processes.
You may also like to read: What is Machine Learning?
So, to put it in simple words, in machine learning clustering is the process by which we create groups in a data, like customers, products, employees, text documents, in such a way that objects falling into one group exhibit many similar properties with each other and are different from objects that fall in the other groups that got created during the process.
Clustering algorithms take the data and using some sort of similarity metrics, they form these groups – later these groups can be used in various business processes like information retrieval, pattern recognition, image processing, data compression, bioinformatics etc. In the Machine Learning process for Clustering, as mentioned above, a distance-based similarity metric plays a pivotal role in deciding the clustering.
In this article, we shall understand the various types of clustering, numerous clustering methods used in machine learning and eventually see how they are key to solve various business problems
Types of clustering methods
As we made a point earlier that for a successful grouping, we need to attain two major goals: one, a similarity between one data point with another and two, a distinction of those similar data points with others which most certainly, heuristically differ from those points. The basis of such divisions begin with our ability to scale large datasets and that’s a major beginning point for us. Once we are through it, we are presented with a challenge that our data contains different kinds of attributes – categorical, continuous data, etc., and we should be able to deal with them. Now, we know that our data these days is not limited in terms of dimensions, we have data that is multi-dimensional in nature. The clustering algorithm that we intend to use should successfully cross this hurdle as well.
The clusters that we need, should not only be able to distinguish data points but also they should be inclusive. Sure, a distance metric helps a lot but the cluster shape is often limited to being a geometric shape and many important data points get excluded. This problem too needs to be taken care of.
In our progress, we notice that our data is highly “noisy” in nature. Many unwanted features have been residing in the data which makes it rather Herculean task to bring about any similarity between the data points – leading to the creation of improper groups. As we move towards the end of the line, we are faced with a challenge of business interpretation. The outputs from the clustering algorithm should be understandable and should fit the business criteria and address the business problem correctly.
To address the problem points above – scalability, attributes, dimensional, boundary shape, noise and interpretation – we have various types of clustering methods that solve one or many of these problems and of course, many statistical and machine learning clustering algorithms that implement the methodology.
The various types of clustering are:
- Connectivity-based Clustering (Hierarchical clustering)
- Centroids-based Clustering (Partitioning methods)
- Distribution-based Clustering
- Density-based Clustering (Model-based methods)
- Fuzzy Clustering
- Constraint-based (Supervised Clustering)
1. Connectivity-Based Clustering (Hierarchical Clustering)
Hierarchical Clustering is a method of unsupervised machine learning clustering where it begins with a pre-defined top to bottom hierarchy of clusters. It then proceeds to perform a decomposition of the data objects based on this hierarchy, hence obtaining the clusters. This method follows two approaches based on the direction of progress, i.e., whether it is the top-down or bottom-up flow of creating clusters. These are Divisive Approach and the Agglomerative Approach respectively.
1.1 Divisive Approach
This approach of hierarchical clustering follows a top-down approach where we consider that all the data points belong to one large cluster and try to divide the data into smaller groups based on a termination logic or, a point beyond which there will be no further division of data points. This termination logic can be based on the minimum sum of squares of error inside a cluster or for categorical data, the metric can be the GINI coefficient inside a cluster.
Hence, iteratively, we are splitting the data which was once grouped as a single large cluster, to “n” number of smaller clusters in which the data points now belong to.
It must be taken into account that this algorithm is highly “rigid” when splitting the clusters – meaning, one a clustering is done inside a loop, there is no way that the task can be undone.
1.2 Agglomerative Approach
Agglomerative is quite the contrary to Divisive, where all the “N” data points are considered to be a single member of “N” clusters that the data is comprised into. We iteratively combine these numerous “N” clusters to fewer number of clusters, let’s say “k” clusters and hence assign the data points to each of these clusters accordingly. This approach is a bottom-up one, and also uses a termination logic in combining the clusters. This logic can be a number based criterion (no more clusters beyond this point) or a distance criterion (clusters should not be too far apart to be merged) or variance criterion (increase in the variance of the cluster being merged should not exceed a threshold, Ward Method)
2. Centroid Based Clustering
Centroid based clustering is considered as one of the most simplest clustering algorithms, yet the most effective way of creating clusters and assigning data points to it. The intuition behind centroid based clustering is that a cluster is characterized and represented by a central vector and data points that are in close proximity to these vectors are assigned to the respective clusters.
These groups of clustering methods iteratively measure the distance between the clusters and the characteristic centroids using various distance metrics. These are either of Euclidian distance, Manhattan Distance or Minkowski Distance.
The major setback here is that we should either intuitively or scientifically (Elbow Method) define the number of clusters, “k”, to begin the iteration of any clustering machine learning algorithm to start assigning the data points.
Despite the flaws, Centroid based clustering has proven it’s worth over Hierarchical clustering when working with large datasets. Also, owing to its simplicity in implementation and also interpretation, these algorithms have wide application areas viz., market segmentation, customer segmentation, text topic retrieval, image segmentation etc.
3. Density-based Clustering (Model-based Methods)
If one looks into the previous two methods that we discussed, one would observe that both hierarchical and centroid based algorithms are dependent on a distance (similarity/proximity) metric. The very definition of a cluster is based on this metric. Density-based clustering methods take density into consideration instead of distances. Clusters are considered as the densest region in a data space, which is separated by regions of lower object density and it is defined as a maximal-set of connected points.
When performing most of the clustering, we take two major assumptions, one, the data is devoid of any noise and two, the shape of the cluster so formed is purely geometrical (circular or elliptical). The fact is, data always has some extent of inconsistency (noise) which cannot be ignored. Added to that, we must not limit ourselves to a fixed attribute shape, it is desirable to have arbitrary shapes so as to not to ignore any data points. These are the areas where density based algorithms have proven their worth!
Density-based algorithms can get us clusters with arbitrary shapes, clusters without any limitation in cluster sizes, clusters that contain the maximum level of homogeneity by ensuring the same levels of density within it, and also these clusters are inclusive of outliers or the noisy data.
4. Distribution-Based Clustering
Until now, the clustering techniques as we know are based around either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that take a totally different metric into consideration – probability. Distribution-based clustering creates and groups data points based on their likely hood of belonging to the same probability distribution (Gaussian, Binomial etc.) in the data.
The distribution models of clustering are most closely related to statistics as it very closely relates to the way how datasets are generated and arranged using random sampling principles i.e., to fetch data points from one form of distribution. Clusters can then be easily be defined as objects that are most likely to belong to the same distribution.
A major drawback of density and boundary-based approaches is in specifying the clusters apriori to some of the algorithms and mostly the definition of the shape of the clusters for most of the algorithms. There is at least one tuning or hyper-parameter which needs to be selected and not only that is trivial but also any inconsistency in that would lead to unwanted results.
Distribution based clustering has a vivid advantage over the proximity and centroid based clustering methods in terms of flexibility, correctness and shape of the clusters formed. The major problem however is that these clustering methods work well only with synthetic or simulated data or with data where most of the data points most certainly belong to a predefined distribution, if not, the results will overfit.
5. Fuzzy Clustering
The general idea about clustering revolves around assigning data points to mutually exclusive clusters, meaning, a data point always resides uniquely inside a cluster and it cannot belong to more than one cluster. Fuzzy clustering methods change this paradigm by assigning a data-point to multiple clusters with a quantified degree of belongingness metric. The data-points that are in proximity to the center of a cluster, may also belong in the cluster that is at a higher degree than points in the edge of a cluster. The possibility of which an element belongs to a given cluster is measured by membership coefficient that vary from 0 to 1.
Fuzzy clustering can be used with datasets where the variables have a high level of overlap. It is a strongly preferred algorithm for Image Segmentation, especially in bioinformatics where identifying overlapping gene codes makes it difficult for generic clustering algorithms to differentiate between the image’s pixels and they fail to perform a proper clustering.
6. Constraint-based (Supervised Clustering)
The clustering process, in general, is based on the approach that the data can be divided into an optimal number of “unknown” groups. The underlying stages of all the clustering algorithms to find those hidden patterns and similarities, without any intervention or predefined conditions. However, in certain business scenarios, we might be required to partition the data based on certain constraints. Here is where a supervised version of clustering machine learning techniques come into play.
A constraint is defined as the desired properties of the clustering results, or a user’s expectation on the clusters so formed – this can be in terms of a fixed number of clusters, or, the cluster size, or, important dimensions (variables) that are required for the clustering process.
Usually, tree-based, Classification machine learning algorithms like Decision Trees, Random Forest, and Gradient Boosting, etc. are made use of to attain constraint-based clustering. A tree is constructed by splitting without the interference of the constraints or clustering labels. Then, the leaf nodes of the tree are combined together to form the clusters while incorporating the constraints and using suitable algorithms.
Types of Clustering Algorithms with Detailed Description
1. k-Means Clustering
k-Means is one of the most widely used and perhaps the simplest unsupervised algorithms to solve the clustering problems. Using this algorithm, we classify a given data set through a certain number of predetermined clusters or “k” clusters. Each cluster is assigned a designated cluster center and they are placed as much as possible far away from each other. Subsequently, each point belonging gets associated with it to the nearest centroid till no point is left unassigned. Once it is done, the centers are re-calculated and the above steps are repeated. The algorithm converges at a point where the centroids cannot move any further. This algorithm targets to minimize an objective function called the squared error function F(V) :
||xi – vj|| is the distance between Xi and Vj.
Ci is the count of data in cluster.C is the number of cluster centroids.
In R, there is a built-in function kmeans() and in Python, we make use of scikit-learn cluster module which has the KMeans function. (sklearn.cluster.KMeans)
1. Can be applied to any form of data – as long as the data has numerical (continuous) entities.
2. Much faster than other algorithms.
3. Easy to understand and interpret.
1. Fails for non-linear data.
2. It requires us to decide on the number of clusters before we start the algorithm – where the user needs to use additional mathematical methods and also heuristic knowledge to verify the correct number of centers.
3. This cannot work for Categorical data.
4. Cannot handle outliers.
a. Document clustering – high application area in Segmenting text-matrix related like data like DTM, TF-IDF etc.
b. Banking and Insurance fraud detection where majority of the columns represent a financial figure – continuous data.
c. Image segmentation.
d. Customer Segmentation.
2. Hierarchical clustering algorithm
As discussed in the earlier section, Hierarchical clustering methods follow two approaches – Divisive and Agglomerative types. Their implementation family contains two algorithms respectively, the divisive DIANA (Divisive Analysis) and AGNES (Agglomerative Nesting) for each of the approaches.
2.1 DIANA or Divisive Analysis
As discussed earlier, the divisive approach begins with one single cluster where all the data points belong to. Then it is split into multiple clusters and the data points get reassigned to each of the clusters on the basis of the nearest distance measure of the pairwise distance between the data points. These distance measures can be Ward’s Distance, Centroid Distance, average linkage, complete linkage or single linkage. Ideally, the algorithm continues until each data has its own cluster.
In R, we make use of the diana() fucntion from cluster package (cluster::diana)
2.2 Agglomerative Nesting or AGNES
AGNES starts by considering the fact that each data point has its own cluster, i.e., if there are n data rows, then the algorithm begins with n clusters initially. Then, iteratively, clusters that are most similar – again based on the distances as measured in DIANA – are now combined to form a larger cluster. The iterations are performed until we are left with one huge cluster that contains all the data-points.
In R, we make use of the agnes() function from cluster package (cluster::agnes()) or the built-in hclust() function from the native stats package. In python, the implementation can be found in scikit-learn package via the AgglomerativeClustering function inside the cluster module(sklearn.cluster.AgglomerativeClustering)
1. No prior knowledge about the number of clusters is needed, although the user needs to define a threshold for divisions.
2. Easy to implement across various forms of data and known to provide robust results for data generated via various sources. Hence it has a wide application area.
1. The cluster division (DIANA) or combination (AGNES) is really strict and once performed, it cannot be undone and re-assigned in subsequesnt iterations or re-runs.
2. It has a high time complexity, in the order of O(n^2 log n) for all the n data-points, hence cannot be used for larger datasets.
3. Cannot handle outliers and noise
1. Widely used in DNA sequencing to analyse the evolutionary history and the relationships among biological entities (Phylogenetics).
2. Identifying fake news by clustering the news article corpus, by assigning the tokens or words into these clusters and marking out suspicious and sensationalized words to get possible faux words.
3. Personalization and targeting in marketing and sales.
4. Classifying the incoming network traffic into a website by classifying the http requests into various clusters and then heuristically identifying the problematic clusters and eventually restricting them.
3. Fuzzy C Means Algorithm – FANNY (Fuzzy Analysis Clustering)
This algorithm follows the fuzzy cluster assignment methodology of clustering. The working of FCM Algorithm is almost similar to the k-means – distance-based cluster assignment – however, the major difference is, as mentioned earlier, that according to this algorithm, a data point can be put into more than one cluster. This degree of belongingness can be clearly seen in the cost function of this algorithm as shown below:
uij is the degree of belongingness of data xi to a cluster cj
μj is the cluster center of the cluster j
m is the fuzzifier.
So, just like the k-means algorithm, we first specify the number of clusters k and then assign the degree of belongingness to the cluster. We need to then repeat the algorithm till the max_iterations are reached, again which can be tuned according to the requirements.
In R, FCM can be implemented using fanny() from the cluster package (cluster::fanny) and in Python, fuzzy clustering can be performed using the cmeans() function from skfuzzy module. (skfuzzy.cmeans) and further, it can be adapted to be applied on new data using the predictor function (skfuzzy.cmeans_predict)
1. FCM works best for highly correlated and overlapped data, where k-means cannot give any conclusive results.
2. It is an unsupervised algorithm and it has a higher rate of convergence than other partitioning based algorithms.
1. We need to specify the number of clusters “k” prior to the start of the algorithm
2. Although convergence is always guaranteed but the process is very slow and this cannot be used for larger data.
3. Prone to errors if the data has noise and outliers.
1. Used widely in Image Segmentation of medical imagery, especially the images generated by an MRI.
2. Market definition and segmentation.
4. Mean Shift Clustering
Mean shift clustering is a form of nonparametric clustering approach which not only eliminates the need for apriori specification of the number of clusters but also it removes the spatial and shape constraints of the clusters – two of the major problems from the most widely preferred k-means algorithm.
It is a density-based clustering algorithm where it firstly, seeks for stationary points in the density function. Then next, the clusters are eventually shifted to a region with higher density by shifting the center of the cluster to the mean of the points present in the current window. The shift if the window is repeated until no more points can be accommodated inside of that window.
In R, bmsClustering() function from MeanShift package performs the clustering (MeanShift::bmsClustering()) and MeanShift() function in scikit learn package does the job in Python. (sklearn.cluster.MeanShift)
1. Non-parametric and number of clusters need not be specified apriori.
2. Owing to it’s density dependency, the shape of the cluster is not limited to circular or spherical.
3. More robust and more practical as it works for any form of data and the results are easily interpretable.
1. The selection of the window radius is highly arbitrary and cannot be related to any business logic and selecting incorrect window size is never desirable.
1. Image segmentation and computer vision – mostly used for handwritten text identification.
2. Image tracking in video analysis.
5. DBSCAN – Density-based Spatial Clustering
Density-based algorithms, in general, are pivotal in the application areas where we require non-linear cluster structures, purely based out of density. One of the ways how this principle can be made into reality is by using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. There are two major underlying concepts in DBSCAN – one, Density Reachability and second, Density Connectivity. This helps the algorithm to differentiate and separate regions with varying degrees of density – hence creating clusters.
For implementing DBSCAN, we first begin with defining two important parameters – a radius parameter eps (ϵ) and a minimum number of points within the radius (m).
a. The algorithm starts with a random data point that has not been accessed before and its neighborhood is marked according to ϵ.
b. If this contains all the m minimum points, then cluster formation begins – hence marking it as “visited” – if not, then it is labeled as “noise” for that iteration, which can get changed later.
c. If a next data point belongs to this cluster, then subsequently the ϵ neighborhood now around this point becomes a part of the cluster formed in the previous step. This step is repeated until there are no more data points that can follow Density Reachability and Density Connectivity.
d. Once this loop is exited, it moves to the next “unvisited” data point and creates further clusters or noise.
e. The algorithm converges when there are no more unvisited data points remain.
In Python its implemented via DBSCAN() function from scikit-learn cluster module (sklearn.cluster.DBSCAN) and in R its implemented through dbscan() from dbscan package (dbscan::dbscan(x, eps, minpts))
1. Doesn’t require prior specification of clusters.
2. Can easily deal with noise, not affected by outliers.
3. It has no strict shapes, it can correctly accommodate many data points.
1. Cannot work with datasets of varying densities.
2. Sensitive to the clustering hyper-parameters – the eps and the min_points.
3. Fails if the data is too sparse.
4. The density measures (Reachability and Connectivity) can be affected by sampling.
1. Used in document Network Analysis of text data for identifying plagiarism and copyrights in various scientific documents and scholarly articles.
2. Widely used in recommendation systems for various web applications and eCommerce websites.
3. Used in x-ray Crystallography to categorize the protein structure of a certain protein and to determine its interactions with other proteins in the strands.
4. Clustering in Social Network Analysis is implemented by DBSCAN where objects (points) are clustered based on the object’s linkage rather than similarity.
6. Gaussian Mixed Models (GMM) with Expectation-Maximization Clustering
In Gaussian Mixed Models, we assume that the data points follow a Gaussian distribution, which is never a constraint at all as compared to the restrictions in the previous algorithms. Added to that, this assumption can lead to important selecting criteria for the shape of the clusters – that is, cluster shapes can be now quantified. This quantification happens by use to the two most common and simple metrics – mean and variance.
To find the mean and variance, Expectation-Maximization is used, which is a form of optimization function. This function starts with random Gaussian parameters, say θ, and check if the Hypothesis confirms that a sample actually belongs to a cluster c. Once it does, we perform the maximization step where the Gaussian parameters are updated to fit the points assigned to the said cluster. Maximization step aims at increasing the likelihood of the sample belonging to the cluster distribution.
In Python, it is implenteded via the GaussianMixture() function from scikit-learn. (sklearn.mixture.GaussianMixture) and in R, it is implemented using GMM() from the clusteR package. (clusteR.GMM())
1. The associativity of a data point to a cluster is quantified using probability metrics – which can be easily interpreted.
2. Proven to be accurate for real-time data sets.
3. Some versions of GMM allows for mixed membership of data points, hence it can be a good alternative to Fuzzy C Means to achieve fuzzy clustering.
1. Complex algorithm and cannot be applicable to larger data
2. It is hard to find clusters if the data is not Gaussian, hence a lot of data preparation is required.
1. GMM has been more practically used in Topic Mining where we can associate multiple topics to a particular document (an atomic part of a text – a news article, online review, Twitter tweet etc.)
2. Spectral clustering, combined with Gaussian Mixed Models-EM is used in image processing.
Applications of clustering
We have seen numerous methodologies and approaches for clustering in machine learning and some of the important algorithms that implement those techniques. Let’s have a quick overview of business applications of clustering and understand its role in Data Mining.
- It is the backbone of search engine algorithms – where objects that are similar to each other must be presented together and dissimilar objects should be ignored. Also, it is required to fetch objects that are closely related to a search term, if not completely related.
- A similar application of text clustering like search engine can be seen in academics where clustering can help in the associative analysis of various documents – which can be in-turn used in – plagiarism, copyright infringement, patent analysis etc.
- Used in image segmentation in bioinformatics where clustering algorithms have proven their worth in detecting cancerous cells from various medical imagery – eliminating the prevalent human errors and other bias.
- Netflix has used clustering in implementing movie recommendations for its users.
- News summarization can be performed using Cluster analysis where articles can be divided into a group of related topics.
- Clustering is used in getting recommendations for sports training for athletes based on their goals and various body related metrics and assign the training regimen to the players accordingly.
- Marketing and sales applications use clustering to identify the Demand-Supply gap based on various past metrics – where a definitive meaning can be given to huge amounts of scattered data.
- Various job search portals use clustering to divide job posting requirements into organized groups which becomes easier for a job-seeker to apply and target for a suitable job.
- Resumes of job-seekers can be segmented into groups based on various factors like skill-sets, experience, strengths, type of projects, expertise etc., which makes potential employers connect with correct resources.
- Clustering effectively detects hidden patterns, rules, constraints, flow etc. based on various metrics of traffic density from GPS data and can be used for segmenting routes and suggesting users with best routes, location of essential services, search for objects on a map etc.
- Satellite imagery can be segmented to find suitable and arable lands for agriculture.
- Pizza Hut very famously used clustering to perform Customer Segmentation which helped them to target their campaigns effectively and helped increase their customer engagement across various channels.
- Clustering can help in getting customer persona analysis based on various metrics of Recency, Frequency, and Monetary metrics and build an effective User Profile – in-turn this can be used for Customer Loyalty methods to curb customer churn.
- Document clustering is effectively being used in preventing the spread of fake news on Social Media.
- Website network traffic can be divided into various segments and heuristically when we can prioritize the requests and also helps in detecting and preventing malicious activities.
- Fantasy sports have become a part of popular culture across the globe and clustering algorithms can be used in identifying team trends, aggregating expert ranking data, player similarities, and other strategies and recommendations for the users.
Different Clustering Methods
|Hierarchical Clustering||Based on top-to-bottom hierarchy of the data points to create clusters.||Easy to implement, the number of clusters need not be specified apriori, dendrograms are easy to interpret.||Cluster assignment is strict and cannot be undone, high time complexity, cannot work for a larger dataset||DIANA, AGNES, hclust etc.|
|Partitioning methods||Based on centroids and data points are assigned into a cluster based on its proximity to the cluster centroid||Easy to implement, faster processing, can work on larger data, easy to interpret the outputs||We need to specify the number of cenrtroids apriori, clusters that get created are of inconsistent sizes and densities, Effected by noise and outliers||k-means, k-medians, k-modes|
|Distribution-based Clustering||Based on the probability distribution of the data, clusters are derived from various metrics like mean, variance etc.||Number of clusters need not be specified apriori, works on real-time data, metrics are easy to understand and tune||Complex algorithm and slow, cannot be scaled to larger data||Gaussian Mixed Models, DBCLASD|
|Density-based Clustering (Model-based methods)||Based on density of the data points, also known as model based clustering||Can handle noise and outliers, need not specify number of clusters in the start, clusters that are created are highly homogenous, no restrictions on cluster shapes.||Complex algorithm and slow, cannot be scaled to larger data||DENCAST, DBSCAN|
|Fuzzy Clustering||Based on Partitioning Approach but data points can belong to more than one cluster||Can work on highly overlapped data, a higher rate of convergence||We need to specify the number of centroids apriori, Effected by noise and outliers, Slow algorithm and cannot be scaled||Fuzzy C Means, Rough k means|
|Constraint Based (Supervised Clustering)||Clustering is directed and controlled by user constraints||Creates a perfect decision boundary, can automatically determine the outcome classes based on constraints, future data can be classified based on the training boundaries||Overfitting, high level of misclassification errors, cannot be trained on larger datasets||Decision Trees, Random Forest, Gradient Boosting|
Learn about pratical implementation of clustering and many such powerful machine learning techniques with this comprehensive Machine Learning course by AnalytixLabs.