Given the wide range of business problems a Data Scientist has to solve, there is a dearth of informative resources that focus on specific ones. This article aims to answer the question of what classification is in machine learning and to address the various aspects of one of the most common business problems: classification.
Readers can expect to come away with a better understanding of the various aspects of machine learning classification. This includes the very meaning of the term classification and how it is understood in machine learning. The article also covers the most common classification algorithms, along with the misconceptions and problems commonly associated with them.
This is crucial, as selecting the wrong algorithm can be the root cause of various problems in the pursuit of a valid business solution. Lastly, no concept can be properly understood until a pragmatic view of it is taken. Thus, readers can go through a multitude of examples where classification models are developed to solve a range of problems.
A Brief About Classification in Machine Learning
| Business Problem | Description |
|---|---|
| Regression Problem | When there is a requirement to predict values that are numeric in nature (real, continuous, discrete, etc.), there is a need to quantify the impact of numerous variables on a numerical entity (also known as the dependent or Y variable). |
| Classification Problem | This business problem is somewhat similar to a regression problem. Here, too, there is a requirement to quantify the impact of variables (known as independent or X variables) on a dependent variable. The difference is that here the Y variable is categorical. Thus, a model is developed that, based on the X variables, assigns observations to predefined classes. |
| Forecasting Problem | When values are required to be predicted over a period of time and time itself acts as a predictor, the problem is known as a forecasting problem. |
| Segmentation Problem | There are situations where a bulk of data needs to be categorized, but there are no pre-existing classes that can be used to supervise a model. Here, the underlying patterns are detected and the observations are divided into different categories. These categories are then defined by understanding the characteristics of the observations found in each particular class. |
While all the above-mentioned business problems can be found in the industry, the most common one is classification. Businesses often require their output to be categorized into predefined classes, and this is where classification models come in handy. With the advancement of Machine Learning, numerous classification algorithms have come to light that are highly accurate, stable, and sophisticated. The creation of a typical classification model through machine learning can be understood in 3 easy steps-
Step 1: Have a large amount of data that is correctly labeled. This means we have a large dataset where, for each observation, we know what its “type”, “class”, or “category” is.
Step 2: Once the data is prepared, select one or more classification algorithms and apply them to (typically) the train/development dataset.
Step 3: Tune the hyper-parameters of these classification algorithms and select the algorithm (and its hyper-parameters) that provides the best result.
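These three steps can be sketched with scikit-learn; the synthetic dataset and the parameter grid below are purely illustrative choices, not a prescription:

```python
# A minimal sketch of the three steps on synthetic labeled data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Step 1: labeled data -- each row has features X and a known class y
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Step 2: split the data and fit a candidate classification algorithm
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 3: tune hyper-parameters and keep the best-performing setting
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8]}, cv=5)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)  # performance on held-out data
```

In practice, the held-out test score (not the training score) decides whether the chosen algorithm and settings are acceptable.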
As machine learning is a highly sophisticated field of study, it is of utmost importance for any Data Science aspirant to know how classifiers behave in a machine learning environment and how they form decisions throughout a model’s development stages.
AnalytixLabs is the premier Data Analytics Institute specializing in training individuals and corporates to gain industry-relevant knowledge of Data Science and its related aspects. It is led by a faculty of McKinsey, IIT, IIM, and FMS alumni who have a great level of practical expertise. Having been in the education sector for a long time and having a wide client base, AnalytixLabs greatly helps young aspirants build a Data Science career.
What is Classification?
Earlier, we touched upon what classification means in terms of machine learning. Let’s now understand what Classification is, along with its types, fundamentals, and basic properties.
The aim of classification is to determine which category an observation belongs to, which is done by understanding the relationship between the dependent variable and the independent variables. Here, the dependent variable is categorical, while the independent variables can be numerical or categorical. As there is a dependent variable that allows us to establish the relationship between the input variables and the categories, classification is a form of predictive modeling that works under a supervised learning setup.
Depending upon the nature of the dependent variable, different machine learning classification techniques can be distinguished. Of these, the most common are the following-
- Binary Classification
The most basic and commonly used form of classification is binary classification. Here, the dependent variable comprises two exclusive categories denoted by 1 and 0, hence the term Binary Classification. Often 1 means True and 0 means False. For example, if the business problem is whether a bank member was able to repay a loan and we have a feature/variable that says “Loan Defaulter,” then the response will either be 1 (True, i.e., loan defaulter) or 0 (False, i.e., non-defaulter). This form has often been the basis of various classification algorithms and is usually the first classification technique to be understood.
- Binomial Classification
This classification type is technically like Binary Classification, as the Y variable comprises two categories. However, these categories may not be in the form of True and False. For example, if we have a dataset with multiple features that denote pixel density and a Y variable with two categories – “Car” or “Bike” – this type of classification is known as Binomial Classification. From a practical point of view, especially as far as Machine Learning is concerned, there is no difference, as these two categories can be encoded as 0 and 1, making this type look like Binary Classification.
- Multi-class Classification
An advanced form of classification, multi-class classification, is when the Y variable comprises more than two classes/categories. Here each observation belongs to a class, and a classification algorithm has to establish the relationship between the input variables and those classes. Therefore, during prediction, each observation is assigned to a single exclusive class. For example, a business problem where there is a need to categorize whether an observation is “Safe,” “At-Risk,” or “Unsafe” would be a multi-class classification problem. Note – each observation can belong to only one class; multiple classes can’t be assigned to an observation. Thus, here an observation will be either “Safe,” “At-Risk,” or “Unsafe” and can’t be multiple things at once.
- Multi-label Classification
This form of classification is similar to multi-class classification. Here, the dependent variable has more than 2 categories; however, it differs from multi-class classification in that an observation can be mapped to more than one category. Therefore, the classification algorithm here has to understand which classes an observation can be related to and learn the patterns accordingly. A common use case of this type of classification problem is found in text-mining-related classification, where an observation (e.g., the text of a newspaper article) can have multiple categories in its corresponding dependent variable (such as “Politics,” “Name of Politicians Involved,” “Important Geographical Location,” etc.).
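As a hedged illustration of the multi-label idea, the toy snippet below (with invented tags) shows how such labels are typically binarized into one indicator column per class before a model is trained:

```python
# Multi-label targets: one article can carry several tags at once.
from sklearn.preprocessing import MultiLabelBinarizer

labels = [{"Politics"},
          {"Politics", "Geography"},
          {"Geography"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one 0/1 indicator column per tag
# mlb.classes_ lists the columns in sorted order
```

Each row of `Y` can contain several 1s, which is exactly what distinguishes multi-label from multi-class classification.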
Thus, classification can be of multiple types, and depending upon the business problem, we have to identify the right kind of classification technique and the algorithms involved in it.
There are numerous algorithms out there that can be used to solve classification problems in a machine learning setup. Given the wide use of machine learning, many of these algorithms provide great results and are deployed across domains. Since there are so many algorithms to choose from, it becomes important for a data scientist to have a basic idea of their function, advantages, disadvantages, and typical use cases. Before that, however, it is good to know how these algorithms can be categorized according to their learning mechanism.
While all the classification algorithms work under the supervised learning setup, they can be divided into two groups- Eager Learners and Lazy Learners.
- Eager Learners
This is where most classification algorithms lie, as this is the typical way of learning the relationship between the input and the target variable. Here (in a hold-out validation technique), patterns are detected in the train data, and a quantified relationship is established between the X and Y variables (often in the form of a mathematical equation, rules, a combination of weights, etc.), creating a classification model. This quantified relationship is then applied to the test dataset to come up with predictions. A typical trait of such learners is that model development takes a long time, while coming up with predictions takes less time. Algorithms that are Eager Learners include Logistic Regression, Decision Trees, Support Vector Classifier, Artificial Neural Networks, etc.
- Lazy Learners
They are the opposite of Eager Learners, and most distance- and case-based learners form this group. Here, unlike with Eager Learners, the training data is not directly used to establish a relationship between the X and Y variables and develop a model. Rather, the train data is simply stored, and classification (prediction of classes) is performed on the test data based on the most similar observations (and their labels) found in the train data. Quite the opposite of Eager Learners, the time consumed during the training phase is much shorter than the time taken to predict the classes. Typical algorithms that fall under this category include K-Nearest Neighbours (KNN), case-based learning (CBL) algorithms, etc.
This, however, doesn’t exhaust the ways in which the algorithms can be grouped. For example, algorithms can be grouped into linear and non-linear classifiers, or into binary classifiers and those with the inherent capability to identify multiple classes.
Common Classification Algorithms Include:
Logistic Regression

Logistic Regression is among the oldest ways of performing classification. This algorithm belongs to the family of Generalized Linear Models, where a logit function is used to transform the Y variable so that a linear model can be fit to come up with probabilities. In terms of discipline, Logistic Regression is a classical statistical model and uses traditional statistical concepts to come up with probabilities. There are several types of Logistic Regression algorithms, such as-
- Binary Logistic Regression
This algorithm is used when the Y variable is comprised of two categories.
- Multinomial Logistic Regression
When the Y variable comprises three or more categories, this version of logistic regression is used.
- Ordinal Logistic Regression
When the Y variable comprises ordinal categories, i.e., where the categories have some intrinsic order and can be arranged from a smaller to a larger value (i.e., can be ordered), this type of algorithm is used.
In theory, logistic regression is best used when there is a requirement to solve a binary classification problem, or when a high level of interpretability is required regarding the role of each variable in determining the class.
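A minimal, illustrative sketch of binary logistic regression on synthetic data (none of the names below come from a specific business case); the fitted coefficients are what make the per-variable interpretation mentioned above possible:

```python
# Binary logistic regression: probabilities plus interpretable weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:1])  # class probabilities, not hard labels
weights = model.coef_[0]            # one coefficient per input feature;
                                    # sign/magnitude show each feature's pull
```

Because each feature gets exactly one coefficient, a stakeholder can see which variables push an observation toward class 1 and by how much.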
Decision Tree

This classic rule-based algorithm is often used to perform classification. Being a non-statistical algorithm (in the sense that no distribution is involved in coming up with predictions), it does not have to fulfill the many assumptions that statistical algorithms require. The Decision Tree algorithm splits the attributes to come up with rules that are able to differentiate between the classes with maximum efficiency. To have an optimal model, the user is often required to tweak parameters such as the depth of the tree, the minimum number of observations in each node, etc., making development a bit tricky. Still, it is one of the easiest algorithms to implement. It is especially significant when the process of classification is to be understood visually, as the splitting of attributes can be visualized and analyzed. All these things make this algorithm highly successful, accessible, and interpretable.
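The tweakable parameters mentioned above can be sketched as follows, using scikit-learn’s DecisionTreeClassifier on the built-in iris data purely as an illustrative stand-in:

```python
# Decision tree with the commonly tuned parameters, plus a text view
# of the learned splits (the "visual" interpretability noted above).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(
    max_depth=3,          # depth of the tree
    min_samples_leaf=5,   # minimum observations in each leaf node
    random_state=0)
tree.fit(iris.data, iris.target)

# The split rules can be printed and inspected directly
rules = export_text(tree, feature_names=list(iris.feature_names))
```

Printing `rules` shows the if/else thresholds the tree learned, which is what makes this algorithm so easy to explain to non-technical stakeholders.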
Support Vector Classifier
Among the most successful classifiers, the Support Vector Machine algorithm, with its impeccable accuracy and sophisticated way of predicting classes, was for a time able to halt the development of other advanced algorithms such as Artificial Neural Networks. Originally a binary classifier, SVC (Support Vector Classifier) started as a linear classifier that maximized the margin between the two classes to stabilize the model. Later developments enabled SVC to solve non-linear and multi-class problems through the innovative use of kernels and other concepts. Today, SVM is often found to overpower other classification algorithms; however, it lacks interpretability and is sensitive to parameters such as the margin penalty, the chosen kernel, the value of gamma, etc.
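As a hedged sketch, the snippet below fits an RBF-kernel SVC on a synthetic non-linear dataset and names the sensitivity-inducing parameters mentioned above (kernel, gamma, and the margin penalty C); the values chosen are defaults, not recommendations:

```python
# Non-linear classification via the kernel trick.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Interleaved half-moons: not separable by any straight line
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf = SVC(kernel="rbf",     # the kernel choice enables non-linearity
          C=1.0,            # margin penalty: small C = wider, softer margin
          gamma="scale")    # kernel width; a key sensitivity of SVMs
clf.fit(X, y)
train_acc = clf.score(X, y)
```

Changing any of the three parameters can swing performance substantially, which is exactly the sensitivity the text warns about.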
Naïve Bayes

This statistical classifier, rather than using classical (aka frequentist) statistics, as Logistic Regression does when it uses a distribution and fits a sigmoid curve to come up with probabilities, uses Bayes’ theorem to predict classes. Here too, probabilities are predicted for every class; however, these probabilities are computed by taking into account concepts such as likelihood, evidence, and class prior probability. The problem with NB (Naïve Bayes) is that it is based upon a “naïve” assumption: that all the input features are independent of each other and do not influence each other. On the other hand, its most significant advantage is that it can perform classification on datasets of very high dimensions (i.e., having a lot of features). This is why it is the most beloved algorithm for dealing with text-based classification problems. There are mainly three types of Naïve Bayes algorithms-
- Gaussian NB – It is used when all the features are numerical and normally distributed.
- Bernoulli NB – This type is used when all the input features are binary in nature. This is often the case when the input data is a text where each feature is a word, and 0 signifies the absence, and 1 signifies the presence of that word.
- Multinomial NB – When the input features are comprised of (discrete) counts, i.e., there can be more than 2 values, multinomial NB is used. A typical use of this algorithm is when the input data is text where each feature is a word and the values are the number of times the word was present in that document (unlike the previous example, where we simply noted whether the word was present or not, making the features binary and calling for the Bernoulli variant).
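The three variants can be illustrated side by side on a tiny invented dataset; the counts below are hypothetical, chosen only so each variant receives the input type it expects:

```python
# Three Naive Bayes variants, each matched to its input type.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])               # two classes, two samples each

X_counts = np.array([[3, 0, 1],          # word counts per document
                     [2, 1, 0],          # -> Multinomial NB
                     [0, 2, 2],
                     [0, 3, 1]])
X_binary = (X_counts > 0).astype(int)    # word present/absent -> Bernoulli NB
X_real = np.array([[1.0], [1.2],         # continuous measurements
                   [3.0], [3.3]])        # -> Gaussian NB

pred_m = MultinomialNB().fit(X_counts, y).predict(X_counts)
pred_b = BernoulliNB().fit(X_binary, y).predict(X_binary)
pred_g = GaussianNB().fit(X_real, y).predict(X_real)
```

On this deliberately separable toy data, all three variants recover the training labels; the point is only which representation each one consumes.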
K Nearest Neighbor
Considered one of the lazy learners, K Nearest Neighbor is a distance-based classification algorithm. KNN (K Nearest Neighbor) is used in various classification-based problems that require detecting patterns underlying data, anomaly detection, etc. Also, unlike logistic regression, it is a non-parametric algorithm, i.e., it doesn’t use statistics and is not concerned with the distribution of the data. KNN’s working can be understood in the following steps-
- KNN stores all the training data
- Upon receiving the testing data, it matches each test observation against all the train data observations to find which observations are most similar to it and assigns the related class.
- The user is required to tweak parameters such as the value of k (how many similar train observations are considered to predict a test observation’s class), among other things such as the distance metric, voting mechanism, etc.
While KNN is an easy-to-understand algorithm and can solve non-linear problems, it has the major disadvantages of being highly dependent upon the value of k and taking a long time in the testing phase.
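The steps above can be sketched as follows; the synthetic data, k = 5, and the Euclidean metric are all arbitrary illustrative choices:

```python
# KNN: "fitting" merely stores the data; the work happens at prediction.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

knn = KNeighborsClassifier(
    n_neighbors=5,          # the k: how many neighbours get a vote
    metric="euclidean")     # the distance metric being tuned
knn.fit(X_train, y_train)   # lazy learner: this just stores the train data
test_acc = knn.score(X_test, y_test)  # neighbour search happens here
```

This is the lazy-learner trade-off in miniature: the fit call is nearly instant, while every prediction must search the stored training set.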
Ensemble Classifiers

Several algorithms such as Bagging, Random Forest, AdaBoost, and Gradient Boosting are considered Ensemble Classifiers. When, to come up with the predicted classes, we use not one but more than one model, such classifiers are known as Ensemble Classifiers. These classifiers can be of two types.
- Homogeneous: All the algorithms are the same but are trained on different versions of the train data (i.e., never on the complete data) so that the chances of overfitting can be reduced.
- Heterogeneous: Different algorithms work in tandem, and results from different models are combined to provide a single result.
The advantage of ensemble methods is that they are highly accurate and mitigate overfitting. At the same time, the issues include the user having to deal with a large number of hyper-parameters that, if not set properly, can severely compromise the functioning of the model.
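As one illustrative homogeneous ensemble, a Random Forest fits many decision trees, each on a bootstrap sample of the train data, and lets them vote (the synthetic data and parameter values below are for illustration only):

```python
# Homogeneous ensemble: many trees, each on a different bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=3)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees voting on each prediction
    max_depth=None,     # one of the many hyper-parameters to tune
    random_state=3)
forest.fit(X, y)

n_trees = len(forest.estimators_)  # the individual fitted trees
```

Because each tree sees a different resampled version of the train data, their combined vote is more stable than any single tree, which is the overfitting-reduction argument made above.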
Artificial Neural Networks
Among the most prominent deep learning approaches, artificial neural networks can solve highly complicated classification problems, especially multi-class non-linear ones. ANN (Artificial Neural Networks) is used when dealing with problems such as image classification or problems related to multimedia and other forms of non-structured data. ANN can create highly sophisticated, robust, and reliable classification models, as it learns from labeled examples through a training process. The algorithm is meant to mimic the learning procedure of human beings. The major problem with ANN is that it is a challenging model to create: it requires a deep understanding of neural networks’ inner workings and has many parameters to take care of, or the user ends up with a dysfunctional model.
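A minimal sketch, using scikit-learn’s MLPClassifier as a small stand-in for a full deep-learning framework; the layer sizes and iteration count are arbitrary illustrative choices:

```python
# A small feed-forward neural network fitted on labeled examples.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=4)

net = MLPClassifier(
    hidden_layer_sizes=(16, 8),  # two hidden layers; one of many knobs
    max_iter=1000,               # training iterations for weight updates
    random_state=4)
net.fit(X, y)                    # weights learned from the labeled data
train_acc = net.score(X, y)
```

Even this tiny network exposes several interacting parameters (layer sizes, learning rate, iterations), hinting at why full-scale ANNs are hard to configure well.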
Machine Learning Classifiers
Machine Learning classifiers work in a specific manner. As mentioned earlier, they require the data, which is often a structured dataset (i.e., tabular data), to be split into train and test datasets. As machine learning relies on taking a large dataset and learning the relationship between the independent and dependent features, predicting classes for a new dataset can run into a major problem: classifiers, if not monitored and controlled, can end up memorizing all the patterns found in the train data, leading to a classification model that provides very high accuracy in the training phase but fails in the test phase. To solve this problem, advanced validation methods such as k-fold cross-validation, leave-one-out cross-validation, and bootstrap methodologies are used.
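The k-fold idea mentioned above can be sketched as follows (synthetic data; 5 folds is just a common illustrative choice):

```python
# k-fold cross-validation: each fold takes a turn as held-out data,
# guarding against a model that merely memorized the train set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=250, n_features=6, random_state=5)

# 5 folds -> 5 accuracy scores, one per held-out slice of the data
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
mean_score = scores.mean()  # a steadier estimate than a single split
```

If the per-fold scores vary wildly or sit far below the training accuracy, that is a sign the model is memorizing rather than generalizing.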
Apart from this, typical machine learning classifiers can behave differently depending upon the hyper-parameters, which act as the controlling knobs of a model. If not understood properly and set accordingly, these parameters can severely degrade the performance of any machine learning classifier.
Once the parameters are set, evaluating these models is also difficult. Often, most algorithms don’t directly output a predicted class but rather the probability of an observation belonging to a particular class. This leads to the concept of deciding a threshold value through which classes are created. As different threshold values can produce different classes, concepts such as ROC curves and Decile Analysis/KS Statistic come into play to find the threshold value that provides the most accurate predicted classes. Also, several evaluation tools, such as the confusion matrix, allow us to calculate accuracy metrics such as precision, sensitivity, specificity, etc., making it possible to find the most accurate classification model.
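A hedged sketch of the threshold idea: predicted probabilities are cut at a chosen threshold (0.5 below, purely as a default) and the resulting classes are summarised in a confusion matrix:

```python
# From probabilities to classes: the threshold is a modelling choice.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score

X, y = make_classification(n_samples=300, n_features=5, random_state=6)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class 1) for each observation
pred = (proba >= 0.5).astype(int)      # 0.5 is one possible cut-off;
                                       # ROC/KS analysis would tune it
cm = confusion_matrix(y, pred)         # rows: actual, columns: predicted
precision = precision_score(y, pred)   # one of several derived metrics
```

Sliding the threshold up or down re-trades precision against sensitivity, which is why the threshold itself is part of model evaluation.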
All these factors are to be taken into account when developing models that perform classification in a machine learning environment. This is why an aspiring data scientist must learn these concepts before developing a classification model.
As mentioned earlier, the most common business problem is classification. Most businesses require their data to be classified into various pre-defined categories. The following are the most common classification use cases-
- Email Classification
Among the earliest and most traditional uses of classification, emails often need to be classified as spam or not spam (i.e., worthy of being in the inbox). This is done by developing a classifier model that goes through the content of the email and identifies whether it is spam. To develop such a classification model, a large amount of well-labeled email is required so that enough data is available in the training phase and the model doesn’t underfit. As this is inherently a text-based problem, the email contents are converted into features where each feature is a word, and the value is whether the word was present (or how many times it was present) in that document. While traditional email classifiers dealt with a binary problem (Spam or Not Spam), modern-day email classifiers also identify the email’s topic/nature, such as social, advertisement, notifications, etc. Similar models are increasingly being used in messaging applications, which classify SMS messages into similar categories to improve the user experience, sparing users the tedious task of deleting unnecessary messages.
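A toy sketch of the spam-classifier idea described above: text is converted into word-count features and a classifier is fitted on labeled examples. The four “emails” below are invented purely for illustration; a real system would need a large labeled corpus:

```python
# Tiny spam classifier: words become count features, labels supervise.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now",
          "free money win win",
          "meeting agenda attached",
          "project status report attached"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

spam_clf = make_pipeline(CountVectorizer(),  # each word -> a count feature
                         MultinomialNB())    # counts suit Multinomial NB
spam_clf.fit(emails, labels)

pred = spam_clf.predict(["free prize money"])  # classify a new message
```

The pipeline keeps the vectorizer and the classifier together, so new messages go through exactly the same word-to-feature conversion as the training emails.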
- Image Classification
A business problem that haunted industries and challenged researchers for a long time, image classification is finally possible today. Algorithms such as Support Vector Machines and Artificial Neural Networks have made it possible to classify an image with unprecedented accuracy. In theory, the most basic form can be differentiating cats from dogs or identifying numbers from images. However, the real use cases of image classification are rather varied. Common image classification use cases include-
- Image classification is used extensively in the automobile industry, especially by companies trying to create self-driving cars.
- Security agencies are increasingly using image classification to identify culprits, while home security systems rely heavily on image classification to raise alarms in case of an intrusion.
- Image classification’s lifesaving application has been in healthcare, where it has enabled early detection of diseases and has helped develop robots that use computer vision to perform complicated surgeries.
- Anomaly / Fraud Detection
This use case of classification is often found in the Banking, Financial Services, and Insurance (BFSI) domains. These domains handle many transactions, and finding fraudulent ones is of utmost importance, so sophisticated classification algorithms are created and implemented to work on a real-time basis. For example, a fraud detection model may flag a transaction as fraud based on an unusual location, purchased product, transaction amount or time, etc. Similar concepts can be applied to find plausible loan defaulters and other non-repayers.
Classification is a fascinating aspect of data science. While there are many algorithms available that solve various business problems, a large number of them belong to the field of classification. For any Data Scientist, learning classification is of utmost importance, and to properly learn it, aspirants must focus on all its dimensions: the business problems that can be solved through classification, the inner workings of the algorithms, and the evaluation and validation mechanisms of a classification model.
What is Classification in Machine Learning, for example?
A typical fundamental question concerns the functionality of classification in machine learning. In essence, a classification model, based on some labeled data, identifies the relationship between the input features and their corresponding classes. The model is then tested by applying it to new data and checking whether the predicted classes match the original classes, which leads to the concepts of model evaluation and validation. For example, a bank needs to identify whether a loan applicant will default or not, i.e., based on the applicant’s credentials, it is required to find whether the person will repay the loan.
This leads to a classification problem where, say, the past 10 years’ historical data is available with the applicants’ information and final repayment status. This data is then divided into train and test sets; on the train data, a classification algorithm (e.g., logistic regression) is used to develop the model, which is then applied to the test data to validate the model’s performance. If the accuracy of the model turns out to be acceptable, the model is sent to production. If not, the model is changed (by tweaking the hyper-parameters or replacing the algorithm altogether).
What are the Classification Techniques?
There are numerous classification techniques. These include binary classification, binomial classification, multi-class classification, and multi-label classification. (See above for a detailed understanding of each of these classification techniques.)
What is Classification and Regression in Machine Learning?
Classification and Regression are similar in that both, working in a machine learning setup, try to identify the pattern between the input variables (X variables) and the target variable (Y variable). The difference is that in classification the Y variable is categorical, i.e., comprised of classes, whereas in regression the Y variable is a continuous numerical variable. This difference impacts the functioning of the algorithms involved and the methods for evaluating the model.
This article aimed to provide the reader with an understanding of Classification, its role in Data Science, the various techniques and algorithms related to it, and the common use cases. If you have any opinions or queries related to this article, feel free to post them and help us gain more insights.