One of the biggest revolutions of the past 20-odd years has been the massive and unprecedented growth in the processing capabilities of computers, driven by advances in hardware. The eventual result was a humongous amount of data being generated on a daily basis. As the size of the data increased, so did the number of methods for handling it. Today, data and its related problems are business- and domain-specific. The most common way of solving data-driven problems is to apply Machine Learning and Artificial Intelligence to gain meaningful insights, patterns and predictions.
However, just as there is a range of business problems, there is also a range of algorithms that can be used. This makes it important to understand the relationship between the algorithms and the various kinds of business problems, as that understanding helps in choosing the right algorithm and leads to better results.
The question ‘Which algorithm should be used?’ can be answered by first answering a few preliminary questions:
a) What is the business problem?
b) Is the business objective Operational or Strategic?
c) How will the model be implemented to solve the business problem?
d) Does the algorithm’s capability match the requirements of the business problem?
Once these questions are answered, it becomes easy to narrow down to the algorithm best suited to a particular business problem.
The Bag of Algorithms
Today there are hundreds of machine learning and deep learning algorithms, each with its own approach to solving problems that range from optimization to classification to segmentation. To answer the questions that determine the algorithm of choice, the prerequisite is to know the most commonly used algorithms and the categories they fall into.
First the categories can be understood, and then algorithms can be assigned to them. The algorithms we deal with on a day-to-day basis can be divided in terms of the Business Problem, the Learning Setup, the Business Objective and the Implementation Technique.
When it comes to business problems, the problem can be one of the following:
a) Regression (continuous numeric values are to be predicted)
b) Classification (predicting certain predetermined classes/categories)
c) Segmentation (grouping data into previously undefined groups)
d) Forecasting (Predicting values over time)
e) Optimization (Optimizing values based on some constraint)
The algorithms work in various kinds of setups:
a) Supervised Learning Setup (The Y aka target variable is available)
b) Unsupervised Learning Setup (The Y aka target variable is unavailable)
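c) Semi-Supervised Learning Setup (a combination of supervised and unsupervised, where only part of the data has a target variable)
d) Reinforcement Learning Setup (learning is guided by feedback, as in supervised learning, but the feedback takes the form of rewards and punishments rather than a labeled target variable)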
The business objectives can be mainly of two types:
a) Strategic (Long to Mid Term objectives)
b) Operational (Short term objectives)
The various kinds of algorithms are deployed using a range of techniques:
a) Traditional aka Classic Statistical Techniques (Algorithms that work purely on the fundamental concepts of statistics)
b) Machine Learning Techniques (Algorithms that work using the concept of self-learning)
c) Deep Learning (Algorithms that use the neural network architecture in order to function)
A multitude of algorithms can be easily categorized using the above groups. Once the categories are clear, it becomes easier to answer the questions that help in choosing the right algorithm for the problem at hand.
Characteristics of Algorithms
Different kinds of algorithms approach a problem in inherently different ways, and this shows in their characteristics. The characteristics of an algorithm can broadly be understood by focusing on the following key areas:
a) Accuracy: Sophisticated machine learning and deep learning algorithms generally achieve a high level of accuracy. Traditional techniques, on the other hand, tend to hit a ceiling, but they provide decent accuracy and don’t fail outright, which other algorithms sometimes do.
b) Interpretability: It is sometimes important to know how a model arrives at the accuracy it achieves. Algorithms that provide high accuracy are often so complex that their internal working is like a ‘black box’, which makes them less interpretable. It becomes difficult to understand what the model is doing because the calculations are lengthy and complex.
c) Type: Models can be either parametric or non-parametric. Parametric models generally involve a mathematical equation with a fixed set of parameters, and they function properly only under specific circumstances. In other words, there is a set of strict assumptions to be fulfilled, for example that the data belongs to a particular family of distributions, under which the algorithm estimates the line of best fit and makes predictions.
Other models can be categorized as non-parametric. These require fewer assumptions to be fulfilled and do not rely on a fixed mathematical equation in order to function. These algorithms can be rule-based, distance-based, probabilistic etc.
d) Data Size Handling Capabilities: Some algorithms can handle data in very high dimensions, i.e. data with a large number of features. Similarly, certain algorithms are better suited to big data, i.e. data with a very large number of rows.
e) Resistance to Multicollinearity: Some algorithms, especially tree-based ones, are much more resistant to multicollinearity than, say, traditional models.
f) Susceptibility to Outliers: Parametric algorithms are generally more vulnerable to outliers, while others, such as tree-based algorithms, are far less affected.
g) Assumptions: Some algorithms have a set of predetermined assumptions that must be satisfied for them to function properly, while others, generally non-parametric algorithms, have no or less rigid assumption requirements.
h) Training and Testing Phase: The time taken during the training and testing phases differs. Algorithms that depend on a lot of small, isolated calculations made at prediction time take a lot of time in the testing phase, while algorithms that involve mathematical equations take more time in the training phase.
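The parametric versus non-parametric distinction in (c) can be illustrated with a minimal sketch. It assumes a toy data set where y = x², and the helper names fit_line and knn_predict are hypothetical. The parametric model commits to the equation y = slope * x + intercept, while the non-parametric one keeps the data and averages the nearest neighbors at prediction time.

```python
def fit_line(xs, ys):
    """Parametric: estimate the two parameters of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, query, k):
    """Non-parametric: no equation, just average the k closest training targets."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Toy data (an assumption for illustration): a clearly non-linear relationship.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
line_pred = slope * 2.5 + intercept       # the line's linearity assumption is violated here
knn_pred = knn_predict(xs, ys, 2.5, k=2)  # the true value at 2.5 is 6.25

print("line:", line_pred, "knn:", knn_pred)
```

On this data the straight line cannot follow the curve, because its assumption of linearity does not hold, while the neighbor average stays close to the true value. The price is that the neighbor search has to be repeated for every prediction, which is the testing-phase cost described in (h).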
Pros and Cons of Algorithms
The bulk of algorithms lie in the predictive modeling domain. Once the characteristics of such models are understood, the pros and cons of each algorithm can be assessed on the basis of those characteristics.
a) Linear / Logistic Regression
Both Linear and Logistic Regression are parametric methods and statistically sound techniques. Compared to other methods (predominantly Machine Learning methods) they are usually not as accurate, but they provide decent accuracy consistently. Their most important advantage is interpretability: they produce a coefficient for each feature, which helps in understanding the key drivers. They are also efficient, as they take little training and testing time. Their problems include a peculiar set of assumptions that must be fulfilled for them to work properly, such as normally distributed residuals (for Linear Regression) and data free of multicollinearity and heteroscedasticity. As these methods fit a line of best fit, they are also highly sensitive to outliers and missing values. If the independent variables are not linearly related to the dependent variable, the usefulness of these algorithms is limited. With Logistic Regression, the limitations include the inability to capture complex relationships between the dependent and independent variables.
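As a rough illustration of both the interpretability and the outlier sensitivity described above, the sketch below fits a one-feature least-squares line twice: once on clean data and once with a single corrupted observation. The data, the variable names and the helper ols_fit are all made up for the example.

```python
from statistics import mean


def ols_fit(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# Hypothetical data: sales are exactly twice the ad spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
clean_slope, clean_intercept = ols_fit(ad_spend, sales)

# The same data with one corrupted observation (10 recorded as 50).
sales_with_outlier = [2, 4, 6, 8, 50]
outlier_slope, outlier_intercept = ols_fit(ad_spend, sales_with_outlier)

print("clean fit:   slope", clean_slope, "intercept", clean_intercept)
print("outlier fit: slope", outlier_slope, "intercept", outlier_intercept)
```

The slope reads directly as a driver ("one more unit of ad spend adds this many units of sales"), which is the interpretability advantage. The same property makes the damage visible: a single bad row out of five moves the slope from 2 to 10.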
b) Support Vector Machines
SVM is one of the most important and powerful algorithms for classification and can also be used for regression problems. It was so powerful that it slowed the growth of artificial neural networks for a while, as it often outperformed them. Because of its ability to perform kernel transformations, SVM can deal quite effectively with data in high dimensions, which makes it useful for classifying data where the number of features is very high. SVM also comes with its own set of disadvantages. It is not a very good algorithm for regression or multi-class classification problems; its real power lies in binary classification. It can deal with data in high dimensions, but training takes a lot of time, so it is not a very efficient algorithm. Just like some other machine learning algorithms, its performance depends heavily on the selection of certain hyper-parameters and of the kernel function (if one is used). And like Linear Regression, it fits a decision boundary, so it is also sensitive to outliers and can overfit easily.
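The idea behind a kernel transformation can be hinted at with a very small sketch. This is not an SVM implementation. It applies an explicit feature map (x to x²), which stands in for what a kernel does implicitly, and checks whether a single threshold can separate the classes before and after the mapping. The data and the helper threshold_separable are assumptions made for illustration.

```python
def threshold_separable(values, labels):
    """True if one threshold puts every 0-label on one side and every 1-label on the other."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# Toy 1-D data: class 1 sits on both sides of class 0, so no single cut-off works.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [1, 0, 0, 1]

raw_ok = threshold_separable(xs, labels)
mapped_ok = threshold_separable([x ** 2 for x in xs], labels)

print("separable on x:  ", raw_ok)
print("separable on x^2:", mapped_ok)
```

After the mapping, the two classes fall on opposite sides of a simple cut-off. A kernel SVM obtains the same effect without computing the mapped features explicitly, which is why the choice of kernel function matters so much for its performance.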
c) K Nearest Neighbor
One of the simplest algorithms to understand and implement, KNN is widely used for its simplicity. Being a non-parametric method, it has no issues with assumptions, and the data is not required to follow a particular statistical distribution for KNN to work properly. The time taken during the training phase is not very high, as no major calculation takes place in this phase. Unlike Naive Bayes and Support Vector Machines, which are mainly used for classification, KNN can be used for both Regression and Classification problems with a similar level of expected accuracy. As it looks for similar observations to come up with predictions, it is also widely used as a missing value treatment method. Some aspects of KNN are both a boon and a curse, such as the fact that there are few hyper-parameters to tune even though it is a machine learning algorithm. The only major hyper-parameter is the value of K (the number of nearest neighbors considered), but this also causes the major problem: its performance is highly dependent on the value of K. The other hyper-parameter is the distance metric, and here again there is a huge range of metrics to choose from, with the performance of the algorithm directly related to the choice. KNN also struggles with high-dimensional data. It is extremely slow in the testing phase because it calculates the distance from the unknown point to every training point, so predictions take a long time, and the problem becomes more evident when the data has many features. The disadvantages also include the inability to classify properly when there is a class imbalance, the requirement to scale all features (as is the case with all distance-based algorithms) and vulnerability to multicollinearity, outliers and missing values.
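The need to scale features can be shown with a minimal KNN classifier written from scratch. The sketch below assumes a hypothetical data set with two features on very different scales (salary and age), where the class actually depends on age. The helper names knn_classify and min_max_scaler are illustrative and not taken from any library.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def min_max_scaler(points):
    """Return a function that rescales each feature to [0, 1] (assumes every feature varies)."""
    lows = [min(column) for column in zip(*points)]
    highs = [max(column) for column in zip(*points)]

    def scale(point):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

    return scale


# Hypothetical data: (salary, age) with a label that depends on age only.
train = [
    ((30000, 25), "young"),
    ((31000, 60), "senior"),
    ((90000, 26), "young"),
    ((91000, 61), "senior"),
]
query = (30900, 27)

unscaled = knn_classify(train, query, k=1)

scale = min_max_scaler([features for features, _ in train])
scaled_train = [(scale(features), label) for features, label in train]
scaled = knn_classify(scaled_train, scale(query), k=1)

print("without scaling:", unscaled)
print("with scaling:   ", scaled)
```

Without scaling, the salary column dominates the Euclidean distance and the 27-year-old is matched to a 60-year-old with a similar salary. After min-max scaling, both features contribute comparably and the nearest neighbor is the 25-year-old.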
d) Naïve Bayes
Naive Bayes is a probabilistic method that is able to perform a large number of probability-based calculations in a very short period of time to come up with a prediction. This makes it very efficient: it takes little time in the training and testing phases and can deal with data in very high dimensions. Because of these features it is among the most widely accepted algorithms for text-related tasks such as Text Classification, and it can perform multi-class classification without much trouble. Not to be taken lightly, the ease of understanding this algorithm is also one of its biggest advantages, as it makes it easier to diagnose and fix the problem when predictions are poor. Among the prominent disadvantages of Naive Bayes is that it only solves classification problems, not regression problems. Its assumption that all features are independent of one another is very difficult to fulfill with real-world data. The algorithm often performs well nonetheless, but the violated assumption can cause it to break at times.
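A small from-scratch sketch shows why Naive Bayes is quick to train and handles multi-class text problems naturally: training is just counting words per class. The sketch assumes a multinomial model with add-one (Laplace) smoothing, and the six training sentences and the names train_nb and predict_nb are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Count documents per class and words per class; that is all the training there is."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical support tickets with three classes.
docs = [
    ("refund my payment", "billing"),
    ("invoice payment failed", "billing"),
    ("app crashes on login", "technical"),
    ("login error after update", "technical"),
    ("love the new design", "feedback"),
    ("great update love it", "feedback"),
]
model = train_nb(docs)

print(predict_nb(model, "payment invoice problem"))
print(predict_nb(model, "login crashes"))
print(predict_nb(model, "love the update"))
```

Each word contributes to the score independently of the others, which is the "naive" independence assumption in action. It is rarely true of real text, but it is what keeps both training and prediction this cheap.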
e) Decision Trees
Among the most widely used algorithms, Decision Trees can be used for both Regression and Classification. The Decision Tree has a long list of advantages, and the most important is that, being a rule-based algorithm, it does not require any assumptions to be fulfilled. It has a decent level of accuracy, and its high level of interpretability is its most important strength. A Decision Tree is so interpretable that its predictions can be visualized (in the form of a tree diagram), which helps especially when there is a need to convey the process through which the algorithm arrives at a particular prediction. The advantages don’t end here: because of its inner workings, it takes little time during the testing phase, is not very vulnerable to outliers or missing values, and has no issues with data that has multicollinearity. Despite all these advantages, a good number of disadvantages persist, such as the time taken during the training phase, which is a bit high and grows considerably as the size of the data increases. The major problem with Decision Trees, however, is their habit of overfitting the data, which is the reason methods such as Random Forest were developed.
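The rule-based nature of a tree can be sketched with its smallest building block: the search for a single best split. The code below assumes Gini impurity as the split criterion (one common choice) and uses invented data with the features (income in thousands, age). A full tree would repeat this search recursively on each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (feature index, threshold, weighted Gini) of the best single split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical data: (income in thousands, age) and whether the customer bought.
rows = [(20, 25), (25, 40), (30, 35), (60, 30), (70, 45), (80, 28)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

split = best_split(rows, labels)
print("rule: feature", split[0], "<=", split[1], "with weighted Gini", split[2])

# Replace one income with an extreme value: the chosen rule does not move.
rows_with_outlier = rows[:-1] + [(8000, 28)]
split_with_outlier = best_split(rows_with_outlier, labels)
print("with outlier:", split_with_outlier)
```

The result reads as a plain rule ("income <= 30 means no"), which is where the interpretability comes from. Because only the ordering of values matters for a split, the extreme income leaves the rule unchanged, in contrast to the line-fitting methods.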
f) Random Forest
Random Forest is a special case of Bagging in which a number of samples are created using Bootstrapping and Random Subspaces (random subsets of the features), a decision tree is fit on each sample, and the predictions of the individual trees are combined. This smooths the otherwise complex decision boundary of a single tree and also helps address the problem of multicollinearity. As it uses Decision Trees as the base algorithm, it has almost the same set of advantages (including the lack of assumption requirements, little need for data preparation and a decent level of accuracy), but it does not share the same disadvantages, as it tries to solve the problem of overfitting. The problems of Random Forest are different from those of Decision Trees: it can become extremely complex, which makes it computationally expensive and time consuming. Also, the gain in accuracy diminishes as more bags (samples) are added, so tuning hyper-parameters such as the depth of the trees and the number of bags becomes crucial. The biggest loss with Random Forest is interpretability. It works as a black box, so this advantage of the Decision Tree is not present in Random Forest; ensemble learning methods are in general much less interpretable.
g) Bagging / Boosting
Bagging and Boosting are both ensemble methods. The aim of both is to produce a model that generalizes better than a single model, with Bagging in particular addressing the problem of overfitting. Bagging is a parallel process, while Boosting is a sequential method. Both methods can use any base algorithm, but the most common one is the decision tree. As a result they are able to solve both classification and regression problems. One advantage of such methods is that they don’t require a lot of data to come up with good predictions and can perform decently even with limited data. High accuracy, no assumption requirements and low vulnerability to outliers, missing values and multicollinearity make them very good algorithms to work with.
The problem with these methods, again, is the large number of hyper-parameters to tune, the time taken during the training phase and, most importantly, extremely low interpretability.
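The mechanics of Bagging can be sketched in a few lines: draw bootstrap samples, fit one simple model on each, and let the models vote. The sketch below covers Bagging only. Boosting would instead fit the models one after another, each focusing on the previous one's mistakes. It assumes a one-feature toy data set and a one-split tree ("stump") as the base model, and fit_stump, bagging_fit and bagging_predict are hypothetical helper names.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs and return it as a predict function."""
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        if not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the sample contains a single distinct x value
        return lambda x: majority
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(data, n_models, seed=0):
    """Train one stump per bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(fit_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: values 1..10 are "no", values 11..20 are "yes".
data = [(x, "no") for x in range(1, 11)] + [(x, "yes") for x in range(11, 21)]
models = bagging_fit(data, n_models=25)

print(bagging_predict(models, 0), bagging_predict(models, 25))
```

Each stump on its own is still a readable rule, but the final answer is a vote over 25 of them. That is a small-scale view of why ensembles gain stability and lose interpretability, and why the number of bags becomes a hyper-parameter to tune.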
h) Artificial Neural Networks
ANN is a sophisticated deep learning algorithm that typically works under a supervised learning setup. It is able to provide a high level of accuracy and can solve both Regression and Classification problems. It doesn’t require any assumptions in order to work properly and is able to handle data in large quantities. One of the issues with ANN is that it takes a lot of time during the training phase, as it has to converge to the best values of the weights and biases. However, once the network is trained, the time taken for predictions (testing phase) is very low, which makes it an attractive algorithm to work with. Being a deep learning algorithm, ANN can be deployed on problems that involve various kinds of data, including multimedia (audio, images etc.). It is also relatively robust to outliers, missing values and multicollinearity. However, there are disadvantages: ANN is a very complex algorithm to understand and implement, and it works well only when there are a lot of data points. It is also computationally expensive compared to other algorithms that provide a similar level of accuracy (e.g. SVM). There is also a possibility of overfitting, but the most crucial disadvantage is its extremely low interpretability: it works like a black box, making it very difficult to answer questions such as which are the most important drivers (the features that drive the dependent variable).
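The contrast between slow training and fast prediction can be seen even in the smallest possible network. The sketch below trains a single sigmoid neuron with batch gradient descent on the OR gate. It is an assumption-laden toy and not a multi-layer ANN: the learning rate, the number of epochs and the helper names are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=2000, lr=1.0):
    """Batch gradient descent on log loss for one sigmoid neuron."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):  # training: many passes over the labeled data
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            for i, v in enumerate(x):
                grad_w[i] += (p - y) * v
            grad_b += p - y
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Prediction: a single weighted sum and a threshold."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


# Labeled examples of the OR gate (the target is known, i.e. a supervised setup).
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_gate)
predictions = [predict(weights, bias, x) for x, _ in or_gate]

print("weights:", weights, "bias:", bias)
print("predictions:", predictions)
```

Training loops over the data thousands of times to converge on the weights and bias, while a prediction is one weighted sum. A real network stacks many such neurons in layers, which multiplies both the training cost and the difficulty of explaining what any single weight means.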
Answering the 4 Questions
With an understanding of the various algorithms in place, the four questions can now be asked to determine the best algorithm for the problem at hand.
Q1. What is the business problem?
The most basic step is to identify the business problem at hand and convert it into a statistical problem in order to attain a solution. Various kinds of business problems can be converted in this way. A business problem that requires predicting precise numeric values can be categorized as a Regression problem, and algorithms such as Linear Regression, Decision Tree Regressor, Random Forest, Bagging and Boosting can be deployed. If the quantity of data is large and a high level of accuracy is required, then ANN is also an option. A lot of problems require classifying a scenario into predefined classes. In such cases, all the algorithms discussed above (except Linear Regression) can be deployed. In a scenario where the classes are unknown, segmentation algorithms such as K-Means, DBSCAN etc. can be used.
Q2. Is the business objective Operational or Strategic?
It is important to understand whether the business solution is required immediately or is needed to understand a business trend in order to take strategic steps. When the business objective is operational, the solution is required in a very short span of time, sometimes even within seconds. Here the need for the algorithm to be interpretable is low and the need for good accuracy is very high, and the output of the algorithm is the final solution. For example, identifying whether a bank transaction is fraudulent is, as a business problem, a binary classification problem (Fraud: Yes/No), but the business objective is operational. Here Machine and Deep Learning algorithms such as Random Forest, Bagging, Boosting, ANN etc. can be handy. Other business problems have a Strategic objective, such as identifying the causes of low profit. Here the requirement is to deploy an algorithm with a high level of interpretability, which helps in understanding the drivers of low profit. The algorithm does not provide the final solution but acts as a step and a catalyst for the management to solve the actual business problem. Algorithms such as Linear Regression or Decision Trees are best suited to such scenarios.
Q3. How will the model be implemented to solve the business problem?
The way in which the algorithm will be implemented also influences its selection. If detailed reports will be created out of the results, or the data will only be updated after a long interval, then the traditional algorithms are the right choice. However, if the model has to be updated frequently because the data changes continuously, and the business problem requires some kind of real-time scoring, then machine learning algorithms, whose decision boundary can be altered continuously, are the right choice. Algorithms that can be updated easily and frequently and that build upon previous knowledge, especially ANN, are the best ones to go for.
Q4. Does the algorithm’s capability match the requirements of the business problem?
Every business problem has its own set of unique constraints. In business problems where text is involved, the data is generally in very high dimensions, and here an algorithm such as Naive Bayes can be used. If the business problem involves multimedia, then deep learning algorithms such as ANN can be used. If the problem is one of binary classification, then SVM is a strong choice, while for multi-class classification problems Ensemble learning methods along with ANN and Naive Bayes can be deployed. If the solution eventually has to be shown to a non-data-science / non-statistical audience, then algorithms with high interpretability are the right ones. All such factors determine whether an algorithm is good enough to solve a business problem, regardless of the qualities it has or lacks in isolation.
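The questions can be turned into a simple filtering routine. The sketch below is only an illustration: the tags in PROFILES are a loose reading of the pros and cons discussed in this article, not a definitive ranking, and the function shortlist and its parameters are hypothetical. It covers Q1, Q2 and Q4. Q3 (how the model will be implemented) is left out to keep the example short.

```python
# Illustrative tags only, loosely following the discussion in this article.
# name: (problems it can address, highly interpretable?, data kinds it is suggested for)
PROFILES = {
    "Linear Regression":   ({"regression"}, True, {"tabular"}),
    "Logistic Regression": ({"classification"}, True, {"tabular"}),
    "Decision Tree":       ({"regression", "classification"}, True, {"tabular"}),
    "SVM":                 ({"classification"}, False, {"tabular"}),
    "KNN":                 ({"regression", "classification"}, False, {"tabular"}),
    "Naive Bayes":         ({"classification"}, False, {"tabular", "text"}),
    "Random Forest":       ({"regression", "classification"}, False, {"tabular"}),
    "Boosting":            ({"regression", "classification"}, False, {"tabular"}),
    "ANN":                 ({"regression", "classification"}, False, {"tabular", "multimedia"}),
    "K-Means":             ({"segmentation"}, False, {"tabular"}),
}


def shortlist(problem, objective, data_kind="tabular"):
    """Filter the catalogue by problem type (Q1), objective (Q2) and data kind (Q4)."""
    names = []
    for name, (problems, interpretable, data_kinds) in PROFILES.items():
        if problem not in problems:
            continue
        # A strategic objective is about explaining drivers, so keep interpretable models only.
        if objective == "strategic" and not interpretable:
            continue
        if data_kind not in data_kinds:
            continue
        names.append(name)
    return names


print(shortlist("regression", "strategic"))
print(shortlist("classification", "operational", data_kind="text"))
print(shortlist("classification", "operational", data_kind="multimedia"))
print(shortlist("segmentation", "operational"))
```

A lookup like this only narrows the field. The final choice still depends on trying the shortlisted algorithms on the actual data and weighing the trade-offs described above.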
The Way Forward
An understanding of the various algorithms is very important, as it helps in choosing the right algorithm and in getting the best accuracy out of it. However, it is also important to understand where the application of such algorithms is heading. For example, ‘Hunga Bunga’ (in beta stage), a third-party package built on top of scikit-learn, tests and compares a long list of the models available in scikit-learn once the user feeds it pre-processed data. This allows the user to apply all the major algorithms in a single go and get a list of the accuracy provided by each one. Still, this cannot replace the need for a Data Scientist with sound knowledge of the various algorithms, who knows the advantages and limitations of each and can implement them properly.