What Is Data Preprocessing in Machine Learning, and Its Importance?
Introduction
In the real world, the data we work with is raw: it is not clean and needs processing before it can be passed to a machine learning model. You may have heard that 80% of a data scientist's time goes into data preprocessing and only 20% into model building. This is a commonly cited rule of thumb, and in practice data preparation does take up most of the effort. This article explains what data preprocessing in machine learning is, why it is needed, how to clean data, and which data preprocessing techniques are available. It then walks through the data preprocessing steps in machine learning and closes with some of the most frequently asked questions. If you want to know more about what Machine Learning is, its applications and use cases, then refer to our previous blogs on What is Machine Learning, the prerequisites to learning Machine Learning and Applied AI, and Different Types of Algorithms of Machine Learning.
Table of Contents
- What Is Meant by Data Preprocessing in Machine Learning?
- Why do we need Data Preprocessing in Machine Learning?
- Which are the Data Preprocessing Techniques?
- Data Preprocessing Steps in Machine Learning
- Concluding Thoughts
- FAQs – Frequently Asked Questions
AnalytixLabs, India’s top-ranked AI & Data Science Institute, is led by a team of IIM, IIT, ISB, and McKinsey alumni. The institute provides a wide range of data analytics courses inclusive of detailed project work which helps an individual to be fit for the professional roles in AI, Data Science, and Data Engineering. With its decade of experience in providing meticulous, practical, and tailored learning, AnalytixLabs has proficiency in making aspirants “industry-ready” professionals.
1. What Is Meant by Data Preprocessing in Machine Learning
The machine learning workflow is shown below. As you can see, after the data is collected and the different data sources are combined, data preprocessing comes first in the pipeline. Let's understand what exactly data preprocessing means.
Source: subscription.packtpub.com
Data preprocessing in machine learning is the process of preparing the raw data to make it ready for model building. It is the first and the most crucial step in any machine learning process.
2. Why do we need Data Preprocessing in Machine Learning?
Data preprocessing is needed because good data often matters more than good models, so the quality of the data is of paramount importance. Companies and individuals therefore invest a lot of their time in cleaning and preparing data for modeling. Real-world data has many quality issues: it is noisy, inaccurate, and incomplete. It may lack relevant, specific attributes and could have missing values, or even incorrect and spurious values. Preprocessing is essential to improve the quality of the data. It makes the data consistent by eliminating duplicates and irregularities, normalizes the data so that values can be compared, and improves the accuracy of the results.
Machines understand the language of numbers, ultimately binary 1s and 0s. Nowadays, most of the data that is generated and available is unstructured, meaning it is not in tabular form and has no fixed structure. The most common form of unstructured data is text, which comes as tweets, posts, and comments. We also get data as images and audio. Such data is not in a format that can be readily ingested by a model, so we need to convert or transform it so that the machine can interpret it. To reiterate, data preprocessing is a crucial step in the Data Science process.
3. Which are the Data Preprocessing Techniques?
The data preprocessing techniques in machine learning can be broadly segmented into two parts: Data Cleaning and Data Transformation. The following flow-chart illustrates these data preprocessing techniques and steps in machine learning:
Source: ai-ml-analytics
3.1 Data Cleaning/Cleansing
As we have seen, real-world data is not always complete, accurate, correct, consistent, and relevant. The first and primary step is to clean the data. This stage involves several steps:
- Making the data consistent across the values, which can mean:
  - Correcting the data types. The attributes may have incorrect data types that are not in sync with the data dictionary, and these must be corrected before proceeding with any other data cleaning.
  - Replacing special characters, for example removing the $ and comma signs in a Sales/Income/Profit column, i.e. turning $10,000 into 10000.
  - Making the format of the date column consistent with the format expected by the tool used for data analysis.
- Checking for null or missing values, and also for negative values. Whether a negative value is valid depends on the data. In an income column, a negative value is spurious, while the same negative value in a profit column represents a loss.
- Smoothing the noise present in the data by identifying and treating outliers.
- Errors may also occur at the
Please note that the above steps are not comprehensive. The data cleaning steps vary and depend on the nature of the data. For instance, text data consisting of, say, reviews or tweets would have to be cleaned by converting the words to the same case, removing punctuation marks and special characters, removing common words, and differentiating words based on their parts of speech. Now, let's understand how to handle the missing values and outliers in the data.
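The consistency fixes listed above can be sketched in pandas as follows. This is a minimal illustration on made-up data: the column names `income` and `joined` and all the values are assumptions for the example, not part of the loans dataset used later.

```python
import pandas as pd

# Hypothetical raw data; column names and values are illustrative only.
raw = pd.DataFrame({
    "income": ["$10,000", "$25,500", "-$3,000"],
    "joined": ["2021-01-15", "2021-02-20", "2021-03-05"],
})

# Strip the currency symbol and thousands separators, then convert to a number.
raw["income"] = (
    raw["income"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Parse the date strings into a proper datetime column.
raw["joined"] = pd.to_datetime(raw["joined"], format="%Y-%m-%d")

# Flag negative incomes for review instead of silently dropping them.
negative_income = raw[raw["income"] < 0]
print(negative_income)
```

Whether the flagged rows are then corrected, removed, or kept depends on the domain, as discussed above for income versus profit.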
3.1.1 Handling the Null/Missing Values
The null values in the dataset are imputed using mean/median or mode based on the type of data that is missing:
- Numerical Data: If a numerical value is missing, replace that NaN value with the mean or the median. The median is usually preferred because the mean is influenced by outliers and skewness in the data and is pulled in their direction.
- Categorical Data: When categorical data is missing, replace that with the value which is most occurring i.e. by mode.
Now, if a column has, let’s say, 50% of its values missing, then do we replace all of those missing values with the respective median or mode value? Actually, we don’t. We delete that particular column in that case. We don’t impute it because then that column will be biased towards the median/mode value and will naturally have the most influence on the dependent variable. This is summarized in the chart below:
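The rules above can be sketched in pandas as follows. The data is made up, and treating "50% or more missing" as the cut-off for dropping a column is an assumption taken from the example in the text; the right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is 80% missing, the other columns have one gap each.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, 90],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],
    "notes": [None, None, None, "ok", None],
})

# Drop columns whose share of missing values reaches the (assumed) 50% threshold.
threshold = 0.5
missing_share = df.isnull().mean()
df = df.loc[:, missing_share < threshold].copy()

# Numerical column: impute with the median, which is robust to outliers.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode (the most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```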
3.1.2 Outliers Treatment
To check for the presence of outliers, we can draw a box plot. To treat the outliers, we can either cap the data or transform the data:
- Capping the data:
We can place cap limits on the data using three approaches. Oh yes! There are a lot of ways to deal with the data in machine learning 😀 So, we can cap via:
- Z-Score approach: All the values more than 3 standard deviations above or below the mean are treated as outliers and can be removed.
- Transforming the data:
There are numerous techniques available to transform the data. Some of the most commonly used are:
- Logarithmic transformation
- Exponential transformation
- Square Root transformation
- Reciprocal transformation
- Box-cox transformation
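As a rough sketch, several of these transformations can be applied with NumPy and SciPy. The sample values are made up, and the exponential transformation is left out because it amplifies large values rather than compressing them. Note the domain restrictions in the comments: the logarithm and Box-Cox require strictly positive data.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample with one very large value.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 200.0])

log_x = np.log(x)                 # logarithmic; requires x > 0
sqrt_x = np.sqrt(x)               # square root; requires x >= 0
recip_x = 1.0 / x                 # reciprocal; requires x != 0
boxcox_x, lam = stats.boxcox(x)   # Box-Cox; requires x > 0, lambda is estimated

# Compare skewness before and after the log transform.
print("skew before:", stats.skew(x))
print("skew after log:", stats.skew(log_x))
```

Comparing the skewness before and after is a quick way to check whether a given transformation actually helped for the variable at hand.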
3.2 Data Transformation
Data transformation is different from feature transformation; the latter replaces existing attributes with a mathematical function of those attributes. The data transformation we focus on here makes the numerical and the categorical data machine-ready. This is done in the following manner:
3.2.1 Numerical data
The numerical data is scaled, meaning we bring all the numerical data onto the same scale. For example, the loan amount to give to a customer depends on variables such as age, salary, and number of working years. If we build a linear regression model for this problem, we cannot compare the beta coefficients of these variables directly because each variable is on a different scale. Hence, scaling the variables is essential. The two ways to scale data are Standardization and Normalization.
- Standardization: The numerical data is scaled using the Z-score, Z = (x - mean)/standard deviation. The scaled data has a mean of 0 and a standard deviation of 1; for roughly normal data, most values fall in the interval of -3 to 3.
- Normalization: Here, the scaling uses the formula (x - min)/(max - min), which brings the data into the range 0 to 1. This is also known as Min-Max scaling.
3.2.2 Categorical Data
Categorical data cannot be fed directly into the model. As we have seen, machines work with numbers, ultimately 1s and 0s. So, to use categorical data in the model building process, we need to create dummy variables. Dummy variables are binary: they take either the value 1 or the value 0. If a categorical column has n sub-categories, we use n-1 dummy variables, because the remaining category is implied when all the dummies are 0, and keeping all n would make the columns perfectly collinear in a linear model. There are two ways to create dummy variables:
- Pandas’ function: pd.get_dummies, and
- sklearn’s in-built function of OneHotEncoder
There is one more way of dealing with the categorical data, which is to use label encoding. The label encoder does not create dummy variables. However, it labels the categorical variable by numbers like below:
- Delhi –> 1
- Mumbai –> 2
- Hyderabad –> 3
Label encoding has a limitation: it converts nominal data, which is categorical data without any order, into ordinal data, which has an order. In the above example, the three cities have no order. However, after applying the label encoder they have the values 1, 2, and 3, respectively. The machine will treat these numbers as weights, so 3 > 2 > 1 implies Hyderabad > Mumbai > Delhi. Because of this limitation, nominal categorical data is usually handled by creating dummy variables.
4. Data Preprocessing Steps in Machine Learning
The steps in data preprocessing in machine learning are:
- Consolidation after acquisition of the data
- Data Cleaning:
- Convert the data types if there is any mismatch in the data types of the variables
- Change the format of the date variable to the required format
- Replace the special characters and constants with the appropriate values
- Detection and treatment of missing values
- Treatment of negative values, if any are present, depending on the data
- Outliers detection and treatment
- Transformation of variables
- Creation of new derived variables
- Scale the numerical variables
- Encode the categorical variables
- Split the data into training, validation, and test set
We will look at the above data preprocessing steps in machine learning with an example below, working with a dataset on loans. All the steps and code can be accessed from my GitHub repository. The loans data has the following features and shape:
4.1 Importing the Data
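A minimal sketch of loading the data with pandas is shown here. The original file is not reproduced in this article, so the CSV text below is a made-up stand-in: only `loan_start` and `loan_end` are column names mentioned in this article, and the remaining names and all values are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; with a real file, pass its path to pd.read_csv.
# Column names other than loan_start and loan_end are hypothetical.
csv_text = """loan_id,loan_type,loan_amount,repaid,loan_start,loan_end
1,home,13000,0,2002-04-16,2003-12-20
2,credit,9800,1,2003-10-21,2005-07-17
3,cash,12700,1,2006-02-01,2007-07-05
"""

loans = pd.read_csv(io.StringIO(csv_text))

print(loans.shape)   # number of rows and columns
print(loans.dtypes)  # data type inferred for each column
```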
4.2 Conversion of the Data Types
After checking the data types of the columns, convert the data types of the following features:
4.3 Missing Values
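The list of converted features is not reproduced here, so the following is only a sketch of typical conversions on a made-up frame. Apart from `loan_start` and `loan_end`, the column names are hypothetical.

```python
import pandas as pd

# Hypothetical slice of the loans data; values are made up.
loans = pd.DataFrame({
    "loan_id": [10243, 10984, 10990],
    "repaid": [0, 1, 1],
    "loan_start": ["2002-04-16", "2003-10-21", "2006-02-01"],
    "loan_end": ["2003-12-20", "2005-07-17", "2007-07-05"],
})

# Identifiers are labels, not quantities, so store them as strings.
loans["loan_id"] = loans["loan_id"].astype(str)

# A 0/1 flag is categorical rather than numeric.
loans["repaid"] = loans["repaid"].astype("category")

# Date strings become datetime values so that date arithmetic works.
loans["loan_start"] = pd.to_datetime(loans["loan_start"], format="%Y-%m-%d")
loans["loan_end"] = pd.to_datetime(loans["loan_end"], format="%Y-%m-%d")

print(loans.dtypes)
```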
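One possible way to detect and impute missing values is scikit-learn's `SimpleImputer`, sketched below on made-up data. The column names `loan_amount` and `loan_type` are hypothetical, and the choice of median for numerical and most frequent value for categorical columns follows the guidance in section 3.1.1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap in each column.
loans = pd.DataFrame({
    "loan_amount": [5000.0, np.nan, 12000.0, 7000.0],
    "loan_type": ["home", "credit", np.nan, "home"],
})

# Count the missing values per column before deciding how to treat them.
print(loans.isnull().sum())

# Numerical column: replace missing values with the median.
num_imputer = SimpleImputer(strategy="median")
loans["loan_amount"] = num_imputer.fit_transform(loans[["loan_amount"]]).ravel()

# Categorical column: replace missing values with the most frequent value (mode).
cat_imputer = SimpleImputer(strategy="most_frequent")
loans["loan_type"] = cat_imputer.fit_transform(loans[["loan_type"]]).ravel()

print(loans)
```

An advantage of an imputer object over a one-off fill is that the value learned on the training data can be reused on new data through `transform`.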
4.4 Outliers
Z-score:
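A minimal sketch of the Z-score approach on synthetic data follows. The `loan_amount` column, the generated values, and the two planted outliers are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic loan amounts around 10,000 plus two planted outliers.
rng = np.random.default_rng(0)
amounts = rng.normal(loc=10_000, scale=1_500, size=200)
amounts = np.append(amounts, [60_000, 75_000])
loans = pd.DataFrame({"loan_amount": amounts})

# Z-score: how many standard deviations each value lies from the mean.
z = (loans["loan_amount"] - loans["loan_amount"].mean()) / loans["loan_amount"].std()

# Keep only the rows within 3 standard deviations of the mean.
filtered = loans[z.abs() <= 3]

print(loans.shape, filtered.shape)
```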
IQR Method:
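A minimal sketch of the IQR method on made-up values follows. The 1.5 × IQR fences used here are the conventional choice and are an assumption, since the multiplier used in the original code is not shown.

```python
import pandas as pd

# Hypothetical loan amounts with one obvious outlier.
loans = pd.DataFrame(
    {"loan_amount": [4000, 5000, 5500, 6000, 6500, 7000, 8000, 50000]}
)

# Interquartile range: the spread of the middle 50% of the data.
q1 = loans["loan_amount"].quantile(0.25)
q3 = loans["loan_amount"].quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows inside the fences.
filtered = loans[loans["loan_amount"].between(lower, upper)]

print(loans.shape, filtered.shape)
```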
A crude way to know whether the outliers have been removed or not is to check the dimensions of the data. From both the above outputs, we can see that the data dimensions are reduced, which implies the outliers are removed.
4.5 Scaling the Numerical Values
Standardization or Z-Score approach:
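Standardization can be sketched with scikit-learn's `StandardScaler`. The columns `loan_amount` and `rate` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical columns on very different scales.
loans = pd.DataFrame({
    "loan_amount": [5000.0, 7000.0, 12000.0, 9000.0],
    "rate": [1.5, 2.0, 3.5, 2.5],
})

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
standardized = pd.DataFrame(scaler.fit_transform(loans), columns=loans.columns)

print(standardized)
```

`StandardScaler` uses the population standard deviation (ddof=0), so the result can differ slightly from a manual calculation based on pandas' default `std()`, which uses ddof=1.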
Normalizing or Min-Max Scaler:
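Normalization can be sketched with scikit-learn's `MinMaxScaler`, again on made-up columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical columns on very different scales.
loans = pd.DataFrame({
    "loan_amount": [5000.0, 7000.0, 12000.0, 9000.0],
    "rate": [1.5, 2.0, 3.5, 2.5],
})

# Apply (x - min) / (max - min) to each column.
scaler = MinMaxScaler()
normalized = pd.DataFrame(scaler.fit_transform(loans), columns=loans.columns)

print(normalized)
```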
4.6 Encoding the Categorical Variables
pd.get_dummies approach:
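A minimal sketch with `pd.get_dummies` follows, using the three cities from the label encoding example. The `city` and `loan_amount` columns are assumptions for illustration. Passing `drop_first=True` gives the n-1 dummy variables described in section 3.2.2.

```python
import pandas as pd

# Hypothetical data with one nominal column.
loans = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Hyderabad", "Delhi"],
    "loan_amount": [5000, 7000, 12000, 9000],
})

# Three categories give two dummy columns; the dropped category (Delhi)
# is represented by both dummies being 0.
dummies = pd.get_dummies(loans, columns=["city"], drop_first=True, dtype=int)

print(dummies)
```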
One-Hot and Label Encoding:
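Both encoders can be sketched with scikit-learn on the same three cities. Note that `LabelEncoder` assigns codes starting from 0 in sorted order, so the numbers differ from the 1, 2, 3 used in the illustration in section 3.2.2, although the limitation is the same.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = np.array(["Delhi", "Mumbai", "Hyderabad", "Delhi"])

# One-hot encoding: one binary column per category, minus the first (n-1 dummies).
onehot = OneHotEncoder(drop="first")
onehot_matrix = onehot.fit_transform(cities.reshape(-1, 1)).toarray()

# Label encoding: a single integer code per category, in sorted order.
label = LabelEncoder()
labels = label.fit_transform(cities)

print(onehot.categories_)
print(onehot_matrix)
print(label.classes_, labels)
```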
4.7 Creation of New Variables
We can use the loan_start and loan_end features to calculate the tenure of the loan.
The tenure is currently a TimeDelta. We want the number of days as an integer, so we do the conversion as follows:
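A minimal sketch of both steps on made-up dates follows. The column names `tenure` and `tenure_days` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical loans with start and end dates already parsed as datetimes.
loans = pd.DataFrame({
    "loan_start": pd.to_datetime(["2020-01-01", "2021-03-01"]),
    "loan_end": pd.to_datetime(["2020-12-31", "2021-03-31"]),
})

# Subtracting two datetime columns gives a timedelta column.
loans["tenure"] = loans["loan_end"] - loans["loan_start"]

# Extract the whole number of days as an integer.
loans["tenure_days"] = loans["tenure"].dt.days

print(loans)
```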
4.8 Splitting data into Train-Test sets
After data preparation, we can proceed to build the model by dividing the data into three parts: one for training the model, one for validation, and one for testing. Training data is the data on which the machine learning algorithms are run to build the model. The model learns and identifies the hidden patterns in this dataset.
Validation data is used to validate the models that are built. It shows how the model performs: comparing the training and validation accuracy helps detect overfitting or underfitting. This data is used to improve the model and tune its hyperparameters.
Testing data is different from the above two sets. It is the unseen data on which the model is used to predict the values or classes, as the case may be.
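A three-way split can be sketched by calling scikit-learn's `train_test_split` twice. The 60/20/20 proportions, the synthetic features, and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a balanced binary target, for illustration only.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: 60% for training, 40% held back.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Second split: divide the held-back 40% equally into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

To avoid leaking information from the validation and test sets, scalers and imputers are usually fitted on the training set only and then applied to the other two sets.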
5. Concluding Thoughts
Data preprocessing in machine learning is the process of preparing raw data in a form that can be fed into a machine learning model. Data preprocessing is needed for the following reasons:
- The data is more relevant depending on the nature of the business problem.
- It makes the data more reliable and accurate by removing the incorrect, missing or the negative values (based on the domain of the data).
- The data is also more complete after treating for the missing values.
- The data becomes more consistent by eliminating any data quality issues and inconsistencies present in the data.
- The data is in a format that can be parsed by a machine.
- The features are more interpretable, so the readability and interpretability of the data improve.
FAQs – Frequently Asked Questions
Q1. What is the difference between balanced and imbalanced classes?
In balanced data, the number of observations belonging to each class in a classification problem is similar. Imbalanced data is where the number of observations belonging to each class is not equally distributed. In one class, the number of observations is significantly lower than in the other class.
Q2. Which is the correct sequence of data preprocessing?
The sequence of data preprocessing follows:
- Consolidation post data acquisition
- Data Cleaning:
- Convert the data types if there is any mismatch in the data types of the variables
- Change the format of the date variable to the required format
- Replace the special characters and constants with the appropriate values
- Detection and treatment of missing values
- Treatment of negative values, if any are present, depending on the data
- Outliers detection and treatment
- Transformation of variables
- Creation of new derived variables
- Scale the numerical variables
- Encode the categorical variables
- Split the data into training, validation, and test set
Q3. What is data cleaning in machine learning?
Data cleaning in machine learning creates reliable data by identifying inaccurate, incorrect, and irrelevant data and removing these errors, duplicates, and unwanted data. The data may also be spurious, with missing or negative values that can impact the model.
Q4. What is the key objective of data analysis?
The primary objective of data analysis is to find meaningful insights within the data that can be used to make well-informed and accurate decisions.
Q5. What is the difference between Scaling and Transformation?
| | Scaling | Transformation |
|---|---|---|
| Purpose | The goal is to make variables comparable, since scaled variables on the same band can be compared, and to improve computational efficiency. | Transformation helps reduce the skewness of skewed variables. In regression, if the assumptions of regression aren't met or if the relationship between the target and the independent variables is non-linear, transformation can be used to linearize it. |
| Impact on Data | Scaling does not change the shape of the data's distribution. The properties of the data remain the same; only the range of the independent variables changes. | Transformation changes the data, and with it the distribution of the data. |
| Impact on Skewness, Kurtosis, Outliers | As the shape of the distribution remains the same, skewness and kurtosis do not change. Scaling doesn't remove outliers. | Transformation can decrease the skewness. It brings values closer together, which can reduce the effect of outliers. |
Q6. What is the difference between Standardization and Normalization?
Standardization and Normalization are scaling techniques. Standardization rescales the data based on the Z-score, using the formula (x - mean)/standard deviation, so that the data has a mean of 0 and a standard deviation of 1; for roughly normal data, most values fall between -3 and 3. Normalization, also known as Min-Max scaling, scales the data using the formula (x - min)/(max - min). It brings the data into the range 0 to 1.
Q7. What is the difference between Label Encoding and One-Hot Encoding?
| | Label Encoding | One-Hot Encoding |
|---|---|---|
| How is the categorical data treated? | Labels the categories with numbers | Converts the data into dummy variables, i.e., binary variables with 1 or 0 as values |
| Example | Male: 1, Female: 2 | Male row: Var_Male = 1, Var_Female = 0; Female row: Var_Male = 0, Var_Female = 1 |
| How to use it in Python? | Via the sklearn class LabelEncoder | Dummies can be created with either sklearn's OneHotEncoder or the pandas function pd.get_dummies |
| Limitation of the method | Changes nominal data into ordinal data, so the values given to the categories act as weights and the machine gives them importance accordingly. | Creates a separate column for each category, which increases the dimensions of the data. |
| Solution available | Employ dummy creation or the One-Hot encoding technique | Use the various methods available for dimensionality reduction |
Q8. List the different feature transformation techniques.
The common feature transformation techniques are:
- Logarithmic transformation
- Exponential transformation
- Square Root transformation
- Reciprocal transformation
- Box-cox transformation