Data Science is the art of getting actionable insights from various forms of data. It is a stack of interconnected tasks: data gathering, data manipulation, data insights, data visualization, statistical analysis, applied statistics, Machine Learning, Deep Learning and AI.
An organization looking to implement a subset of these tasks, or the entire stack, has numerous tools to choose from.
Whenever Data Science and its related developments come up in the tech world, one language is mentioned almost instantly: Python. Its compatibility and easy-to-use syntax have made it one of the most popular languages in the Data Science realm.
Python has been a tool of choice for new learners, for people doing research and exploration, and for industry-level implementations of the Data Science stack.
Python as a Data Science tool of choice
Python is an open source, object-oriented, general-purpose scripting language capable of addressing the problems and implementing the methodologies involved in a Data Science stack. If we catalogue what makes Python the tool of choice, many features come into the picture: its open source nature, ease of coding, scripting potential, portability, compatibility, platform independence and community support have all played a major role in its rise to become one of the most preferred tools.
But one feature that has made the most difference is Python’s Modularity.
The Python ecosystem comprises thousands of modules and packages that are designed to perform tasks, improve existing ones and add capabilities to the general-purpose language that Python is.
Python code can be split across reusable files called modules, which have the extension .py and contain functions, classes, variables etc. Modules can be further grouped together inside a folder, which is known as a Python package. In Python code, we can import both modules and packages and use the functions they provide.
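The idea of a module can be sketched in a few lines. The example below writes a tiny module to a temporary folder and imports it; the module name geometry_demo and its contents are hypothetical and exist only for illustration.

```python
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder holding one module file (a stand-in for a real .py file).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "geometry_demo.py").write_text(
    "PI = 3.14159\n"
    "\n"
    "def circle_area(radius):\n"
    "    return PI * radius ** 2\n"
)

# Make the folder importable, then import the module like any other.
sys.path.insert(0, str(module_dir))
import geometry_demo

# Variables and functions defined in the module are reached through its name.
area = geometry_demo.circle_area(2.0)
print(round(area, 2))
```

In a real project the file would simply sit next to the script, or inside a package folder, and the sys.path step would not be needed.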
Packages (and modules) have been instrumental in making Python versatile across application areas. The best part is that many of these packages are developed and maintained by a strong community of enthusiastic users and are also backed by organizations like Google, Facebook, Apache, Microsoft etc.
This article describes the important Python libraries in the Data Science stack.
1. Data Gathering
Data collection or gathering is the task of collecting the necessary data so that we can proceed with the analysis. This data can include auto-generated data like transactions, customer-related data and sample data collected via various sampling principles. In this section we focus on those Python libraries which are instrumental in collecting data from the Internet.
Scrapy is a high-performance, open source Python framework meant for large-scale web crawling and scraping. It comes with much common functionality built in, which reduces the amount of code the programmer has to write. The user writes a crawler, known as a spider, which can be deployed on one’s own server.
bs4 (Beautiful Soup 4) is a Python library meant to extract information from an HTML/XML file. It relies on a secondary tool to fetch a website’s source code, parses that source into a BeautifulSoup object using a parser such as lxml or html5lib, and then exposes attributes and methods with which one can extract data from it.
bs4 is light and portable compared with the heavier Scrapy, but it is equally effective and popular.
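A minimal sketch of the bs4 workflow, assuming the beautifulsoup4 package is installed. To stay self-contained it parses a hard-coded HTML string (made up for this example) with Python's built-in html.parser rather than fetching a live page.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; in practice this would come from an HTTP client.
html = """
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="item"><a href="/p/1">Keyboard</a></li>
    <li class="item"><a href="/p/2">Mouse</a></li>
  </ul>
</body></html>
"""

# Parse the source into a navigable BeautifulSoup object.
soup = BeautifulSoup(html, "html.parser")

# Pull out the heading text and every (label, link) pair from the list items.
title = soup.h1.get_text(strip=True)
links = [
    (li.a.get_text(strip=True), li.a["href"])
    for li in soup.find_all("li", class_="item")
]

print(title)
print(links)
```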
2. Data Manipulations and EDA
We often hit roadblocks caused by inconsistencies in the data. Data can be deficient, erroneous, missing or unrelated, and it needs to be corrected and modified. In this process we impute missing values, remove outliers, drop redundant records, ensure proper data types and more.
Once this hurdle is crossed, we explore the data to discover patterns, summarise past data, and understand both single variables and their relationships with one or more of the remaining variables, with the help of summary statistics, hypothesis tests and graphical representations. This process is commonly referred to as Exploratory Data Analysis (EDA).
NumPy (Numerical Python) is the fundamental package for mathematical functions in Python. It supports multi-dimensional arrays, matrix operations and linear algebra, along with fundamental mathematics.
It uses an array object called ndarray, which stores elements contiguously for fast access (good “locality of reference”) and supports vectorization and broadcasting (vectorized mathematics). This is an added advantage in Data Science operations, where speed and resources are paramount.
In short, NumPy gives Python MATLAB-style array computing. Some of the important tasks in NumPy include, but are not limited to:
- Array operations – slicing, filtering etc
- Basic arithmetic operations in arrays
- Basic statistics – central tendency, dispersion (skewness is provided by scipy.stats rather than NumPy itself)
- Importing and exporting files
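The array operations listed above can be sketched as follows. The scores array is made-up data used only for illustration.

```python
import numpy as np

# Made-up exam scores: 3 students (rows) x 3 subjects (columns).
scores = np.array([[72, 85, 90],
                   [60, 75, 88],
                   [95, 91, 79]], dtype=float)

# Slicing: the first two rows.
first_two_rows = scores[:2]

# Boolean filtering: every score above 80, returned as a 1-D array.
high = scores[scores > 80]

# Basic statistics computed per column.
col_means = scores.mean(axis=0)
col_std = scores.std(axis=0)

# Broadcasting: the (3,) vector of means is subtracted from each row of the (3, 3) array.
centered = scores - col_means

print(high)
print(col_means)
```

No explicit loop is written anywhere: the arithmetic is vectorized over the whole array.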
A major limitation of plain Python for the Data Science stack is the lack of native support for tabular (relational) data. pandas overcomes this: it is an open source, flexible data manipulation tool built on top of NumPy. It provides two major data structures: Series, a one-dimensional labelled array holding a single data type, and DataFrame, a two-dimensional labelled structure whose columns can hold different types.
Most data manipulation and EDA tasks can be accomplished with pandas. Some of the important tasks in pandas are –
- Data importing using its pd.read_*() functions, e.g. pd.read_csv()
- Filtering, sorting and removing duplicates
- Selecting, adding, removing rows and columns
- Imputing missing values, capping outliers
- Data merging and concatenations
- Data reshaping
- Aggregations (groupby())
- Data visualization using the plot(kind="…") method, built on top of matplotlib.
- Exporting data, and many more
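Several of the tasks above can be chained in a short sketch. The two tables, their column names and the outlier cap of 1000 are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Made-up order data with a duplicated record, a missing value and an outlier.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer": ["ann", "bob", "bob", "ann", "cara"],
    "amount": [120.0, np.nan, np.nan, 80.0, 5000.0],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cara"],
    "city": ["Pune", "Delhi", "Pune"],
})

# Remove duplicate orders.
clean = orders.drop_duplicates(subset="order_id").copy()

# Impute the missing amount with the median, then cap outliers at 1000.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
clean["amount"] = clean["amount"].clip(upper=1000.0)

# Merge with the customer table and aggregate per city.
merged = clean.merge(customers, on="customer", how="left")
city_totals = merged.groupby("city")["amount"].sum()

print(city_totals)
```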
2.3 SciPy library
The name SciPy is also used for the ecosystem of Python libraries and software used for scientific computing: NumPy, SciPy, IPython, SymPy, pandas and Matplotlib. The SciPy library is one part of this ecosystem. It provides functions for interpolation, statistics, optimization, integration and linear algebra that operate on NumPy ndarrays and pandas Series/DataFrames.
Among its many mathematical functions, the most prominent use of the SciPy library (scipy) is in hypothesis testing. It provides routines for t-tests, F-tests, chi-square tests etc. via the scipy.stats module.
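As a rough illustration of scipy.stats, the sketch below runs a two-sample t-test and a chi-square test of independence. The samples are simulated and the contingency table is made up, so the numbers carry no real-world meaning.

```python
import numpy as np
from scipy import stats

# Simulated measurements for two groups whose true means differ by 3.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=5.0, size=200)
group_b = rng.normal(loc=53.0, scale=5.0, size=200)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Chi-square test of independence on a made-up 2x2 contingency table.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"chi2 = {chi2:.2f}, p = {chi_p:.4f}, dof = {dof}")
```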
Matplotlib is a Python library for creating good-quality 2-D visualizations. It is built on the NumPy library and is part of the broader SciPy ecosystem. Introduced in 2002, matplotlib is the fundamental plotting package, and its ability to generate a wide variety of graphs (scatter plots, bar/column graphs, pie charts, box-and-whisker plots, histograms) has made it instrumental in producing graphics in Python.
Matplotlib’s pyplot module is a collection of functions that create and modify plots. Each pyplot function changes some aspect of a figure: e.g., creating a plot, creating a sub-plotting area, decorating the plot with labels, adding text to the plotting area, etc.
Some of the common tasks that are done using pyplot are –
- Define figure, X and Y axes, multiple plots etc.
- Dedicated functions for each graph type, e.g. pyplot.scatter(), pyplot.bar(), pyplot.pie() etc.
- Adding axes labels, axes ticks, define axes limits etc.
- Adding titles, sub-titles, sub text.
- Add text/labels over the plot
- Viewing the plot in an interactive window with the .show() function
- Exporting or saving the image as a png/jpg file etc.
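The tasks above might look like this in practice. The sketch creates the figure with pyplot.subplots() and then uses the returned Axes object, which offers the same operations as the pyplot functions. The data points are made up, and the non-interactive Agg backend is selected so the code runs without a display; the image is saved to an in-memory buffer instead of a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt

# Made-up data points.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

# Define the figure and axes, then draw a scatter plot.
fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(x, y, label="observations")

# Labels, limits, title and a text annotation over the plot.
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_xlim(0, 6)
ax.set_title("Scatter plot sketch")
ax.annotate("peak", xy=(5, 6), xytext=(3.5, 5.8))
ax.legend()

# Export the image as PNG (here to a buffer; a file name works the same way).
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```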
Seaborn is a graphing library built on matplotlib: it uses the pyplot canvas and modification functions but has its own set of routines. Seaborn’s functions are more user-friendly than those of pyplot.
Because seaborn uses matplotlib internally, it can produce all the fundamental plots needed for EDA and machine learning tasks and also adds more plots to the arsenal. Beyond the basics, the plots available in seaborn include heatmaps, bubble charts, correlograms (pair plots), violin plots, swarm plots and joint plots. Charts such as word clouds, spider charts, tree plots and Venn diagrams are not part of seaborn and need other packages.
2.6 pandas plots
This gets a special mention in the visualisation section: the DataFrame.plot() function makes it easy to get graphs straight from the data. The kind argument of plot() selects the graph type, such as bar, pie, scatter or line, and the plots are drawn with matplotlib. The only requirement is that the DataFrame has labelled rows and columns, or is in a pivoted (wide format) form, and a plot is easy to produce.
2.7 Pandas Profiling
Pandas profiling is an open source Python module with which exploratory data analysis can be done in just a few lines of code. It generates interactive reports in a web-based format that can be exported or embedded into a web page or an IPython notebook.
Pandas profiling reports a range of analyses: data types, unique values, missing values, counts, sums, quantiles, central tendencies, measures of dispersion, skewness, frequent values, histograms, correlations between variables, heat map visualisations and other univariate analyses. Besides generating a rich report from a few lines of code, pandas profiling also raises warnings about issues such as missing values, high cardinality and zero values, which can be taken into account in the machine learning tasks that follow.
3. Predictive Modeling (Applied stats + ML)
In this section, we deal with predicting future data, otherwise known as Predictive Modelling, and the Python packages that play a role in it. Predictive modelling is broadly classified into four categories (Regression, Classification, Forecasting and Segmentation), and sometimes a problem mixes several of them. Python has strong libraries, such as scikit-learn, statsmodels and TensorFlow, that can deal effectively with these problems.
StatsModels provides classes and functions for estimating various statistical models and for conducting statistical tests. Built on top of NumPy and SciPy, it can fit Linear Regression, Generalized Linear Models and Generalized Estimating Equations, among others. It also produces graphs using the Matplotlib package.
StatsModels supports specifying model inputs with R-style formulas and returns R-style outputs, along with extensive compatibility with pandas DataFrames. In that sense, it makes it much easier for an existing R user to switch to Python. Models are easy to create and fit in a few lines of code and, most importantly, the output is presented in a manner that is easy to read and understand.
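A minimal sketch of the R-style formula interface, assuming statsmodels is installed. The data is simulated, with a known slope of 2.5 and intercept of 10, and the column names ad_spend and sales are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: sales depend linearly on ad_spend plus random noise.
rng = np.random.default_rng(42)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, size=200)})
df["sales"] = 10.0 + 2.5 * df["ad_spend"] + rng.normal(0, 5, size=200)

# R-style formula: response on the left, predictors on the right.
model = smf.ols("sales ~ ad_spend", data=df).fit()

# Estimated coefficients are returned as a pandas Series indexed by term name.
slope = model.params["ad_spend"]
intercept = model.params["Intercept"]

# The summary table resembles the output of R's summary(lm(...)).
print(model.summary())
```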
scikit-learn is a Python module based on NumPy and SciPy that contains simple but efficient tools for predictive data analysis. It provides a range of supervised and unsupervised learning algorithms through a consistent interface, along with cross-validation functionality to test the estimates.
Scikit-learn not only has numerous features but also focuses on production-level concerns such as code quality, collaboration, documentation and performance. Some of the functionality provided by scikit-learn includes, but is not limited to:
- Feature extraction
- Feature selection: for identifying meaningful variables for supervised learning.
- Dimensionality Reduction: for reducing the number of variables in data using algos like Principal component analysis etc.
- Supervised Models for Regression and Classification based machine learning models.
- Tuning the parameters
- Ensemble methods
- Clustering: for grouping the data into unknown groups by using algorithms like KMeans etc.
- Cross Validation: for estimating the performance of supervised models on unseen data
- Manifold Learning: For summarising and depicting complex multi-dimensional data
- Datasets: for test datasets to investigate model behaviour
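Several items from the list above (a bundled dataset, dimensionality reduction, a supervised model and cross-validation) can be combined through scikit-learn's consistent interface. The choice of the iris dataset, two principal components and logistic regression is an assumption made for this sketch, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled test dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Chain scaling, dimensionality reduction and a classifier into one estimator.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validation estimates performance on unseen data.
scores = cross_val_score(pipeline, X, y, cv=5)
mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```

Because every step exposes the same fit/transform/predict interface, swapping PCA or the classifier for another estimator only changes one line.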
XGBoost is an open source library that provides a gradient boosting framework with interfaces for various languages, including a dedicated package for Python. It is primarily designed for speed and performance.
XGBoost optimises the standard GBM algorithm using parallelization, regularization, tree pruning, a weighted quantile sketch and built-in cross-validation. It also supports continued training, where an already fitted model can be boosted further on new data.
The Python interface is provided by the xgboost module, which automatically performs parallel computation on a single machine and is often reported to be faster than other GBM implementations.
Although XGBoost is built for performance, it has its limitations when it comes to larger data. LightGBM, which is also a distributed GBM implementation, goes a long way towards addressing this; the difference is that it grows a tree leaf-wise rather than with the standard level-wise approach. This not only reduces processing overhead but can also reduce the loss more than level-wise growth does, thereby improving accuracy.
Apart from the leaf-wise approach, another feature of LightGBM is that it automatically bins continuous features, which speeds up training and drastically reduces memory usage. In short, with its ability to train quickly on large data and with good accuracy, LightGBM can be an alternative to XGBoost.
4. Natural Language Processing
NLP refers to a set of processes for working with text data, especially data that contains human language (reviews, tweets, transcripts etc.). An NLU (Natural Language Understanding) interface usually contains text processing functions for parsing a text based on its grammatical components and semantics, so that a computer can ultimately understand human text.
Further down the line, this processed data can be used to fit various statistical models for classification, segmentation, topic discovery etc. Some of the text processing tasks and functions include:
- Cleaning (spell checks, redundant words, remove unused symbols etc)
- Removing stopwords
- Parts of speech tagging
- Stemming and Lemmatization
- Getting usable matrices like Document Term Matrix, TF-IDF, Feature Vector etc.
- Creating advanced data structures like Word Vectors (Word2Vec, FastText etc).
- Statistical and Machine Learning algorithms, viz., Naive Bayes, Support Vector Classifier, kNN Classifier etc., applied to solve various business problems.
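Part of this workflow (stopword removal, a TF-IDF matrix and a Naive Bayes classifier) can be sketched with scikit-learn, which is used here instead of a dedicated NLP library only to keep the example short. The four reviews and their labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus of labelled reviews.
reviews = [
    "great product, works perfectly",
    "excellent quality and great value",
    "terrible experience, product broke",
    "awful quality, very disappointed",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features (English stopwords removed) feeding a Naive Bayes classifier.
classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    MultinomialNB(),
)
classifier.fit(reviews, labels)

# Classify an unseen review.
prediction = classifier.predict(["great value, excellent product"])[0]

# The fitted vectorizer can also return the document-term TF-IDF matrix.
tfidf_matrix = classifier.named_steps["tfidfvectorizer"].transform(reviews)

print(prediction)
print(tfidf_matrix.shape)
```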
The Natural Language Toolkit (NLTK) is the most fundamental library in Python built for the sole purpose of text preprocessing and NLU. It contains robust functions for tokenization, parsing, stemming, tagging, classification and lemmatization. It also provides interfaces to many corpora (text repositories) and lexical resources, giving the user easy access to them and hence a basis for building tools that understand human language.
Although NLTK is a robust library, it is mostly used in academia and for experimentation, and other modules can prove better suited to production-level implementations.
SpaCy is a more recent addition to the NLP realm. It is built for fast text processing and efficient implementation of the NLP stack. NLTK works in a traditional way, in the sense that it treats and processes text as regular strings. SpaCy, on the other hand, is implemented in Cython and works on its own document objects with built-in support for word vectors, which makes processing considerably faster.
SpaCy follows an object-oriented approach: after processing a text it returns a “document object” instead of a string, which offers a great deal of functionality through its attributes and methods. Despite its speed and power, SpaCy has historically been more limited than NLTK in language support; early releases supported only seven languages, although coverage has grown since.
Gensim is a Python library for NLP that is built specifically for topic modelling, document indexing and similarity retrieval on large corpora. Like SpaCy, it has extensive support for word vectors (Word2Vec, FastText etc.). Gensim is designed to stream data and can process a corpus without loading it entirely into memory.
At the most basic level, Gensim has the common NLP pre-processing functions and good support for bag of words, n-grams etc.
TextBlob is a Python library for text processing that provides a simple API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction etc. What makes TextBlob stand out from the other libraries are two of its high-value features: sentiment analysis and classification, and language translation powered by Google Translate. It works with text much like normal Python strings and also has routines to integrate with WordNet.
5. Deep Learning and Artificial Intelligence
Deep Learning is a subset of Machine Learning in which the computer “learns” real-world activities, tasks, actions etc. from large amounts of interconnected data. Deep learning enables computers to perform tasks that usually require human intelligence: Neural Networks are trained by example to capture the underlying decision-making process. Typically, the algorithm learns to perform classification and segmentation tasks using images, videos, text and sound.
For effective deep learning, we need a language that is not only capable of implementing Neural Networks but also has a proven record in handling large unstructured data. This is where Python helps, with its packages and data handling, lower code complexity, easier implementation, portability and support for code collaboration. Let’s walk through some of the indispensable packages available for deep learning in Python:
TensorFlow is an open source library from Google that encompasses the mathematical methods, machine learning algorithms and Neural Network implementations required for deep learning and AI. Beyond these functions, it has a streamlined architecture that allows code to be deployed on various CPUs and GPUs, or offered as AI as a service.
TensorFlow’s architecture enables users to train models in their own environment and deploy them on a network, in the cloud or on a mobile device. A model can also be trained on different machines and then deployed to multiple destinations. It comes with TensorBoard, a tool to visually monitor training and watch out for anomalies.
TensorFlow’s simple workflow and accessibility, together with the availability of pre-trained models, datasets etc., make it one of the most preferred libraries for Deep Learning.
Keras is an open source neural network library written in Python. While TensorFlow deals with a wide array of tasks, Keras focuses on high-level, efficient Neural Network APIs and is hence more user-friendly than some of the other available DL libraries. Keras has been able to run on top of various backends such as TensorFlow, Microsoft Cognitive Toolkit and Theano.
By virtue of its ease of implementation, Keras is used in academia, research, experimentation and various competitions that are held globally. However, one prominent application area where Keras finds its place is tool development. Keras helps to build a stand-alone system, such as a product or tool based on deep models that solves a business problem, and to offer it as a service.
PyTorch is an open source machine learning library based on the Torch library and developed and maintained by Facebook’s AI Research lab. It is primarily used for computer vision and NLP via its CNN and RNN implementations. PyTorch is easy to work with thanks to its Pythonic style of coding; it therefore improves developer productivity, is easy to debug and has strong support for parallel processing of data.
Model optimization is often easier with PyTorch because of its Dynamic Computational Graphs, where the network behavior can be changed at runtime. The process is never left inside a black box, because users can inspect each and every step in the workflow.
PyTorch is as user-friendly as Keras and lighter than TensorFlow for rapid prototyping. However, it is considered to fall short when it comes to building production-ready, deployable solutions.
5.4 Apache MXNet
Apache MXNet is an open source deep learning framework used to train and deploy deep neural networks. It is developed under the Apache Software Foundation. It is light, flexible and scalable, and supports CNN, RNN and LSTM models. MX stands for mix and maximize. With its Gluon library, MXNet provides a high-level interface that makes it easy to prototype, train and deploy deep learning models without compromising on training speed. MXNet-Gluon has been reported to be at least 1.5 times faster than TensorFlow, although such figures depend on the model and hardware.
The prominent application area of Apache MXNet is IoT-based analytics. Its lazy evaluation, pruning, quantization, compression, interoperability with Amazon AWS and other optimization features, such as support for acceleration libraries like Intel MKL or NNPACK, make it IoT friendly.
Many mathematical operations in the Python Data Science stack depend heavily on NumPy, and Deep Learning is no exception. Theano is an optimising compiler for defining and evaluating mathematical expressions, especially those involving multi-dimensional arrays, using NumPy-like syntax. Its ease in evaluating mathematical models has made it a backend of choice for wrapper libraries such as Keras.
5.6 Microsoft Cognitive Toolkit
Microsoft Cognitive Toolkit (CNTK) is an open source library that comprises all the basic building blocks needed to build a neural network. One of the major differences between Cognitive Toolkit and other libraries is that it has elaborate low-level APIs as well as high-level APIs. The high-level APIs are meant for easy implementation by end users, while the core low-level APIs can be modified and restructured for finer control over neural nets.
CNTK is known for its batch loading of datasets, which makes it easy to handle large volumes of data without spending many resources. It can interface with Keras and is well suited to production-ready applications in areas such as image and video processing, text processing etc.
MS Cognitive Toolkit is reinforced by its ability to integrate with Azure apps and also has APIs for C++ and Java deployment, making it a highly scalable interface. CNTK is backed by Microsoft, and it is a preferred deep learning implementation for organizations that run on Azure.
Given the vastness of Python’s capabilities, the extensive development efforts and the never-ending scope of Data Science in general, it would be unfair to treat the packages mentioned above as the absolute boundaries of the Data Science stack. There are also numerous helper packages in the stack that provide quick but crucial help; owing to their limited usage, they are often left out of such discussions.
A beginner learning Python for Data Science should keep one thing in mind: the overwhelming number of packages is no reason to worry. Python code is inherently easy, and the packages are designed so that using them is clear and straightforward.
Since 2019, extensive efforts have been made towards the democratization of AI, AutoML, auto visualization and augmented AI. Emerging packages like autoviz, automl, autosklearn, Chartify, jupytext, spacy, Optimus, SHAP, TPOT and AdaNet offer promise in the field of AutoML and auto visualization, helping us get rid of some of the repetitive and labor-intensive tasks.
If Data Science is the complex art of getting actionable insights from various forms of data, then Python is the artist.