Hadoop is a low-cost, open-source software framework developed for reliable, scalable, distributed computing. The world is producing data in the range of zettabytes, and Hadoop is the knight in shining armor for this big data.
Over the years, new technologies and features have been added to the Hadoop ecosystem. The Hadoop ecosystem, in a nutshell, is the umbrella under which various processing technologies gather. Let us explore some of the prominent ones:
Hadoop Distributed File System (HDFS): As Hadoop's primary storage system, HDFS supports analytic applications for big data in a cheap, fast, and scalable way. Most machines in a Hadoop cluster are low-cost commodity hardware, on which server failure is expected rather than exceptional. To cope with this, data is replicated across several servers in different server racks so that it stays highly available. Moreover, HDFS enables parallel processing by breaking incoming data into smaller blocks that are assigned to separate nodes in the cluster, which speeds up analysis.
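To make the block-and-replica idea concrete, here is a minimal Python sketch. It is a toy model, not HDFS code: the function names, the cluster layout, and the round-robin placement are assumptions made for illustration, and the real HDFS placement policy is more involved. The 128 MB block size and the replication factor of 3 are the usual HDFS defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default block size
REPLICATION = 3                 # the usual HDFS default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, racks, replication=REPLICATION):
    """Give each block `replication` distinct nodes (illustrative policy only).

    racks maps a rack name to its list of node names (hypothetical layout).
    """
    # Interleave nodes rack by rack so neighbours in the list tend to sit
    # on different racks.
    interleaved = []
    depth = max(len(nodes) for nodes in racks.values())
    for i in range(depth):
        for rack in sorted(racks):
            if i < len(racks[rack]):
                interleaved.append((rack, racks[rack][i]))
    if replication > len(interleaved):
        raise ValueError("not enough nodes for the requested replication")

    placement = {}
    for block in range(num_blocks):
        # Take `replication` consecutive nodes, starting at a rotating offset.
        placement[block] = [
            interleaved[(block + k) % len(interleaved)]
            for k in range(replication)
        ]
    return placement


# Hypothetical two-rack cluster and a 300 MB file.
cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), cluster)
for block_id, replicas in placement.items():
    print(block_id, blocks[block_id], replicas)
```

In this sketch, losing any single node or any single rack still leaves at least one copy of every block, which is the property the paragraph above describes.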
Hadoop Database (HBase): This system is modeled on Google's Bigtable. HBase was developed for tables that have billions of rows and millions of columns. With strong fault tolerance and horizontal scalability, this non-relational (NoSQL) database runs on top of HDFS. Because HDFS and MapReduce are geared toward batch processing, HBase is used to provide random, real-time read and write access to very large datasets.
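The access pattern can be illustrated with a small Python sketch. This is a toy in-memory model, not the HBase client API: the class name `SortedRowStore` and its methods are hypothetical. It only shows why keeping rows sorted by key allows both direct lookups by row key and ordered range scans.

```python
import bisect


class SortedRowStore:
    """Toy table that keeps row keys sorted (hypothetical, for illustration)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        """Write one cell; new row keys are inserted at their sorted position."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch a single row directly by its key."""
        return self._rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Range scan: yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedRowStore()
store.put("user#003", "info:name", "Chen")
store.put("user#001", "info:name", "Asha")
store.put("user#002", "info:name", "Bo")

print(store.get("user#002"))
print(list(store.scan("user#001", "user#003")))
```

The `family:qualifier` column names follow HBase's naming convention; everything else here is a simplification, since a real table is split into regions served by many machines.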
Hadoop YARN: YARN (Yet Another Resource Negotiator) is the cluster resource management layer introduced in Hadoop 2. It took over the resource management and job scheduling duties that MapReduce previously handled itself. It makes near real-time processing possible by letting in-memory engines such as Apache Spark run on the cluster, and it lets multiple processing frameworks, MapReduce among them, share a single cluster, which improves scalability through distributed application life-cycle management. Because it supports several frameworks, it is flexible enough for many use cases and removes the requirement that every job be written as MapReduce.
Apache Hive: Hive is an open-source data warehouse system that uses HiveQL, a SQL-like query language, for analytics. Hive translates HiveQL queries into Apache Tez, MapReduce, or Spark jobs. Hive was originally developed at Facebook; it is now used by many other companies as well.
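As a rough illustration of what that translation means, the Python sketch below hand-writes the map, shuffle, and reduce steps that a query like SELECT page, COUNT(*) FROM page_views GROUP BY page boils down to conceptually. The table, its rows, and the function names are made up for this example, and the plans Hive actually generates are considerably more elaborate.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, page).
page_views = [
    ("u1", "home"),
    ("u2", "home"),
    ("u1", "cart"),
    ("u3", "home"),
    ("u2", "cart"),
]


def map_phase(rows):
    """Emit (page, 1) for every row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per page."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(page_views)))
print(counts)
```

The appeal of Hive is that the analyst writes only the one-line query and leaves this kind of plumbing to the engine.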
There are other names on the list as well, like ZooKeeper and Mahout, each of which plays a role of its own. The interesting thing to note is how the Hadoop family is growing by leaps and bounds and, in the process, gradually transforming our relationship with big data.
What do you say?