Hadoop is an open
source framework that provides a distributed, scalable storage and processing
platform for large data volumes without format requirements on input. Hadoop is
managed by the Apache Software Foundation. It has its origins in Apache Nutch,
a project for an open source search engine. The Nutch Project realised that
Apache Nutch would have difficulties scaling webpages on the internet because
of the volume of webpages. In 2003, Google published a paper on its Google Distributed
Filesystem (GDFS) and the Nutch project identified that a similar system would
present a potential storage solution for the data that was being generated by
their open source search engine. In 2004, the Nutch Project developed Nutch
Distributed Filesystem (NDFS), an open source implementation of GDFS. In 2004, Google published a paper on
MapReduce and by the middle of that year, Nutch developed an open source
version of MapReduce. In 2006, NDFS and MapReduce moved out of Nutch to form
Hadoop. (White, 2012).
Hadoop has two components at its core, the Hadoop distributed file system
(HDFS) which is a storage cluster and the MapReduce engine, a programming model
for data processing. Hadoop has been widely adopted by organisation that
process significant amount of data and some of its use cases will be discussed
Hadoop is becoming the foundation of most big data architectures
and has gained significant adoption in areas where there are huge datasets. Organizations
can either deploy Hadoop and the required supporting software packages in their
local data centre. However, most big data projects depend on short-term use of
substantial computing resources and this use scenario normally utilises scalable cloud services.
The following are examples of some Hadoop
In Data Science, a field where machine learning, statistics,
advanced analysis and programming are combined, Hadoop is fast becoming a key
component of the infrastructure. Data Sciences is relatively new discipline
that draws out hidden insights from huge datasets and Hadoop has been adopted
as a core component for storage and processing of such huge data sets.
Hadoop is also used for data lake technologies in Finance and in
Commerce. A data lake is a shared data environment that comprises multiple
repositories and capitalises on big data technologies. Data lakes amalgamate
data from various sources normally within a single organisation. Hadoop data
lakes are utilised to combine data from multiple data sources and present a
single view of the customer. In Finance, Hadoop can also be used to analyse
credit risk, counterparty or third-party risk using data from multiple sources.
In such cases, Hadoop is used to perform simulations using huge volumes of data
and require parallel computing power, which you find in a typical Hadoop
Hadoop has also been adopted for stream computing where processing
is performed on data directly as it is produced or received before it is stored
enabling organisations to capture, process, ingest and analyse large volumes of
data at high speed have become increasingly important and Hadoop can be used in
such instances. This mainly used in
cases where there is data from sensors or from the Internet of Things (IoT).
In healthcare, Hadoop has been adopted to perform Genome
Processing and DNA Sequencing. Current architecture leverages storage area
networks and attached storage and high-performance computing which may result
in bottlenecks. MapReduce provides efficient storage and compute in a single
platform. Hadoop as a big data technology can also be used to customize
treatment for a patient to continuously monitor the effects of medication. Hadoop
can also aid diagnostics by combining patient data from across multiple data
Since its inception, Hadoop
has gone through various iterations with each iteration bring in new features.
The second iteration, Hadoop 2.x improved resource management and scheduling
that was available in Hadoop 1.x. The latest version is Hadoop 3.x which is considered
stable and of a quality that is production-ready and incorporates a number of
significant enhancements over the previous major release line”.
The changes that were introduced between the different versions
and some of the benefits they bring are summarised below.
Hadoop 1.x had one namespace for the whole cluster which was
managed by a single NameNode. With a single namenode, the HDFS cluster could be
scaled horizontally by adding datanodes but more namespace could not be added
to an existing cluster horizontally. The single Namenode was a single point of
failure and file system operations were limited to the throughput of a single
name node. Hadoop 2.x introduced cluster
federation which is the separation of the
namespace layer and storage layers. Hadoop federation allows
horizontal scaling and uses several NameNodes which are independent of each
In Hadoop1.x, the
processing engine and resource management capabilities of MapReduce are
combined and it a single JobTracker for thousands of TaskTrackers and MapReduce
tasks. It should also be noted that MapReduce had other drawbacks in that it
did not have other workflows such as join, filter, union, intersection etc. in
addition, functions had to read and write to disk before and after Map and
Reduce. Hadoop 2.0 introduced ‘Yet Another Resource negotiator (YARN) which
splits resource management and scheduling into separate tasks. YARN has a
central ResourceManager and an ApplicationMaster which is created for each
individual application allowing multiple applications to run
simultaneously. Hadoop 2.x has improved and
it is scalable up to 10,000 nodes and 400,000 tasks. Hadoop 2.0 introduced
‘Federation’ which allows multiple servers to manage namespaces thereby
allowing horizontal scaling and improve reliability. The Namenodes are now
independent and do not require coordination with each other. (Apache Hadoop, 2017).
Hadoop 1.x had a restricted batch processing model and it only
supported MapReduce processing. As a result, it could only be applied to a
limited set of tasks. YARN has the ability to run none MapReduce tasks inside
Hadoop allowing other applications such as streaming, graph etc. to run thereby
extending the number of tasks that can be performed in Hadoop 2.x compared to
Apache Hadoop 3.x
incorporates a number of significant enhancements over the previous major
release line” and a summary the key enhancements in Hadoop 3.x is as follows:
The initial implementation of HDFS NameNode provided for a single
active and a single standby namenode. Version 3.x allows multiple standby
NameNodes which increases the fault tolerance.
(Apache Hadoop, 2017)
Hadoop was originally developed to support UNIX, a significant
change in Hadoop 3.x has been the introduction of support for Microsoft Azure
Data Lake integration as an alternative Hadoop-compatible filesystem. (Apache Hadoop, 2017)
Hadoop 2.x supports replication for fault
tolerance. Hadoop 3.x offers support for
erasure encoding, a method storing data durably with significant space saving
compared to replication. It should be noted that erasure coding is mainly used
for remote data reading and it is suitable for data that is not accessed
frequently. The YARN resource model has been generalized to support
user-defined resource types GPUs, software licenses, or locally-attached
storage. (Apache Hadoop, 2017)
The changes discussed
above resulted in additional Hadoop use cases and some examples are as follows.
In 2015, Microsoft announced Microsoft
Azure Data Lake, a set of big data storage and analytics services that is built
to be part of Hadoop. Azure Data Lake utilises HDFS and YARN as key components.
(Microsoft , 2015).
The option to run multiple Namenodes in a cluster for example
running two redundant NameNodes in the same cluster in an Active/Passive
configuration with a hot standby allows fast failover to a new NameNode
resulting in high availability. (Apache Hadoop, 2017). Improved availability
and resilience means Hadoop can be extensively adopted by organisations which prioritise
these features for example in Healthcare or Financial Services.
Federation was introduced in Hadoop 2.x and can be used in multi-tenant
environments where a single Hadoop cluster is shared by multiple organisations
or teams within an organisation collaborating on a single project. Federation
allows the cluster to have several different NameNodes or namespaces which are
independent of each other.
The perceived advantage of Spark and it’s additional use cases.
Apache Spark’s supports
data analysis, machine learning, graphs, streaming data functionalities among
others. It can read/write from a range of data types and supports development
in multiple languages.
Spark’s key use case is
centred on its ability to process streaming data. Organisations are processing
significant amounts of data on a daily basis and there is an argument for streaming
and analysing such data in real time. This can be traced back to the rise of social media in
the mid-to-late 2000s. Companies like Facebook, Google and Twitter designed and
launched technology platforms that allow millions of users to share data simultaneously
and in near-real time and in most cases, Hadoop forms the backbone of these
technologies. (IBM , 2017). Spark streaming has
the capability to handle extra workload because it can unify disparate data
processing capabilities, allowing developers to use a single framework to different
Spark incorporates an
integrated framework for advanced analytics which allows users run repeated
queries on datasets. Spark’s scalable
Machine Learning Library (MLlib) is the component that provides this facility. As
a result, Spark is suitable for use on big data functions such as predictive
intelligence and customer segmentation among others.
also has interactive analytics capability. MapReduce was built to handle batch
processing, and SQL-on-Hadoop engines such as Hive or Pig are arguably slower
for interactive analysis. Spark is fast enough to perform exploratory queries
without sampling and it interfaces with a number of development languages.
they are considered to highly flexible, Spark’s in-memory capabilities are not
always one size fit all for all scenarios. For example Spark was not designed for
multi-user environment. Spark users need to know if the memory
they have access to is sufficient for the dataset they will be working on. Coordination
is required where more users are added to a project to ensure users do not
exceed the allocated memory. Inability to handle concurrency might mean that as
the number of users increases, projects may be required to utilise other
engines such as Apache Hive, for large, batch projects.
Spark is used in many notable business industries where companies gather
significant amounts of data from users and feedback with real-time interactions
such as video streaming and many other user interfaces.
should be noted that in Hadoop 1.x, the scheduling was tied with the
MapReduce and processing that was possible on HDFS data was the MapReduce
type. However, this is no longer the case due to changes that have been
implemented via the different iterations of Hadoop and other types of
processing are now available. The changes introduced through the different
iterations also resulted Microsoft Azure being supported in addition to UNIX. In
conclusion, it can be argued that Hadoop (HDFS, MapReduce) provides a reliable
tool for processing schema on read data and the changes being implemented will
continue to bring a paradigm shift in big data processing.