Hadoop is a free, Java-based
programming framework that supports the processing of large data sets in a distributed computing environment. It is part
of the Apache project sponsored by the Apache Software Foundation.
Hadoop was originally
conceived on the basis of Google's MapReduce, in which an application is broken down into numerous small parts. Any of these parts (also called fragments or
blocks) can be run on any node in the cluster. Hadoop makes it possible to run applications on systems with thousands of nodes involving
thousands of terabytes. A distributed file system (DFS) facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure.
This approach keeps the risk of catastrophic system failure low, even if a significant number of nodes become inoperative.
The Hadoop framework is used by major players
including Google, Yahoo and IBM, largely for applications involving search engines and
advertising. The preferred operating
systems are Windows and Linux but Hadoop can
also work with BSD and OS X. Hadoop was originally the name of a stuffed toy elephant belonging to a child of the framework's creator, Doug Cutting.
I'm sure you've heard about Big Data.
The best-known technology used for Big Data is Hadoop. Hadoop is used
by Yahoo, eBay, LinkedIn and Facebook. It was inspired by Google's publications on MapReduce, GoogleFS and BigTable.
Because Hadoop can be hosted on commodity hardware (usually Intel PCs running Linux with one or two CPUs and a few TB of HDD, without any
RAID replication technology), it allows these companies to store huge quantities of data (petabytes or even more) at very low cost (compared
to SAN bay systems).
The Hadoop “brand”
contains many different tools. Two of them are core parts of Hadoop:
Hadoop Distributed File System (HDFS) is
a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into
many smaller blocks, each of which is replicated and stored on (usually, though this can be customized) 3 servers for fault tolerance.
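As a concrete illustration, here is a minimal sketch of writing a file to HDFS through the Java API and requesting the usual replication factor of 3; the namenode address and paths are assumptions made for the example, not details from the original text.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to a (hypothetical) namenode of the cluster.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/events.log");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("first record\n");
    }

    // Ask HDFS to keep 3 copies of each block of this file (the usual default).
    fs.setReplication(file, (short) 3);
    fs.close();
  }
}
```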
Hadoop MapReduce is a way to split every request
into smaller requests which are sent to many small servers, allowing a truly scalable use of CPU power (describing MapReduce
would be worth a dedicated post).
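In short, a map phase lets each worker process one fragment of the input and emit intermediate key/value pairs, and a reduce phase aggregates all values that share a key. Here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; class and field names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: each mapper sees a slice of the input and emits
  // (word, 1) pairs independently of all the other mappers.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts for the same word end up on one
  // reducer, which simply sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```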
Some other components are often installed on Hadoop solutions:
HBase is inspired by Google's BigTable. It is a non-relational, scalable, and fault-tolerant database that is layered
on top of HDFS. HBase is written in Java. Each row is identified by a key and consists of an arbitrary number of columns that
can be grouped into column families.
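To make the row-key and column-family model concrete, here is a minimal sketch of writing and reading one row with the HBase Java client; the table name, column family and row key are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Each row is addressed by a key; columns live inside a column family.
      Put put = new Put(Bytes.toBytes("user#42"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      Get get = new Get(Bytes.toBytes("user#42"));
      Result result = table.get(get);
      byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```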
ZooKeeper is a centralized service
for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper
is used by HBase, and can be used by MapReduce programs.
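As an illustration of the configuration and naming use case, here is a minimal sketch that uses the ZooKeeper Java client to publish and read back a shared configuration value; the connect string and znode path are assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Session timeout of 3 seconds; the watcher is a no-op for brevity.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> { });

    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      // Store a small piece of configuration as the znode's data.
      zk.create(path, "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```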
Solr / Lucene as the
search engine. This query engine library has been developed by Apache for more than 10 years.
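For a flavour of what the library offers, here is a minimal sketch of indexing and searching a single document with Lucene's Java API (assuming a recent Lucene version); the index directory and field names are illustrative.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Index a single document with one full-text field.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "Hadoop stores data on commodity hardware", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Query the index for documents mentioning "hadoop".
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("hadoop"), 10);
      System.out.println("matches: " + hits.totalHits);
    }
  }
}
```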
Two languages are identified as original Hadoop languages: Pig and Hive.
You can use them to develop MapReduce processes at a higher level than raw MapReduce procedures. Other languages may be used as well,
such as C, Java or JAQL. Through JDBC or ODBC connectors (or directly in those
languages), SQL can be used too.
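For example, a Java program can submit HiveQL over JDBC using the HiveServer2 driver; in this minimal sketch the host, database, table and query are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "user", "");
         Statement stmt = con.createStatement()) {
      // HiveQL looks like SQL but is compiled into MapReduce jobs behind the scenes.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT page, COUNT(*) FROM weblogs GROUP BY page")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```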
Hadoop Internal Software Architecture
Even though the best-known Hadoop suite is provided
by a very specialized vendor named Cloudera, the big vendors are positioning themselves on Hadoop:
IBM has BigInsights (the Cloudera Hadoop distribution plus their own custom version of Hadoop) and has
recently acquired many niche players in the analytics and big data market (like Platform Computing, which has a product
enhancing the capabilities and performance of MapReduce).
Oracle has launched a BigData machine. Based on Cloudera Hadoop, this server is dedicated to the storage and use of
non-structured content (while structured content stays on Exadata).
Informatica has a tool called HParser to complement PowerCenter. This tool is built to launch Informatica processes in MapReduce
mode, distributed across the Hadoop servers.
Microsoft has a Hadoop
version for Microsoft Windows and for Azure, their cloud solution,
and a big native integration with SQL Server 2012.
Some very large database solutions like EMC Greenplum
(partnering with MapR), HP Vertica, Teradata Aster Data (partnering with HortonWorks) or SAP Sybase
IQ are able to connect directly to Hadoop storage.
Hadoop software market to hit $812.8 million in 2016, says IDC
IDC put the Hadoop-MapReduce ecosystem market at $77 million in 2011. That’ll change in a hurry.
The market for Hadoop and MapReduce related software will grow at a compound annual growth rate of more than 60 percent
through 2016, according to IDC data.
That $77 million sum sounds small given the focus on big data and the headlines
that go with it. But the financials will catch up with the attention: IDC is projecting a Hadoop-MapReduce market of $812.8
million in 2016.
Big data has received a lot of focus as companies aim to crunch structured and unstructured
data to see around corners. The growth of big data software could also pose a threat to database incumbents such as Oracle.
IDC expects the Hadoop-MapReduce market to develop like Linux did. Linux began with a lot of attention and a small
market and then grew to be commonplace in most data centers.
The one wild card for big data growth will
be the talent to crunch the figures as well as analyze them.
Hadoop is a set of open-source technologies
that support reliable and cost-efficient ways of dealing with large amounts of data. Given the vast amounts of business-critical
and legally required data that companies gather (e.g. required due to Sarbanes–Oxley (SOX) or the EU Data Retention Directive), Hadoop becomes increasingly relevant.
Several Hadoop technologies are inspired by Google’s infrastructure.
1. Processing and Storage
1.1 Processing – MapReduce MapReduce can be used to process and extract knowledge from arbitrary amounts of data, e.g.
web data, measurement data or financial transactions; Visa reduced their processing time for transactional statistics
from 1 month to 13 minutes with Hadoop. To use MapReduce, developers need to parallelize their problem and program against the MapReduce API (see here for an example of machine learning
with Hadoop). Hadoop's
MapReduce is inspired by the paper MapReduce: Simplified Data Processing on Large Clusters.
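As a concrete illustration of programming against that API, here is a minimal sketch of a job driver that wires a mapper and reducer (such as the word-count classes sketched earlier) into a job and submits it to the cluster; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Plug in the map and reduce phases; Hadoop handles splitting the input,
    // scheduling tasks across the cluster and shuffling intermediate data.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```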
1.2 File Storage – HDFS HDFS is a scalable, distributed file
system. It supports a configurable degree of replication for reliable storage even when running on cheap hardware. HDFS is inspired
by the paper The Google File System.
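One way to see that distribution at work is to ask HDFS where the blocks of a file live; this minimal sketch prints the datanodes hosting each block, with the file path made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

    // Each block is replicated on several datanodes; print the hosts per block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println(String.join(", ", block.getHosts()));
    }
  }
}
```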
1.3 Database – HBase HBase is a distributed database, running on top of HDFS, that supports
storing billions of rows with millions of columns. HBase can replace traditional databases when they
run into scaling problems or become too expensive license-wise (see this presentation about HBase). HBase is inspired by the paper Bigtable: A Distributed Storage System for Structured Data.
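Reading billions of rows is typically done with a scan rather than individual gets; here is a minimal sketch of scanning a column family with the HBase Java client, again with illustrative table and family names.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("users"))) {

      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("profile"));

      // Rows come back sorted by key and are streamed from the region servers.
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```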
2. Data Analysis
MapReduce can be
used to analyze all kinds of data (e.g. text, multimedia, numerical data) and offers high flexibility, but for more structured
data the following Hadoop technologies can be used:
2.1 Pig A SQL-like language/system running on
top of MapReduce. Pig was developed by Yahoo and is inspired by the paper Interpreting the Data: Parallel Analysis with Sawzall.
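Pig Latin scripts can also be driven from Java through the PigServer class; the following minimal sketch (with made-up input path, alias names and script) registers a few statements and stores the result, which Pig then compiles into MapReduce jobs.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // MAPREDUCE mode runs the script as MapReduce jobs on the cluster;
    // LOCAL mode is handy for testing on a single machine.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    pig.registerQuery("logs = LOAD '/data/weblogs' AS (page:chararray);");
    pig.registerQuery("grouped = GROUP logs BY page;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");

    // Materialize the result on HDFS.
    pig.store("counts", "/data/page_counts");
  }
}
```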
2.2 Hive A data warehouse running
on top of Hadoop, developed by Facebook. Its query language is very similar to SQL.
3. Distributed Systems Development
3.1 Avro Avro is used for efficient serialization of data and communication between services. It is in several ways
similar to Google's Protocol Buffers and Facebook's Thrift.
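As a small illustration of the serialization use case, here is a minimal sketch that writes one record to an Avro data file using the generic API; the schema and field names are invented for the example.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A record schema with two fields, defined inline for brevity.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"visits\",\"type\":\"long\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("visits", 42L);

    // The schema is stored with the data, so readers can evolve independently.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("users.avro"));
      writer.append(user);
    }
  }
}
```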
3.2 Zookeeper ZooKeeper provides coordination between distributed processes. It is inspired by the paper The Chubby lock service for loosely-coupled distributed systems.