Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.
Hadoop is used for distributed computing that can be used to query a large set of data and get the results faster using reliable and scalable architecture.
hadoop is a distributed architecture, both data and processing are distributed across multiple servers
In a non distributed architecture, the data stored in one server and any client program will access this central data server to retrieve the data. In a distributed model, you have to add more CPU and storage. This type of architecture is not reliable, as if the main server fails, you have to go back to the backup to restore the data. From performance point of view, this architecture will not provide the results faster when you are running a query against a huge data set.
In distributed architecture, each and every server offers local computation and storage. i.e When you run a query against a large data set, every server in this distributed architecture will be executing the query on its local machine against the local data set. Finally, the resultset from all this local servers are consolidated.
As say simple, instead of running a query on a single server, the query is split across multiple servers, and the results are consolidated. This means that the results of a query on a larger dataset are returned faster.
You don’t need a powerful server. Just use several less expensive commodity servers as hadoop individual nodes.
High fault-tolerance. If any of the nodes fails in the hadoop environment, it will still return the dataset properly, as hadoop takes care of replicating and distributing the data efficiently across the multiple nodes.
The Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce – a programming model for large scale data processing.
In the above diagram, there is one NameNode, and multiple DataNodes (servers). b1, b2, indicates data blocks.
NameNode – manages the file system metadata
Multiple DataNodes ( real cheap commodity servers) – stores the data blocks
When you execute a query from a client, it will reach out to the NameNode to get the file metadata information, and then it will reach out to the DataNodes to get the real data blocks
When you dump a file (or data) into the HDFS, it stores them in blocks on the various nodes in the hadoop cluster. HDFS creates several replication of the data blocks and distributes them accordingly in the cluster in way that will be reliable and can be retrieved faster. A typical HDFS block size is 128MB. Each and every data block is replicated to multiple nodes across the cluster. so any node failure will never results in a data loss.
Hadoop provides a command line interface for administrators to work on HDFS and NameNode comes with an in-built web server from where you can browse the HDFS filesystem and view some basic cluster statistics.
MapReduce is a parallel programming model that is used to retrieve the data from the Hadoop cluster. In this model, the library handles lot of messy details that programmers doesn’t need to worry about. For example, the library takes care of parallelization, fault tolerance, data distribution, load balancing, etc.
This splits the tasks and executes on the various nodes parallely, thus speeding up the computation and retriving required data from a huge dataset in a fast manner.
They have to just implement two functions: map and reduce The data are fed into the map function as key value pairs to produce intermediate key/value pairs
Once the mapping is done, all the intermediate results from various nodes are reduced to create the final output
JobTracker keeps track of all the MapReduces jobs that are running on various nodes. This schedules the jobs, keeps track of all the map and reduce jobs running across the nodes. If any one of those jobs fails, it reallocates the job to another node, etc. In simple terms, JobTracker is responsible for making sure that the query on a huge dataset runs successfully and the data is returned to the client in a reliable manner.
TaskTracker performs the map and reduce tasks that are assigned by the JobTracker. TaskTracker also constantly sends a hearbeat message to JobTracker, which helps JobTracker to decide whether to delegate a new task to this particular node or not.
YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.
For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework, which handles resource management and job scheduling on Hadoop systems and supports the parsing and condensing of data sets in parallel.
YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes. Running on commodity hardware clusters, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytics applications.
Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can’t wait for batch jobs to finish.
Other Hadoop-related projects at Apache include:
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually alongwith features to diagnose their performance characteristics in a user-friendly manner.
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A Scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper: A high-performance coordination service for distributed applications.
It’s just only the basic of hadoop ,for more information visit their officel site hadoop.apache.org
In coming posts, I’ll explain how to install and configure Hadoop environment, and how to write MapReduce programs to effectively retrieve the data.