Hadoop 1.0 (which grew out of the 0.20.2 line) was finally released in December 2011 [1]; see [2] for a brief history of Hadoop.
This is also the most stable version to date. A new version, 0.23 (also referred to as 2.0), is under development [3]. It introduces many new features, the most important of which are:
- HDFS Federation
- NextGen MapReduce (YARN)
HDFS Federation
Currently, HDFS consists of two layers:
- Namespace
- Consists of directories, files and blocks
- It supports all the namespace related file system operations such as create, delete, modify and list files and directories.
- Block Storage Service, which has two parts:
- Block management (which is done in namenode)
- Provides datanode cluster membership by handling registrations, and periodic heart beats.
- Processes block reports and maintains location of blocks.
- Supports block related operations such as create, delete, modify and get block location.
- Manages replica placement and replication of a block for under replicated blocks and deletes blocks that are over replicated.
- Storage, provided by datanodes that store blocks on the local file system and allow read/write access to them.
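The two-layer split above can be made concrete with a toy in-memory model. This is only an illustration of the separation between the namespace and block management; the class and method names are mine, not Hadoop's actual API:

```python
# Toy model of the two HDFS layers: a namespace (paths -> block lists)
# and a block manager (block -> datanode locations, replication checks).
# All names are illustrative, not Hadoop's real classes.

class BlockManager:
    """Block management layer: tracks which datanodes hold which blocks."""
    def __init__(self, replication=3):
        self.replication = replication
        self.block_locations = {}   # block_id -> set of datanode names

    def add_replica(self, block_id, datanode):
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def under_replicated(self):
        # Blocks with fewer replicas than the target replication factor.
        return [b for b, nodes in self.block_locations.items()
                if len(nodes) < self.replication]

class Namespace:
    """Namespace layer: maps file paths to ordered lists of block ids."""
    def __init__(self):
        self.files = {}             # path -> [block_id, ...]

    def create(self, path, block_ids):
        self.files[path] = list(block_ids)

    def list_files(self):
        return sorted(self.files)

# Usage: one file of two blocks; blk_2 has too few replicas.
ns, bm = Namespace(), BlockManager(replication=2)
ns.create("/data/log.txt", ["blk_1", "blk_2"])
bm.add_replica("blk_1", "dn1"); bm.add_replica("blk_1", "dn2")
bm.add_replica("blk_2", "dn1")
print(ns.list_files())        # ['/data/log.txt']
print(bm.under_replicated())  # ['blk_2']
```

Note how the namespace never needs to know where a block physically lives; that is exactly the boundary Federation later exploits.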
Put simply, the namenode implements the file system of a Hadoop cluster, and the basic unit it operates on is the block, usually 64 MB or 128 MB. Datanodes provide the mapping from these abstract blocks to files on the underlying operating system (such as Linux).
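The block-size arithmetic is straightforward: a file is cut into fixed-size blocks, and the last block may be partial. A quick sketch (64 MB used here as the assumed default):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a common HDFS default

def num_blocks(file_size, block_size=BLOCK_SIZE):
    # A file occupies ceil(size / block_size) blocks; even an empty
    # file is represented by at least one block entry.
    return max(1, math.ceil(file_size / block_size))

print(num_blocks(200 * 1024 * 1024))  # a 200 MB file -> 4 blocks
```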
The biggest problem with this design is that while physical storage scales out easily by adding datanodes, there is only one namenode, which is both a single point of failure and a performance bottleneck.
The new design tries to solve this problem:
This is a fairly natural design; in fact, you could well have come up with it yourself. It introduces a new concept: the block pool, which is a collection of blocks. The blocks in a pool can come from different datanodes; each pool is managed independently and belongs to exactly one namenode.
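The relationships in Federation can be sketched with plain data structures: several namenodes, each owning one block pool, while a single datanode may store blocks from several pools. The identifiers below are made up for illustration:

```python
# Sketch of HDFS Federation: each namenode owns exactly one block pool,
# and one datanode can hold blocks belonging to multiple pools.
# All identifiers (nn1, BP-1, blk_1, ...) are illustrative.

block_pools = {
    "nn1": {"BP-1:blk_1", "BP-1:blk_2"},   # pool owned by namenode nn1
    "nn2": {"BP-2:blk_7"},                 # pool owned by namenode nn2
}

# One datanode's local storage, partitioned by block pool.
datanode_storage = {
    "dn1": {"BP-1": {"BP-1:blk_1"}, "BP-2": {"BP-2:blk_7"}},
}

# Invert the mapping: every block belongs to exactly one namenode.
owners = {blk: nn for nn, pool in block_pools.items() for blk in pool}
print(owners["BP-2:blk_7"])  # nn2
```

Because pools are independent, nn1 can be taken down or upgraded without touching the blocks that nn2 manages, even when they sit on the same datanode.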
Apache Hadoop NextGen MapReduce (YARN)
The new version is called MapReduce 2.0 (MRv2), or YARN. The core idea of MRv2 is to split the two main functions of the JobTracker, resource management and job scheduling/monitoring, into two independent daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application can be a classic Map-Reduce job or a directed acyclic graph (DAG) of tasks.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager. The Scheduler allocates resources; it is a pure scheduler and does not monitor task execution. Resource scheduling is based on the container, an abstraction over memory, CPU, disk, and network; currently only memory is supported. The ApplicationsManager is responsible for accepting job submissions, negotiating the first container in which to run an application's ApplicationMaster, and restarting the ApplicationMaster container if it fails.
The NodeManager is the per-machine framework agent. It manages containers, monitors their resource usage, and reports to the ResourceManager/Scheduler.
The ApplicationMaster of each application is responsible for negotiating resources and for monitoring the application's execution and progress.
Note that an ApplicationMaster can use containers on other machines, which allows higher resource utilization and more parallel execution.
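A toy sketch of the container model described above: the Scheduler hands out containers based on a single resource dimension, memory, matching the note that only memory is supported at this point. This is my own simplification, not YARN's actual scheduler logic:

```python
# Toy memory-only container scheduler in the spirit of YARN's Scheduler:
# it only allocates resources and does not track what runs inside a
# container. Names and the allocation policy are illustrative.

class Scheduler:
    def __init__(self, node_memory_mb):
        # Free memory per NodeManager, e.g. {"node1": 8192}.
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        """Return (node, container) if some node has enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return node, {"memory_mb": memory_mb}
        return None  # request cannot be satisfied right now

sched = Scheduler({"node1": 2048, "node2": 4096})
print(sched.allocate(3072))  # ('node2', {'memory_mb': 3072})
print(sched.allocate(3072))  # None: no node has 3072 MB free any more
```

Because any node with enough free memory can be chosen, an ApplicationMaster on node1 may end up running its tasks in containers on node2, which is exactly the cross-machine utilization mentioned above.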
As an aside, Hadoop 1.0 also introduces a new feature: WebHDFS, an HTTP REST API for HDFS [4]. The advantage is obvious: HDFS can be accessed directly, without a Hadoop client environment, which also makes things like cross-data-center data access easy to implement.
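To make this concrete, WebHDFS requests are plain HTTP URLs of the form `/webhdfs/v1/<path>?op=<OPERATION>` against the namenode's HTTP port [4]. A small sketch that only builds such URLs (the host, port, and user name below are placeholders):

```python
# Build WebHDFS REST URLs. The /webhdfs/v1 prefix and the op= query
# parameter follow the documented URL scheme [4]; host, port, and
# user name here are illustrative placeholders.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# e.g. read a file over plain HTTP, no Hadoop client needed:
url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt",
                  "OPEN", **{"user.name": "alice"})
print(url)
# http://namenode.example.com:50070/webhdfs/v1/user/alice/data.txt?op=OPEN&user.name=alice
```

Any HTTP client (curl, a browser, a script in another data center) can then issue the request, which is what makes WebHDFS attractive for access from outside the cluster.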
References:
[1] http://hadoop.apache.org/common/docs/r1.0.0/releasenotes.html
[2] http://hortonworks.com/apache-hadoop-is-here/
[3] http://hadoop.apache.org/common/docs/r0.23.0/index.html
[4] http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html