The structure of Hadoop: HDFS

Source: Internet
Author: User

Before using a tool, one should gain a deep understanding of its mechanisms and composition in order to use it well. Here's a look at what HDFS is and what its architecture looks like.

1. What is HDFS?

Hadoop is mainly used for big data processing, so how can large-scale data be stored effectively? Obviously, keeping the data on a single centralized physical server is unrealistic: its capacity and data transfer speed would become bottlenecks. Storing massive data inevitably requires tens, hundreds, or even more distributed service nodes. To manage the data stored on these nodes uniformly, a special kind of file system is needed: a distributed file system. HDFS (Hadoop Distributed File System) is the distributed file system provided by Hadoop.

HDFS offers large-scale distributed data storage, high-concurrency access, strong fault tolerance, sequential file access, a simple consistency model (write once, read many), and a block-based storage mode.
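The block-based storage mode mentioned above can be illustrated with a toy sketch (not real HDFS code): a file is stored as a sequence of fixed-size blocks. The block size here is a hypothetical small value for demonstration; real HDFS defaults to 128 MB.

```python
# Toy sketch of HDFS-style block storage: a file's bytes are partitioned
# into consecutive fixed-size blocks. The 128-byte block size is purely
# illustrative (real HDFS uses 128 MB by default).

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Partition data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"x" * 300          # a 300-byte "file"
blocks = split_into_blocks(file_bytes, 128)
print([len(b) for b in blocks])  # → [128, 128, 44]
```

Only the last block may be shorter than the block size, which is why HDFS handles small files and file tails efficiently without padding.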

2. Basic framework of HDFS

2.1 Architecture

HDFS runs in master-slave mode with two types of nodes: one NameNode (the master) and multiple DataNodes (the slaves), as shown in the framework diagram:

2.2 NameNode, DataNode, JobTracker and TaskTracker

    1. NameNode. The NameNode is the master server that manages the namespace and metadata of the entire file system and handles file access requests from outside.

The NameNode saves three types of metadata for the file system:

      • The namespace: the directory structure of the entire distributed file system;
      • A mapping table between file names and data blocks;
      • The location of each replica of every data block (3 replicas per block by default).
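The three kinds of metadata above can be sketched as a toy in-memory structure (a hypothetical simplification, not HDFS source code): a namespace, a file-to-blocks mapping, and a block-to-replica-locations mapping with 3 replicas per block.

```python
# Toy sketch of the NameNode's three metadata tables. All names, addresses,
# and the round-robin replica placement are hypothetical; real HDFS uses
# rack-aware placement.

REPLICATION = 3  # default replica count per block

class ToyNameNode:
    def __init__(self):
        self.namespace = set()        # file/directory paths
        self.file_blocks = {}         # file path -> [block ids]
        self.block_locations = {}     # block id -> [DataNode addresses]

    def add_file(self, path, block_ids, datanodes):
        self.namespace.add(path)
        self.file_blocks[path] = block_ids
        for i, block in enumerate(block_ids):
            # place REPLICATION replicas on distinct nodes (round-robin sketch)
            self.block_locations[block] = [
                datanodes[(i + k) % len(datanodes)] for k in range(REPLICATION)
            ]

nn = ToyNameNode()
nn.add_file("/user/data.txt", ["blk_1", "blk_2"],
            ["dn1:50010", "dn2:50010", "dn3:50010", "dn4:50010"])
print(nn.block_locations["blk_1"])  # → ['dn1:50010', 'dn2:50010', 'dn3:50010']
```

Because the NameNode holds only metadata while DataNodes hold the actual blocks, a client needs one metadata lookup before streaming data directly from the DataNodes.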

    2. DataNode. HDFS presents users with a namespace in which their data is stored as files, but internally a file may be partitioned into data blocks, and it is the DataNodes that actually store and manage these blocks.

    3. JobTracker and TaskTracker. The JobTracker corresponds to the NameNode, and the TaskTracker corresponds to the DataNode (as shown). NameNode and DataNode handle data storage, while JobTracker and TaskTracker handle the execution of MapReduce jobs.

2.3 Basic file access process of HDFS

    1. The user's application sends the file name to the NameNode via the HDFS client program;
    2. After receiving the file name, the NameNode looks up the corresponding data blocks in the HDFS directory, uses the block information to find the addresses of the DataNodes holding those blocks, and sends these addresses back to the client;
    3. After the client receives the DataNode addresses, it performs data transfer operations in parallel with those DataNodes and reports the results of the operations back to the NameNode.
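The read path above can be sketched as a toy client (all addresses and block contents are hypothetical stand-ins; a real client opens sockets to the DataNodes):

```python
# Toy sketch of the HDFS read path: the client gets block -> DataNode
# addresses from the NameNode (step 2), then fetches blocks in parallel
# from the DataNodes (step 3). Dictionaries stand in for the network.
from concurrent.futures import ThreadPoolExecutor

# what the NameNode would return for a file name (hypothetical)
block_addresses = {"blk_1": "dn1:50010", "blk_2": "dn2:50010"}

# stand-in for the blocks stored on each DataNode (hypothetical)
datanode_storage = {("dn1:50010", "blk_1"): b"hello ",
                    ("dn2:50010", "blk_2"): b"world"}

def fetch_block(item):
    block_id, address = item
    # a real client would open a socket to `address` here
    return datanode_storage[(address, block_id)]

with ThreadPoolExecutor() as pool:       # parallel transfer, as in step 3
    parts = list(pool.map(fetch_block, block_addresses.items()))

print(b"".join(parts))  # → b'hello world'
```

The key design point is that file data never flows through the NameNode; it only answers the metadata lookup, so it does not become a bandwidth bottleneck.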

2.4 MapReduce Execution Process

    1. Through the JobClient class, the client packages the configured parameters into a jar file, stores it in HDFS, and submits the path to the JobTracker; the JobTracker then creates each task (that is, the map tasks and reduce tasks) and distributes them to the individual TaskTracker services for execution;
    2. The JobTracker is a master service; after it starts, it receives jobs, is responsible for scheduling each subtask of a job, and monitors them, rerunning any task that fails;
    3. The TaskTracker is a slave service that runs on multiple nodes (the DataNode nodes of HDFS); it actively communicates with the JobTracker, receives tasks, and is responsible for executing each one.
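The flow above can be sketched as a toy single-process word count (a hypothetical stand-in: here the "scheduling" is just a loop, whereas a real JobTracker distributes the map and reduce tasks to TaskTrackers on many nodes):

```python
# Toy MapReduce sketch: map tasks emit (key, value) pairs per input split,
# a shuffle groups values by key, and reduce tasks aggregate each group.
from collections import defaultdict

def map_task(chunk):                       # one map task per input split
    return [(word, 1) for word in chunk.split()]

def reduce_task(key, values):              # one reduce call per key
    return key, sum(values)

splits = ["big data big", "data hadoop"]   # input splits (e.g. HDFS blocks)

intermediate = defaultdict(list)
for split in splits:                       # JobTracker "schedules" map tasks
    for key, value in map_task(split):
        intermediate[key].append(value)    # shuffle: group values by key

result = dict(reduce_task(k, v) for k, v in intermediate.items())
print(result)  # → {'big': 2, 'data': 2, 'hadoop': 1}
```

Because each map task reads one split and each reduce task handles one key group, both phases parallelize naturally across TaskTrackers.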

2.5 SecondaryNameNode

The SecondaryNameNode is used in Hadoop to back up the NameNode's metadata so that the system can be recovered when the NameNode fails. It acts as a copy of the NameNode: it does not handle any requests itself, but periodically checkpoints the NameNode's metadata.
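The checkpointing idea can be sketched as follows (a hypothetical simplification: in real HDFS the saved namespace image is the fsimage file and the pending operations are the edit log, both with on-disk binary formats):

```python
# Toy sketch of a metadata checkpoint: merge the last saved namespace
# image with the log of operations made since, producing a fresh image
# so that recovery need not replay a long log. Structures are hypothetical.

fsimage = {"/user": "dir"}                     # last saved namespace image
edit_log = [("add", "/user/a.txt", "file"),    # operations since that image
            ("add", "/user/b.txt", "file"),
            ("delete", "/user/a.txt", None)]

def checkpoint(image, log):
    """Apply the edit log to the image, yielding an up-to-date image."""
    merged = dict(image)
    for op, path, kind in log:
        if op == "add":
            merged[path] = kind
        elif op == "delete":
            merged.pop(path, None)
    return merged

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # → {'/user': 'dir', '/user/b.txt': 'file'}
edit_log.clear()    # the log can be truncated after a successful checkpoint
```

Off-loading this merge to a second machine keeps the NameNode responsive and leaves a recent metadata copy available if the NameNode's disk is lost.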


