Hadoop HDFS and MapReduce

HDFS

HDFS is a highly fault-tolerant distributed file system designed to run on inexpensive machines. It has the following features:

1) Suitable for storing very large files.

2) Suitable for streaming data access, that is, the "write once, read many times" processing pattern.

3) Suitable for deployment on inexpensive machines.

However, HDFS is not suitable for the following scenarios (every technology has two sides; only a technology that fits your own business is truly a good technology):

1) It is not suitable for storing a large number of small files, because the amount of file metadata it can hold is limited by the NameNode's memory.

2) It is not suitable for real-time data access: high throughput and low latency are at odds, and HDFS chooses the former.

3) It is not suitable for scenarios where data must be modified frequently.
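As a concrete illustration of the "write once, read many times" pattern above, here is a minimal sketch using the Hadoop FileSystem Java API; the cluster URI and file path are placeholder assumptions, not values from the original article:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; hard-coded here only for illustration.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log");

        // Write the file once...
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
        }

        // ...then read it, potentially many times and from many clients.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}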




HDFS adopts a Master/Slave architecture, which consists of the following four parts:

1. Client

2. NameNode

The entire HDFS cluster has only one NameNode, which stores the metadata of all files in the cluster. This metadata is persisted on the local disk as the fsimage and editlog files. Using the metadata, a client can locate the blocks of the file it wants. In addition, the NameNode monitors the health of the DataNodes: once a DataNode is found to have failed, it is removed from the cluster and the blocks it held are re-replicated to other DataNodes.
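To make the NameNode's role concrete, the following sketch asks for the block locations of a file through the FileSystem API; this location information is metadata served by the NameNode, not by the DataNodes. The file path is a placeholder assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

        // The NameNode answers this metadata query: which DataNodes hold each block?
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}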

3. Secondary NameNode

The Secondary NameNode is responsible for periodically merging the NameNode's fsimage and editlog. Note that it is not a hot standby for the NameNode, so the NameNode remains a single point of failure. Its main purpose is to take over part of the NameNode's work (in particular the memory-heavy checkpoint merge, because memory is a precious resource on the NameNode).
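As a hedged illustration, the checkpoint behaviour described above is typically controlled by settings such as the following in hdfs-site.xml (the property names shown are the Hadoop 2.x+ names; older releases used fs.checkpoint.period, and the values here are only examples):

<!-- hdfs-site.xml: checkpoint settings consulted by the Secondary NameNode.
     Hadoop 2.x+ property names; values are only examples. -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>    <!-- merge the editlog into fsimage at least once per hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or earlier, once this many uncheckpointed transactions accumulate -->
</property>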

4. DataNode

DataNodes are responsible for the actual storage of data. When a file is uploaded to an HDFS cluster, it is split into blocks and distributed across the DataNodes. To ensure reliability, each block is written to multiple DataNodes (3 replicas by default).
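The replication factor can also be adjusted per file from the client side; below is a minimal sketch using FileSystem.setReplication, with an assumed file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");

        // Keep each block of this file on 3 DataNodes (3 is also the usual
        // cluster-wide default, controlled by dfs.replication).
        fs.setReplication(file, (short) 3);

        System.out.println("replication factor = "
                + fs.getFileStatus(file).getReplication());
    }
}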




MapReduce

Like HDFS, MapReduce adopts a Master/Slave architecture.



It consists of the following four parts:

1) Client

2) JobTracker

The JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs, and if a task fails, it moves the task to another node. Meanwhile, the JobTracker tracks task execution progress, resource usage, and other information, and passes this information to the task scheduler; when resources become idle, the scheduler picks appropriate tasks to use them. In Hadoop, the task scheduler is a pluggable module, so you can plug in a scheduler suited to your own needs.
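For example, under the classic MRv1 framework described here, the scheduler is selected with a configuration property like the following in mapred-site.xml (the Fair Scheduler class shipped with Hadoop 1.x is shown only as an illustrative choice):

<!-- mapred-site.xml (classic MRv1): select a pluggable task scheduler. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>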

3) TaskTracker

The TaskTracker periodically reports the resource usage and task progress on its node to the JobTracker through heartbeats, receives commands from the JobTracker, and executes the corresponding operations (such as starting a new task or killing a task). The TaskTracker divides the resources on its node (CPU, memory, and so on) into units called "slots". A task can run only after it has been granted a slot, and the Hadoop scheduler assigns idle slots on each TaskTracker to tasks. Slots are divided into map slots and reduce slots, used by map tasks and reduce tasks respectively. The TaskTracker limits task concurrency through the number of slots, which is a configurable parameter.
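In MRv1, the number of map and reduce slots per TaskTracker is set with properties such as the following in mapred-site.xml; the values below are illustrative and should be tuned to the node's CPU and memory:

<!-- mapred-site.xml (classic MRv1): map and reduce slots per TaskTracker. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>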

4) Task

Tasks are divided into map tasks and reduce tasks, both of which are started by the TaskTracker. HDFS stores data in fixed-size blocks as its basic unit, whereas the processing unit of MapReduce is the split. A split is a logical concept that contains only metadata, such as the starting position, the length, and the nodes where the data resides. How the input is divided into splits is determined by the user (through the InputFormat). Note, however, that the number of splits determines the number of map tasks, because each split is processed by exactly one map task. In general, one split may correspond to a single block or span several blocks.
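As a sketch of how the split count (and therefore the map task count) can be influenced with the newer org.apache.hadoop.mapreduce API, the maximum split size can be capped when setting up a job; the job name and input path are placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizingDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-sizing-demo");
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Cap each split at 64 MB: a 1 GB input would then yield roughly
        // 16 splits, and therefore roughly 16 map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}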



A map task executes as follows: it first parses its split into key/value pairs and calls the user-defined map() function on each pair in turn; the intermediate results are written to the local disk. This intermediate data is divided into several partitions, and each partition is processed by one reduce task.
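A minimal example of such a user-defined map() function, using the familiar word-count logic purely as an illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token; the framework partitions these
        // intermediate pairs so that each partition goes to one reduce task.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}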



The execution of a reduce task is divided into three phases:

① Read the intermediate results of the map tasks from remote nodes (the "shuffle" phase);

② Sort the key/value pairs by key (the "sort" phase);

③ Read each <key, value list> group in turn, call the user-defined reduce() function on it, and save the final result to HDFS (the "reduce" phase).
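And a matching minimal user-defined reduce() function, continuing the word-count illustration; by the time reduce() is called, the shuffle and sort phases have already grouped the values for each key:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum the grouped values for this key; the framework writes the
        // output of context.write() back to HDFS.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}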

