Hadoop: The Definitive Guide (Learning Notes)


Chapter 1: Meet Hadoop

More data usually beats better algorithms.

I. Data storage and analysis

To read from and write to many disks in parallel, two problems have to be solved:

1. Hardware failure: once many pieces of hardware are in use, the chance that at least one of them fails becomes high; the way to avoid losing data is to keep redundant copies (backups) of it.

RAID (Redundant Array of Independent Disks) is one implementation of this data-redundancy principle.

Hadoop's file system, HDFS, belongs to the same class, although it takes a different approach: it keeps redundant copies (replicas) of the data on other machines.

2. Most analysis tasks need to combine the data in some way; data read from one disk may need to be used together with data read from any of the other 99 disks.


Hadoop provides a reliable shared storage and analysis system: HDFS provides the storage and MapReduce provides the analysis; together they are its core.


II. MapReduce is a batch query processor: it can run an ad hoc query against the entire data set and return the result within a reasonable time.


III. Why not use a database with lots of disks for large-scale batch analysis? Why is MapReduce needed?

The reason: seek time is improving much more slowly than transfer rate.

Seek time is the time it takes the disk head to move to a particular location to read or write data; it is the main source of latency in disk operations. Transfer rate corresponds to a disk's bandwidth.



MapReduce suits batch jobs that need to analyze the whole data set, that is, applications where data is written once and read many times.

An RDBMS is good for point queries and updates, and is better suited to data sets that are continually updated.

MapReduce scales to very large data sets, whereas an RDBMS typically works best on smaller data sets.


Structured data is data organized according to a predefined format (a schema); unstructured data has no particular internal structure.

MapReduce works well on unstructured or semi-structured data, because the keys and values it operates on are not intrinsic properties of the data: they are chosen by the person analyzing the data.


IV. Grid computing

High-performance and grid computing distribute jobs across the machines of a cluster, which access shared files held on a storage area network (SAN). This works well for compute-intensive jobs, but once the volume of data to be accessed grows, network bandwidth becomes the bottleneck and many compute nodes sit idle.

MapReduce tries to store the data on the compute nodes, so that tasks can access it locally.

Data locality is a core feature of MapReduce and the reason for its good performance.

Network bandwidth is the most precious resource in a data center environment.

MapReduce uses a shared-nothing architecture: tasks are independent of one another, so the framework can detect a failed map or reduce task and reschedule it on a healthy machine.




Chapter 2: MapReduce

MapReduce is a programming model for data processing. MapReduce programs can be written in several languages (Java, Ruby, Python, and C++), and its strength lies in processing massive data sets.


I. The map phase and the reduce phase

1. A MapReduce job is divided into two processing phases: the map phase and the reduce phase. Each phase takes key/value pairs as input and output, and the programmer chooses their types. The programmer also defines two functions: the map function and the reduce function.


2. Input data is fed to the map phase; the MapReduce framework then processes the map output, sorting and grouping it by key, before passing it to the reduce function.
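
A minimal word-count sketch of the two user-defined functions, assuming the newer org.apache.hadoop.mapreduce API (the class names TokenMapper and SumReducer are illustrative, not taken from these notes or from the book's weather example):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountFunctions {

      // map: (line offset, line text) -> (word, 1) for every word in the line
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);   // emit an intermediate key/value pair
            }
          }
        }
      }

      // reduce: (word, [1, 1, ...]) -> (word, total count)
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));  // final output key/value pair
        }
      }
    }

The framework guarantees that all values for a given key are grouped together and presented to a single reduce() call, which is the sorting and grouping step described above.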


II. Data flow

A MapReduce job is the unit of work the client wants performed: it consists of the input data, the MapReduce program, and configuration information.

Hadoop divides the job into tasks, of which there are two types: map tasks and reduce tasks.
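
A hedged sketch of what those three ingredients look like as a driver program, reusing the TokenMapper and SumReducer sketched above (the class name WordCountDriver is assumed; Job.getInstance is the newer factory method, older releases construct the Job directly):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // configuration information
        Job job = Job.getInstance(conf, "word count");     // the unit of work
        job.setJarByClass(WordCountDriver.class);

        // the MapReduce program: map and reduce functions
        job.setMapperClass(WordCountFunctions.TokenMapper.class);
        job.setReducerClass(WordCountFunctions.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // the input data and the output location
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }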


Hadoop's MapReduce divides the input data into fixed-size pieces called input splits. Hadoop creates one map task for each split, and that task runs the user-defined map function over each record in the split.

1. Having many splits means each split takes much less time to process than the whole input would. If the splits are processed in parallel and each split is small, the whole job is better load-balanced.

2. If the splits are too small, however, the overhead of managing splits and creating map tasks starts to dominate the total job execution time. For most jobs a good split size is the size of one HDFS block, 64 MB by default, although this default can be changed for the whole cluster, or specified per file when the file is created (see the sketch below).
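
A hedged sketch of both options, assuming the dfs.blocksize property (older releases use dfs.block.size) and the FileSystem.create overload that takes a block size; the path and the 128 MB figure are just examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for files created with this configuration
        // (128 MB here, instead of the old 64 MB default).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // The block size can also be given per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/user/example/data.txt"),
            true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("hello\n");
        out.close();
        fs.close();
      }
    }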


3. Hadoop does its best to run each map task on the node where the input data (the HDFS data) resides, to get the best performance; this is known as the data locality optimization.

Now it should be clear why the optimal split size is the same as the block size: it is the largest amount of input that is guaranteed to be stored on a single node. If a split spanned two blocks, it would be unlikely that any single HDFS node stored both blocks, so part of the split would have to be transferred over the network to the node running the map task, which is clearly less efficient than running the whole map task on local data.


4. Why does the map task write its output to the local disk rather than to HDFS?

Because the map output is an intermediate result: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away, so storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, Hadoop simply reruns the map task on another node to recreate the map output.


5. Reduce tasks do not have the data locality advantage: the input to a single reduce task is normally the output from all the mappers.


6. The sorted map output has to be transferred across the network to the node running the reduce task, where it is merged and then passed to the user-defined reduce function.


The reduce output is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node and the other replicas are stored on off-rack nodes, so writing the reduce output does consume network bandwidth, but only as much as a normal HDFS pipelined write.


7. The number of reduce tasks is not governed by the size of the input; it is specified independently.
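
For example, with the Job object from the driver sketch above, the number of reducers is set explicitly (the value 4 is arbitrary):

    job.setNumReduceTasks(4);   // run four reduce tasks, regardless of input size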


8. When there are multiple reduce tasks, each map task partitions its output, creating one partition per reduce task. There can be many keys (and their associated values) in each partition, but all records for a given key end up in the same partition. Partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works well.
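
The default hash-based scheme effectively does the following (a sketch in the spirit of Hadoop's HashPartitioner; the class name HashLikePartitioner is illustrative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Hash-based partitioning: records with the same key always land in the
    // same partition, i.e. they all go to the same reduce task.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

A custom partitioner would be plugged into the job with job.setPartitionerClass(HashLikePartitioner.class).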


9. To minimize the data transferred between map tasks and reduce tasks, Hadoop lets the user specify a combiner function to be run on the map output; the output of the combiner function becomes the input to the reduce function.
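
For a sum-style reduce function such as the SumReducer sketched earlier, the reducer class itself can often be reused as the combiner (a hedged example; this is only valid when the reduce function is commutative and associative):

    job.setCombinerClass(WordCountFunctions.SumReducer.class);  // local aggregation of map output
    job.setReducerClass(WordCountFunctions.SumReducer.class);   // final aggregation at the reducers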


10. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and the user's program, so map and reduce functions can be written in any language that can read standard input and write standard output.




Chapter 3: The Hadoop Distributed Filesystem

When a data set outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems.


HDFS: the Hadoop Distributed Filesystem.


I. The design of HDFS

HDFS is designed for storing very large files with streaming data access, running on clusters of commodity hardware.

1. Very large files: at least hundreds of megabytes in size.

2. Streaming data access: the most efficient access pattern is write once, read many times. A data set is typically generated or copied from a source and then analyzed over a long period; each analysis involves a large proportion, if not all, of the data set. So the time to read the whole data set matters more than the latency of reading the first record.

3. Low-latency data access: applications that require low-latency access to data are not a good fit for HDFS; HDFS is optimized for high data throughput.

4. Lots of small files: because the namenode holds the filesystem metadata in memory, the total number of files a filesystem can hold is limited by the amount of memory on the namenode (as a rule of thumb from the book, each file, directory, and block takes roughly 150 bytes of namenode memory, so a million files, each occupying one block, need on the order of 300 MB).

5. Multiple writers, arbitrary file modifications: a file in HDFS has a single writer, and writes are always made at the end of the file; multiple writers and modifications at arbitrary offsets in the file are not supported.


II. HDFS concepts

1. Data block

Every disk has a default block size, which is the smallest unit of data that can be read or written. A filesystem built on a disk manages its data in filesystem blocks, which are an integral multiple of the disk block size.

Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.


HDFS also has the concept of a block, but it is a much larger unit: 64 MB by default. As with a filesystem on a single disk, files in HDFS are broken into block-sized chunks that are stored as independent units. Unlike a filesystem on a single disk, however, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.



2. Why are HDFS blocks so large?

HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. If the block is large enough, the time spent transferring the data from the disk can be significantly longer than the time spent seeking to the start of the block, so the time to transfer a large file made up of multiple blocks depends mainly on the disk transfer rate.
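
A rough worked example, using the ballpark figures commonly quoted for this argument (they are not measurements from these notes):

    seek time      ~ 10 ms
    transfer rate  ~ 100 MB/s
    to keep seeking to about 1% of transfer time, one transfer should take ~ 1 s
    1 s x 100 MB/s = 100 MB, i.e. a block size on the order of 100 MB (64 MB is in this range)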


3. Having a block abstraction in a distributed filesystem brings several benefits:

(1) A file can be larger than any single disk in the network. Nothing requires all the blocks of a file to be stored on the same disk, so they can be placed on any of the disks in the cluster; in the extreme case, the blocks of one file could fill every disk in the cluster.

(2) Making the unit of abstraction a block rather than a whole file simplifies the storage subsystem. Because blocks are of a fixed size, it is easy to calculate how many can be stored on a given disk, and the storage subsystem does not have to concern itself with file metadata, which can be handled separately.

(3) Blocks fit well with replication, which provides fault tolerance and availability.


Like a disk filesystem, HDFS has an fsck command that can display block information.
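
For example, the following command (in the classic hadoop launcher form; newer releases also accept hdfs fsck) lists the blocks that make up each file in the filesystem:

    % hadoop fsck / -files -blocks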


4. Namenode and datanodes

An HDFS cluster has two types of node operating in a master-worker pattern:

One namenode: the master.

Multiple datanodes: the workers.



5. The namenode:

(1) The namenode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently as files on the local disk.

(2) The namenode also knows the datanodes on which all the blocks of each file are located, but it does not store block locations persistently, because this information is rebuilt from the datanodes when the system starts.


6. Clients access the filesystem on behalf of the user by communicating with the namenode and the datanodes. The client presents a filesystem interface similar to POSIX, so user code does not need to know about the namenode and datanodes in order to work.


7. Datanodes

Datanodes are the workhorses of the filesystem: they store and retrieve blocks when they are told to (by clients or by the namenode), and they periodically report back to the namenode with lists of the blocks they are storing.



8. Why the namenode matters:

Without the namenode the filesystem cannot be used: if the namenode were lost, all the files on the filesystem would effectively be lost too, because there would be no way to reconstruct the files from the blocks on the datanodes.

It is therefore important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this:

(1) Back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic; the usual configuration is to write to the local disk as well as to a remotely mounted NFS share.

(2) Run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log, to keep the edit log from growing too large. The secondary namenode usually runs on a separate machine, because the merge requires plenty of CPU and as much memory as the namenode itself. It keeps a copy of the merged namespace image, which can be used if the namenode fails. However, the state of the secondary namenode always lags that of the primary, so if the primary fails completely, some data loss is almost certain. The usual course of action in that case is to copy the namenode's metadata files from NFS to the secondary namenode and run it as the new primary namenode.



9. File read analysis:


Step 1: The client opens the file it wants to read by calling the open() method on the FileSystem object.

For HDFS, this object is an instance of DistributedFileSystem, the distributed filesystem implementation.


Step 2: DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks at the start of the file.

For each block, the namenode returns the addresses of the datanodes that hold a copy of that block.

Furthermore, the datanodes are sorted according to their distance from the client; if the client is itself a datanode and holds a copy of the block, it reads the data from the local datanode.


Step 3: DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from.

FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

The client then calls the read() method on this input stream (the FSDataInputStream).


Step 4: DFSInputStream, which has stored the datanode addresses for the first blocks of the file, connects to the closest datanode holding the first block.

By calling read() repeatedly on the stream, data is transferred from the datanode back to the client.


Step 5: When the end of a block is reached, DFSInputStream closes the connection to the datanode and finds the best datanode for the next block. The client simply keeps reading a continuous stream; all of this is transparent to it.


Step 6: As the client reads from the stream, blocks are read in order, with DFSInputStream opening new connections to datanodes as needed.

It also asks the namenode, when needed, for the datanode locations of the next batch of blocks. When the client has finished reading, it calls close() on the FSDataInputStream.
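
From the client's point of view, the whole sequence above is hidden behind a few calls (a minimal sketch; the path is illustrative and the file contents are simply copied to standard output):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when HDFS is configured

        FSDataInputStream in = null;
        try {
          // steps 1-2: open the file and have the namenode locate its blocks
          in = fs.open(new Path("/user/example/input.txt"));
          // steps 3-6: repeated read() calls stream the blocks back, in order,
          // from the nearest datanodes; here they are just copied to stdout
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);              // calls close() on the stream
        }
      }
    }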



Note:

If DFSInputStream encounters an error while communicating with a datanode during a read, it tries the next closest datanode holding that block. It also remembers datanodes that have failed, so that it does not needlessly retry them for later blocks.

DFSInputStream also verifies checksums on the data transferred from the datanode. If it finds a corrupt block, it reports it to the namenode before attempting to read a replica of that block from another datanode.

(When a block is found to be corrupt, the namenode arranges for a new replica to be made from a good copy, so the corrupt block is eventually replaced.)

10. File write analysis:


Step 1: The client creates the file by calling the create() method on the DistributedFileSystem object.


Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it yet.

The namenode performs various checks to make sure the file does not already exist and that the client has permission to create it.

If these checks pass, the namenode makes a record of the new file; otherwise file creation fails and the client is thrown an IOException.

DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and the namenode.


Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.


Step 4: The DataStreamer consumes the data queue; it is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.

This list of datanodes forms a pipeline. Assuming the replication level is 3, there are three nodes in the pipeline: the DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode; the second datanode likewise stores the packet and forwards it to the third (and last) datanode in the pipeline.


Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by the datanodes, called the ack queue.

A packet is removed from the ack queue only when acknowledgments have been received from all the datanodes in the pipeline.


Note: if a datanode fails while data is being written to it, the following actions are taken (transparently to the client):

First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue, so that datanodes downstream of the failed node will not miss any packets.

Then, the current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode can be deleted if that datanode recovers later.

The failed datanode is removed from the pipeline, and the rest of the block's data is written to the two good datanodes remaining in the pipeline.

The namenode notices that the block is under-replicated and arranges for a further replica to be created on another node; subsequent blocks are then handled normally.

It is possible, but unlikely, for multiple datanodes to fail while a block is being written.


Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.

The namenode already knows which blocks the file is made up of (it is the DataStreamer that asks for block allocations), so it only has to wait for the blocks to be minimally replicated before returning successfully.
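
Again, the pipeline mechanics above are invisible to client code; writing is just create(), write(), close() (a minimal sketch; the path and the content are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // steps 1-2: create() asks the namenode to add the file to the namespace
        FSDataOutputStream out = fs.create(new Path("/user/example/output.txt"));

        // steps 3-5: written bytes are split into packets and pushed down the
        // datanode pipeline; acknowledgments come back before packets are discarded
        out.writeBytes("hello, HDFS\n");

        // steps 6-7: close() flushes the remaining packets and waits for the
        // minimal replication acknowledgment before returning
        out.close();
        fs.close();
      }
    }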


11. Replica placement

The default strategy places the first replica on the node running the client; if the client is running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy.

The second replica is placed on a node in a different rack from the first, chosen at random.

The third replica is placed on the same rack as the second, but on a different node chosen at random.


Further replicas (if the replication factor is greater than 3) are placed on random nodes, although the system tries to avoid placing too many replicas on the same rack.



Once the replica locations have been chosen, a pipeline is built according to the network topology.


















