Hadoop Study Notes (6): How Hadoop Reads and Writes Files Internally

Read files

The file reading process works as follows; the step numbers below refer to a figure in the original post that is not reproduced here:

The client opens the file by calling the open() method on the FileSystem object (for HDFS this is a DistributedFileSystem instance) (step 1). DistributedFileSystem makes a remote procedure call (RPC) to the namenode to obtain the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of all the datanodes holding a replica of that block, sorted by their distance from the client in the cluster's network topology (how Hadoop measures that distance is explained below). If the client is itself a datanode (for example, when the client is a MapReduce task) and that datanode holds a copy of a required block, the client reads the block locally.
Once this is done, DistributedFileSystem returns an FSDataInputStream (which supports file seeks) for the client to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the I/O against the namenode and datanodes.
The client then calls the read() method (step 3). DFSInputStream, which holds the locations of the first few blocks of the file, connects to the first, that is, the closest, datanode for the first block. By calling read() repeatedly (steps 4 and 5), the file's data is streamed to the client. When the end of a block is reached, DFSInputStream closes the connection to that block, looks up the location of the next block, and carries on streaming. All of this is transparent to the client, which simply sees one continuous stream over the whole file.
When the file has been read completely, the client calls close() on the FSDataInputStream to close the input stream (step 6).
If DFSInputStream hits an error while reading a block, it connects to the next datanode holding a replica of that block, and it remembers the failed datanode so it does not needlessly retry it for later blocks. DFSInputStream also verifies the checksum of the data it reads from each datanode; if it finds corruption, it reports the bad block to the namenode and reads a replica of the block from another datanode.
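From client code, this whole read path is just a few calls. Below is a minimal sketch, assuming a reachable HDFS as the default filesystem and a hypothetical input path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // a DistributedFileSystem when the default FS is hdfs://
        FSDataInputStream in = null;
        try {
            // open() triggers the namenode RPC for block locations (steps 1 and 2)
            in = fs.open(new Path("/user/demo/input.txt")); // hypothetical path
            // read() calls stream the blocks from the nearest datanodes (steps 3 to 5)
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // step 6: close the input stream
        }
    }
}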
One advantage of this design is that reads are spread across the datanodes of the cluster, while the namenode serves only block location information, which needs very little bandwidth; this keeps the namenode from becoming a bottleneck as the cluster scales.

Network topology in Hadoop

How do we measure the distance between two nodes in a Hadoop cluster? When processing data in bulk, the main factor limiting the processing rate is the speed at which data can be moved between nodes: bandwidth is the scarce resource. So Hadoop uses bandwidth as the yardstick for the distance between two nodes.
Actually measuring the bandwidth between every pair of nodes, however, is impractical: it would require measurements on a static cluster, while Hadoop clusters change dynamically with the scale of the data being processed, and the number of node pairs grows with the square of the number of nodes. Hadoop therefore measures distance with a simple model: it represents the cluster's network as a tree, and the distance between two nodes is the sum of their distances to their lowest common ancestor. The tree is usually organized by data center, rack, and compute node (datanode). Access is fastest on the local node and slowest across data centers (Hadoop clusters spanning multiple data centers are rare; most run inside a single data center).
If a compute node n1 sits on rack r1 in data center d1, it can be written as /d1/r1/n1. With that notation, the distances in the different cases are:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
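To make the rule concrete, here is a small self-contained sketch that computes this distance from /datacenter/rack/node path strings; it is illustrative only, not Hadoop's actual implementation (which lives in org.apache.hadoop.net.NetworkTopology):

public class TopologyDistance {
    // distance = hops from each node up to their lowest common ancestor in the tree
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}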

Write files

Now let's look at how Hadoop writes files. Understanding the write path also helps make sense of Hadoop's consistency model, covered later in these notes.

Consider creating a new file and writing data to it (the step numbers again refer to a figure from the original post).
First, the client calls create() on DistributedFileSystem with the name of the file to be created (step 1). DistributedFileSystem then asks the namenode to create the new file (step 2); at this point the file has no blocks allocated to it yet. The namenode checks that no file of the same name already exists and that the user has permission to create it. If the checks pass, the namenode records the new file; otherwise creation fails and the client receives an IOException. DistributedFileSystem returns an FSDataOutputStream for the client to write data to. As with FSDataInputStream, the FSDataOutputStream wraps a DFSOutputStream that handles the communication with the namenode and datanodes.
When the client starts writing data (step 3), DFSOutputStream splits the data into packets and puts them on an internal queue, the data queue. A DataStreamer consumes the data queue and asks the namenode to allocate a new block to hold the data. The namenode picks a set of suitable datanodes (as many as the file's replication factor) to form a pipeline; assuming a replication factor of 3, the pipeline contains three datanodes. The DataStreamer streams each packet to the first datanode in the pipeline (step 4), which forwards it to the second datanode, which forwards it to the third.
DFSOutputStream also maintains a second internal queue, the ack queue. A packet is removed from the ack queue only after every datanode in the pipeline has acknowledged it (step 5).
If a datanode fails while data is being written to it, the following steps happen, all transparently to the client:
1) The pipeline is closed, and every packet in the ack queue is moved to the front of the data queue to be re-sent, so that no packet is lost because of the failed datanode.
2) The current block on the datanodes that are still healthy is given a new identity, which is communicated to the namenode, so that if the failed datanode comes back later, the namenode knows the partial block left on it is stale and can be deleted.
3) The failed datanode is removed from the pipeline, and the rest of the block's data is written to the two remaining healthy datanodes. The namenode notices that the block is under-replicated (it has fewer replicas than required) and arranges for another replica to be created; subsequent blocks are then written in the normal way.
It is possible, though rare, for several datanodes in the pipeline to fail while a single block is being written. As long as dfs.replication.min replicas (1 by default) are written, the write is considered successful, and the missing replicas are created asynchronously afterwards until the target replication factor is reached.
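Both numbers are plain configuration properties. A minimal sketch of setting them from client code, assuming the 0.x/1.x-era property names this article uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplicationConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);      // target number of replicas per block
        conf.setInt("dfs.replication.min", 1);  // a write succeeds once this many replicas exist
        FileSystem fs = FileSystem.get(conf);   // client operations now use these settings
    }
}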
When the client finishes writing, it calls close() (step 6). This flushes all the remaining packets into the pipeline and waits for them to be acknowledged before notifying the namenode that the file is complete (step 7). The namenode already knows which blocks make up the file (it was the one that allocated each block when the DataStreamer asked), so it only has to wait for the blocks to be minimally replicated before returning success.
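From client code, all of this machinery hides behind create(), write(), and close(). A minimal sketch, again with a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the namenode to record the new file (steps 1 and 2)
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt")); // hypothetical path
        // write() fills the data queue; the DataStreamer pushes packets down the pipeline (steps 3 to 5)
        out.write("hello hdfs".getBytes("UTF-8"));
        // close() flushes the remaining packets and tells the namenode the file is complete (steps 6 and 7)
        out.close();
    }
}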

How are replicas placed?

How does Hadoop choose where to put a block's replicas when a new file is created? In general, it must balance bandwidth (both write and read bandwidth) against data safety. If all three replicas were placed on one datanode, write bandwidth consumption would be minimal, but the redundancy would provide almost no safety: if that datanode went down, all of the file's data would be lost. At the other extreme, scattering the three replicas across different racks or even data centers maximizes safety but makes writes very expensive in bandwidth. Hadoop 0.6.2 provided a default replica placement policy (since Hadoop 1.x the placement policy is pluggable, so you can supply your own). The default policy places the first replica on the same datanode as the client (if the client runs outside the cluster, a datanode is chosen at random); the second replica on a random datanode in a different rack from the first; and the third replica on a random datanode in the same rack as the second. If the replication factor is greater than three, the remaining replicas are placed randomly across the cluster, with Hadoop trying to avoid putting too many replicas on any one rack. Once the replica locations have been chosen, the write pipeline is built following the network topology.
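As an illustration only, the following schematic sketch applies the three-replica rule above to a toy node-to-rack map; Hadoop's real chooser (BlockPlacementPolicyDefault in later versions) additionally weighs datanode load, free space, and other constraints:

import java.util.*;

public class PlacementSketch {
    public static void main(String[] args) {
        // hypothetical cluster: node name -> rack name
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("n1", "r1"); rackOf.put("n2", "r1");
        rackOf.put("n3", "r2"); rackOf.put("n4", "r2");
        System.out.println(choose(rackOf, "n1")); // e.g. [n1, n3, n4]
    }

    static List<String> choose(Map<String, String> rackOf, String clientNode) {
        Random rand = new Random();
        List<String> nodes = new ArrayList<>(rackOf.keySet());
        // 1st replica: the client's own node, or a random node if the client is off-cluster
        String first = rackOf.containsKey(clientNode)
                ? clientNode : nodes.get(rand.nextInt(nodes.size()));
        // 2nd replica: a random node on a different rack from the first
        List<String> offRack = new ArrayList<>();
        for (String n : nodes)
            if (!rackOf.get(n).equals(rackOf.get(first))) offRack.add(n);
        String second = offRack.get(rand.nextInt(offRack.size()));
        // 3rd replica: a different random node on the second replica's rack
        List<String> sameRack = new ArrayList<>();
        for (String n : nodes)
            if (rackOf.get(n).equals(rackOf.get(second)) && !n.equals(second)) sameRack.add(n);
        String third = sameRack.get(rand.nextInt(sameRack.size()));
        return Arrays.asList(first, second, third);
    }
}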

Overall, the default placement policy gives us a good balance: availability (each block lives on two different racks, which is safer), write bandwidth (the write traffic crosses only one rack switch), and read bandwidth (a reader can pick the closer of the two racks).

Consistency Model

For performance reasons, HDFS relaxes some POSIX requirements (yes, you read that right: POSIX is not just for Linux/Unix; Hadoop borrows the POSIX design for reading file streams from a file system), so HDFS may behave differently from what you expect.
After a file is created, it is immediately visible in the filesystem namespace:
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
However, data written to the file is not guaranteed to be visible, even if you flush the stream; the file's length may still be reported as zero:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
This is because HDFS makes a file's contents visible one block at a time: once more than a block's worth of data has been written, the first block becomes visible to new readers, but the block currently being written is always invisible.
Hadoop does provide a way to force the buffered content out to the datanodes: the sync() method on FSDataOutputStream. Once sync() returns, Hadoop guarantees that everything written so far has reached all the datanodes in the pipeline and is visible to every new reader:
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));
This behaves like the fsync system call in POSIX, which flushes all the buffered data for a given file descriptor out to disk. For example, when writing a local file with the Java API, you are guaranteed to see the content after flushing the stream and syncing:
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to operating system
out.getFD().sync(); // sync to disk (getFD() returns the stream's file descriptor)
assertThat(localFile.length(), is((long) "content".length()));
Closing a stream in HDFS calls sync() implicitly:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));

Given this consistency model, if we never call sync(), we risk losing up to a block's worth of data (or more) when something fails, which is usually unacceptable, so sync() should be used to make sure data has reached the datanodes. On the other hand, calling sync() too often is also bad, because every call adds overhead. A sensible compromise is to write a certain amount of data and then call sync() once; how much data depends on your application, and, within your application's tolerance for potential data loss, the larger the interval between syncs, the better the throughput.
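One common pattern is to sync once every fixed number of bytes written. A minimal sketch, assuming a made-up 4 MB interval that you would tune for your own workload:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

public class PeriodicSyncWriter {
    static final long SYNC_INTERVAL = 4L * 1024 * 1024; // assumed 4 MB; tune for your application

    // Write one record, calling sync() whenever the stream position crosses an interval boundary.
    static void writeRecord(FSDataOutputStream out, byte[] record) throws IOException {
        long before = out.getPos();
        out.write(record);
        if (out.getPos() / SYNC_INTERVAL != before / SYNC_INTERVAL) {
            out.sync(); // hflush()/hsync() in later Hadoop versions
        }
    }
}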

Original source (please credit when reprinting): http://www.cnblogs.com/beanmoon/archive/2012/12/17/2821548.html

 
