How does Hadoop read and write files internally?

Reading a file

The internal working mechanism of reading a file is shown in the figure below:

The client opens the file by calling the open() method of the FileSystem object (for HDFS this is a DistributedFileSystem object), which is step 1 in the figure. DistributedFileSystem then asks the namenode, via an RPC call, for the locations of the first few blocks of the file (step 2). For each block, the namenode returns the addresses of all datanodes that hold a replica of that block, sorted by their distance to the client in the cluster's network topology (see "Network topology in Hadoop" below). If the client is itself running on a datanode (for example, when the client is a MapReduce task) and that datanode holds the required block, the client reads the block locally.
After these steps, DistributedFileSystem returns an FSDataInputStream (which supports file seek) from which the client can read data. FSDataInputStream wraps a DFSInputStream, which handles the I/O with the namenode and the datanodes.
The client then calls the read() method (step 3), and DFSInputStream (which already holds the locations of the first few blocks of the file) connects to the first, that is, the closest, datanode holding the first block and starts fetching data. By repeatedly calling read() (steps 4 and 5), the client receives the file's data as a stream. When the end of a block is reached, DFSInputStream closes the connection to that block's datanode, finds the best datanode for the next block, and continues streaming through further read() calls. All of this is transparent to the client, which simply sees one continuous stream for the whole file.
When the client has finished reading, it calls the close() method of FSDataInputStream to close the input stream (step 6).
If DFSInputStream encounters an error while reading a block from a datanode, it switches to the next datanode that holds a replica of that block, and it remembers the failed datanode so that it is not retried for later blocks. DFSInputStream also verifies the checksum of the data read from each datanode; if it detects corruption, it reports the corrupt block to the namenode and reads a replica of that block from another datanode.
One benefit of this design is that reads are spread across the datanodes of the cluster, while the namenode only serves block location information, which requires very little bandwidth. This avoids a single point of contention and lets the cluster scale further.
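To make the sequence above concrete, here is a minimal read sketch using the public FileSystem API. It is an illustration only: the path /user/demo/input.txt is a hypothetical example, and the configuration is assumed to point at an HDFS cluster (fs.defaultFS in core-site.xml). DistributedFileSystem, DFSInputStream and the datanode connections all stay hidden behind FSDataInputStream.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // for HDFS this returns a DistributedFileSystem
        Path p = new Path("/user/demo/input.txt");            // hypothetical file, for illustration only
        FSDataInputStream in = fs.open(p);                    // steps 1 and 2: open() and fetch block locations
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);   // steps 3 to 5: read() streams the blocks from the datanodes
        } finally {
            IOUtils.closeStream(in);                          // step 6: close the input stream
        }
    }
}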

Network topology in Hadoop

How do we measure the distance between two nodes in a Hadoop cluster? When processing data in bulk, the limiting factor is the rate at which data can be moved between nodes: bandwidth is a scarce resource. So bandwidth is used as the measure of the distance between two nodes.
In practice, however, measuring the bandwidth between every pair of nodes is hard: it would have to be done on an idle cluster, and the number of node pairs grows with the square of the number of nodes. So Hadoop takes a simpler approach: it represents the cluster's network as a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor. The levels of the tree correspond to the data center, the rack, and the node. Data transfer is fastest between processes on the same node and slowest across data centers (clusters spanning multiple data centers are rare; most Hadoop clusters run within a single data center).
If a node n1 sits in rack r1 of data center d1, it can be written as /d1/r1/n1. The distances between two nodes in various situations are then as follows (a small sketch of this computation appears after the list):
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
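For illustration only, here is a toy sketch of how such a tree distance can be computed from path strings of the form /datacenter/rack/node. This is not Hadoop's own NetworkTopology implementation, just a minimal equivalent of the rule described above.

public class TopologyDistance {
    // Distance = (levels from a up to the closest common ancestor) + (levels from b up to it).
    public static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center, different racks
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}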

Writing a file

Now let's look at how Hadoop writes files. Understanding the write mechanism also helps in understanding Hadoop's consistency model, discussed further below.

The figure shows an example of creating a new file and writing data to it; the numbered steps below refer to that figure.
First, the client calls create() on the DistributedFileSystem with the name of the file to be created (step 1). DistributedFileSystem then asks the namenode, via an RPC call, to create the new file (step 2); at this point no blocks have been allocated for it yet. The namenode checks whether a file with the same name already exists and whether the client has permission to create the file. If the checks pass, the namenode creates a record for the new file; otherwise, creation fails and the client receives an IOException. DistributedFileSystem returns an FSDataOutputStream for the client to write data to. Like FSDataInputStream, FSDataOutputStream wraps a DFSOutputStream, which handles the communication with the namenode and the datanodes.
When the client starts writing data (step 3), DFSOutputStream splits it into packets and places them on an internal queue, the data queue. A DataStreamer thread consumes the data queue and asks the namenode to allocate a new block to hold the data; the namenode picks a suitable set of datanodes (as many as the file's replication factor) to form a pipeline. Assuming a replication factor of 3, the pipeline contains three datanodes. The DataStreamer writes the packets to the first datanode in the pipeline (step 4), which forwards them to the second datanode, which in turn forwards them to the third.
DFSOutputStream also maintains a second internal queue, the ack queue. A packet is removed from the ack queue only after it has been acknowledged by every datanode in the pipeline (step 5).
If a datanode fails while data is being written, the following steps, all transparent to the client, are taken:
1) The pipeline is closed, and all packets in the ack queue are moved back to the front of the data queue, so that datanodes downstream of the failed one do not miss any packets.
2) The current block on the datanodes that are still working is marked, so that when the failed datanode later restarts, the namenode knows that the partial block left on it is stale and can have it deleted.
3) The failed datanode is removed from the pipeline, and the remainder of the block is written to the two datanodes that are still running. The namenode notices that the block is under-replicated (that is, it has too few replicas) and arranges for a new replica to be created elsewhere; subsequent blocks are then written in the normal way.
It is possible, though rare, for several datanodes in the pipeline to fail. As long as dfs.replication.min replicas (1 by default) have been written, the write is considered successful; the remaining replicas are created asynchronously afterwards until the target replication factor is reached.
When the client has finished writing data, it calls the close() method (step 6). This flushes all remaining packets into the pipeline, waits for them to be acknowledged, and then notifies the namenode that the file is complete (step 7). The namenode already knows which blocks make up the file (since the DataStreamer asked it to allocate each block), so it only needs to wait until the blocks are minimally replicated before returning success.
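Mirroring the steps above, here is a minimal write sketch using the public FileSystem API. It is an illustration only: the path /user/demo/output.txt is a hypothetical example, and the packets, pipeline and ack queue all stay hidden behind FSDataOutputStream.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                      // for HDFS this returns a DistributedFileSystem
        Path p = new Path("/user/demo/output.txt");                // hypothetical file, for illustration only
        FSDataOutputStream out = fs.create(p);                     // steps 1 and 2: ask the namenode to record the new file
        out.write("some data".getBytes(StandardCharsets.UTF_8));   // step 3: data is packetized and pipelined (steps 4 and 5)
        out.close();                                               // steps 6 and 7: flush remaining packets, then tell the namenode the file is complete
    }
}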

How replicas are distributed

When Hadoop creates a new file, how does it choose where to place the blocks? In general two factors are weighed against each other: bandwidth (both write and read bandwidth) and data safety. At one extreme, placing all three replicas on a single datanode costs almost no write bandwidth but provides essentially no redundancy: if that datanode fails, all the data of the file is lost. At the other extreme, spreading the three replicas across different racks, or even different data centers, makes the data very safe but consumes a lot of write bandwidth. Hadoop 0.17.0 introduced a default replica placement strategy (from Hadoop 1.x onwards the placement policy is pluggable, so you can supply your own): the first replica is placed on the same datanode as the client (or on a randomly chosen datanode if the client is running outside the cluster); the second replica is placed on a randomly chosen datanode on a different rack; and the third replica is placed on another randomly chosen datanode on the same rack as the second replica. If the replication factor is greater than three, the remaining replicas are stored at random across the cluster, with Hadoop trying to avoid placing too many replicas on the same rack. Once the replica locations have been chosen, the write pipeline is built along the resulting network topology.
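The replication factor that drives this placement is configurable. As a small, hedged sketch (using the classic dfs.replication property name and assuming the same Hadoop imports as the examples above), it can be set cluster-wide through the configuration or changed per file through the FileSystem API:

Configuration conf = new Configuration();
conf.setInt("dfs.replication", 3);               // default number of replicas per block
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/user/demo/output.txt");      // hypothetical path, for illustration only
FSDataOutputStream out = fs.create(p);           // the new file inherits the replication factor
out.close();
fs.setReplication(p, (short) 2);                 // change the replication factor of an existing file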



Overall, the default replica placement strategy gives a good balance: availability (blocks are stored on two different racks, which is safer), write bandwidth (the write pipeline only has to cross a single rack), and read bandwidth (a reader can choose the closer of the two racks).

Consistency model

For performance reasons, some parts of HDFS do not comply with POSIX (yes, you read that right: POSIX is not only a Linux/Unix matter; Hadoop borrows the POSIX design for reading file system streams), so some operations may behave differently from what you expect. Be aware of this.
Once a file has been created, it is visible in the filesystem namespace:
Path p = new Path("p");
fs.create(p);   // fs is a FileSystem instance, e.g. obtained from FileSystem.get(conf)
assertThat(fs.exists(p), is(true));
However, the contents written to the file are not guaranteed to be visible immediately, and the file's length may remain zero even after you have flushed the written data:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
This is because in Hadoop the contents of a file become visible to other readers only once a full block's worth of data has been written and persisted; the block currently being written is always invisible to other readers.
Hadoop does provide a way to force the buffered contents out to the datanodes: the sync() method of FSDataOutputStream. After sync() returns, Hadoop guarantees that everything written so far has been flushed to the datanodes in the pipeline and is visible to all new readers:
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));
This method behaves like the fsync system call in POSIX (which flushes the buffered data for a given file descriptor to disk). For example, when writing a local file with the standard Java API, we are guaranteed that the written content is visible after calling flush() and syncing the file descriptor:
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to the operating system
out.getFD().sync(); // sync to disk (getFD() returns the file descriptor backing this stream)
assertThat(localFile.length(), is((long) "content".length()));
Closing a stream in HDFS implicitly calls the sync() method:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));

Because of this consistency model, if we never call the sync() method we risk losing up to a block of data if the client or the system fails, which is usually unacceptable. So sync() should be called at appropriate points to make sure the data has actually reached the datanodes. On the other hand, calling sync() too frequently hurts performance because of the extra overhead, so a common approach is to write a certain amount of data and then call sync() once. How much data that is depends on your application; the larger the amount can be without compromising your application's requirements, the better.
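As a rough, hedged sketch of this trade-off (the 64 KB threshold and the record source are arbitrary choices for illustration, not recommendations; the usual Hadoop imports and java.io.IOException are assumed), one might call sync() after every fixed amount of written data:

static final long SYNC_INTERVAL = 64 * 1024; // 64 KB, an arbitrary example value

static void writeWithPeriodicSync(FileSystem fs, Path p, Iterable<String> records) throws IOException {
    FSDataOutputStream out = fs.create(p);
    long bytesSinceSync = 0;
    for (String record : records) {
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        out.write(bytes);
        bytesSinceSync += bytes.length;
        if (bytesSinceSync >= SYNC_INTERVAL) {
            out.sync();          // make everything written so far visible and durable on the datanodes
            bytesSinceSync = 0;
        }
    }
    out.close();                 // close() implicitly syncs the remaining data
}

In later Hadoop releases the same effect is obtained with hflush() (and hsync() for stronger durability guarantees); sync() is the name used in the release this article describes.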

Reproduced from: http://www.cnblogs.com/beanmoon/archive/2012/12/17/2821548.html (please credit the source when reposting).
