Hadoop in detail (III): HDFS data flow

Source: Internet
Author: User

1. Topological distance

Here is the simple way Hadoop calculates network topology distance.

In many deployments bandwidth is a scarce resource. Working out exactly how to make the best use of it is expensive to compute and subject to too many constraints, so Hadoop takes a simpler approach:

Calculate the distance between two nodes and prefer the nearest node for an operation. If you are familiar with data structures, you will recognize this as an instance of a distance measurement algorithm.

Represented as a data structure, the network is a tree, and the distance between two nodes is the sum of their distances to their closest common ancestor.

In practice, a typical layout looks like this:

The tree has three levels: data centers (here D1 and D2), racks (here R1, R2, and R3), and server nodes (here N1, N2, N3, and N4).

1. distance(D1/R1/N1, D1/R1/N1) = 0 (same node)

2. distance(D1/R1/N1, D1/R1/N2) = 2 (different nodes on the same rack)

3. distance(D1/R1/N1, D1/R2/N3) = 4 (different racks in the same data center)

4. distance(D1/R1/N1, D2/R3/N4) = 6 (different data centers)
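The four cases above can be reproduced with a short sketch. This is not Hadoop's own code; the function name `distance` and the slash-separated path format are assumptions made for illustration. It counts the hops from each node up to the closest common ancestor and adds them together.

```python
def distance(a: str, b: str) -> int:
    """Tree distance between two nodes given as 'datacenter/rack/node' paths.

    Illustrative sketch only: the path format mirrors the notation in the
    text (e.g. 'D1/R1/N1') and is not an actual Hadoop API.
    """
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")

    # Count how many leading path components the two nodes share;
    # that shared prefix ends at the closest common ancestor.
    common = 0
    for part_a, part_b in zip(path_a, path_b):
        if part_a != part_b:
            break
        common += 1

    # Hops from each node up to the common ancestor, added together.
    return (len(path_a) - common) + (len(path_b) - common)


if __name__ == "__main__":
    for other in ("D1/R1/N1", "D1/R1/N2", "D1/R2/N3", "D2/R3/N4"):
        print(other, distance("D1/R1/N1", other))
```

Comparing path prefixes works here because every node sits at the same depth in the tree; a topology with uneven depths would still give a valid hop count, but the results would no longer fall neatly on 0, 2, 4, and 6.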

2. Replica placement

The process by which the NameNode chooses DataNodes to store the replicas of a block is called replica placement. The strategy is a tradeoff between reliability and read/write bandwidth. Consider two extremes:

1. Keep all replicas on the same node. Write bandwidth is guaranteed, but the reliability is illusory: once that node dies, all of the data is gone. Read bandwidth from other racks is also very low.

2. Scatter all replicas across different nodes. Reliability improves, but at the cost of bandwidth.

Even within a single data center there are many possible placement schemes. Hadoop 0.17.0 introduced a relatively balanced strategy, and from 1.x onward the replica placement policy is selectable.

Hadoop's default strategy is as follows:

1. Place the first replica on the same node as the client. If the client is not in the cluster, select a node to store it.

2. Place the second replica on a randomly chosen node on a different rack from the first replica.

3. Place the third replica on a randomly chosen different node on the same rack as the second replica.

4. Place any remaining replicas on completely random nodes.
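The four rules above can be sketched as a small function. This is a simplified illustration, not the NameNode's real placement code: the name `choose_replica_nodes`, the `topology` dictionary (rack name mapped to a list of node names), and the fallback to any free node when a rule cannot be satisfied are all assumptions. Where the text only says "select a node", the sketch assumes a uniformly random pick.

```python
import random


def choose_replica_nodes(topology, client=None, replication=3, rng=None):
    """Pick nodes for a block's replicas following the four rules above.

    topology: dict mapping rack name -> list of node names (hypothetical).
    client:   node name of the writer, or None if it is outside the cluster.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    all_nodes = list(rack_of)
    chosen = []
    if not all_nodes or replication < 1:
        return chosen

    def pick(candidates):
        # Choose a random candidate that does not already hold a replica.
        free = [n for n in candidates if n not in chosen]
        if not free:
            return None
        node = rng.choice(free)
        chosen.append(node)
        return node

    # Rule 1: the client's own node, or a random node (assumption) if the
    # client is not part of the cluster.
    if client in rack_of:
        chosen.append(client)
    else:
        pick(all_nodes)

    # Rule 2: a random node on a different rack from the first replica.
    if len(chosen) < replication:
        first_rack = rack_of[chosen[0]]
        off_rack = [n for n in all_nodes if rack_of[n] != first_rack]
        if pick(off_rack) is None:
            pick(all_nodes)  # assumed fallback for single-rack clusters

    # Rule 3: a different node on the same rack as the second replica.
    if len(chosen) == 2 and replication > 2:
        second_rack = rack_of[chosen[1]]
        if pick(topology[second_rack]) is None:
            pick(all_nodes)  # assumed fallback when that rack is full

    # Rule 4: any remaining replicas go to random nodes.
    while len(chosen) < replication and pick(all_nodes) is not None:
        pass
    return chosen


if __name__ == "__main__":
    racks = {"R1": ["N1", "N2"], "R2": ["N3", "N4"], "R3": ["N5", "N6"]}
    print(choose_replica_nodes(racks, client="N1"))
```

With the default replication factor of 3, the result always spans exactly two racks when the cluster has at least two racks and the second replica's rack has a spare node. The block survives the loss of one rack, while only one replica has to be written across racks.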
