Hadoop dfs.replication

Source: Internet
Author: User

First, dfs.replication is a client-side parameter, that is, a node-level setting: the value that takes effect is the one configured on the node where the client runs, so it must be set on every node that writes to HDFS.
In practice, the default of three replicas is enough; more replicas rarely add value.

A file's replication factor is fixed when it is uploaded to HDFS: changing dfs.replication afterwards does not affect files that are already stored. You can, however, specify the replication factor at upload time:
hadoop dfs -D dfs.replication=1 -put 70M logs/2

To change the replication factor of files that have already been uploaded, run:
hadoop fs -setrep -R 3 /
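The behavior above (a file keeps the replication factor in effect when it was written, and only -setrep or a re-upload changes it) can be sketched with a toy model. This is illustrative Python, not Hadoop code; the class and method names are invented for the sketch.

```python
# Toy model (not Hadoop code): the namenode records a replication factor
# per file at write time. Changing the client's default later does not
# alter existing files; only setrep (or re-uploading) does.

class MiniNamenode:
    def __init__(self, default_replication=3):
        self.default_replication = default_replication
        self.files = {}  # path -> replication factor recorded at creation

    def create(self, path, replication=None):
        # A file keeps the replication in effect when it was written.
        self.files[path] = replication or self.default_replication

    def setrep(self, path, replication):
        # Equivalent in spirit to `hadoop fs -setrep`.
        self.files[path] = replication

nn = MiniNamenode(default_replication=3)
nn.create("/logs/a.log")                 # uses the default: 3
nn.create("/logs/b.log", replication=1)  # like -D dfs.replication=1 -put
nn.default_replication = 2               # changing the default later...
print(nn.files["/logs/a.log"])           # ...leaves existing files at 3
nn.setrep("/logs/a.log", 2)              # only an explicit setrep changes it
print(nn.files["/logs/a.log"])
```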

To view the current replication status of HDFS, run fsck:
hadoop fsck / -locations
FSCK started by hadoop from /172.18.6.112 for path / at Thu Oct 27 13:24:25 CST 2011
...... Status: Healthy
Total size: 4834251860 B
Total dirs: 21
Total files: 20
Total blocks (validated): 82 (avg. block size 58954290 B)
Minimally replicated blocks: 82 (100.0%)
Over-replicated blocks: 0 (0.0%)
Under-replicated blocks: 0 (0.0%)
Mis-replicated blocks: 0 (0.0%)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0%)
Number of data-nodes: 3
Number of racks: 1
Fsck ended at Thu Oct 27 13:24:25 CST 2011 in 10 milliseconds
The filesystem under path '/' is healthy

The replication factor of an individual file appears in the second column of an ls listing:
hadoop dfs -ls
-rw-r--r--   3 hadoop supergroup  153748148 /user/hadoop/logs/201108/impression_witspixel2011080100.thin.log.gz

If you have only three datanodes but set the replication factor to 4, the fourth replica is never created, because each datanode can store at most one replica of a given block.
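The constraint just described means the replication actually achieved is capped by the number of live datanodes. A minimal sketch (the function name is invented for illustration):

```python
# Sketch of the constraint above: a datanode holds at most one replica of
# any given block, so achieved replication = min(target, live datanodes).

def achieved_replication(target: int, live_datanodes: int) -> int:
    return min(target, live_datanodes)

print(achieved_replication(4, 3))  # target 4 on a 3-node cluster -> 3
print(achieved_replication(3, 3))  # target 3 is satisfiable -> 3
```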
hadoop fsck / -locations then shows a warning for each affected file, and the summary reports a missing-replica rate of 33.33%:
/user/hadoop/logs/test.log:  Under replicated blk_-45151128047308146_1147. Target Replicas is 4 but found 3 replica(s).
Status: Healthy
Total size: 4834251860 B
Total dirs: 21
Total files: 20
Total blocks (validated): 82 (avg. block size 58954290 B)
Minimally replicated blocks: 82 (100.0%)
Over-replicated blocks: 0 (0.0%)
Under-replicated blocks: 82 (100.0%)
Mis-replicated blocks: 0 (0.0%)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 82 (33.333332%)
Number of data-nodes: 3
Number of racks: 1
Fsck ended at Thu Oct 27 13:22:14 CST 2011 in 12 milliseconds
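The 33.33% figure in the summary above follows from simple arithmetic, assuming fsck reports missing replicas relative to the replicas it actually found (fsck itself printed 33.333332, presumably single-precision rounding):

```python
# Reconstruction of fsck's missing-replica percentage for the run above.
blocks = 82
target = 4           # requested replication factor
found_per_block = 3  # only 3 datanodes, so at most 3 replicas per block

expected = blocks * target        # 328 replicas wanted
found = blocks * found_per_block  # 246 replicas present
missing = expected - found        # 82 replicas missing

print(missing)                          # 82, matching "Missing replicas: 82"
print(round(100 * missing / found, 4))  # ~33.3333, matching the percentage
```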
Reference: hdfs_design
http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.pdf
http://hadoop.apache.org/common/docs/r0.20.204.0/hdfs_design.html
When a file is uploaded, the client does not contact the namenode immediately; instead it caches the data in a local temporary file. Once the cached data reaches one HDFS block size, the client contacts the namenode. The namenode inserts the file name into the filesystem namespace, allocates a data block for it, and replies to the client with the datanode hostnames and the block location. The client then flushes the data from the local temporary file to the specified datanode. When the file is closed, any remaining unflushed data in the temporary file is transmitted to the datanode, and the client notifies the namenode that the file is closed. At that point the namenode commits the file-creation operation to persistent storage. If the namenode dies before the file is closed, the file is lost.
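The client-side buffering just described can be modeled in a few lines. This is a toy model with invented names and a tiny block size, not Hadoop's actual client classes:

```python
# Toy model of the client write path: data is buffered locally, and the
# namenode is contacted only once a full block's worth has accumulated.

BLOCK_SIZE = 8  # tiny block size for demonstration only

class TinyClient:
    def __init__(self):
        self.buffer = b""
        self.namenode_calls = 0  # block allocations requested from namenode

    def write(self, data: bytes):
        self.buffer += data
        while len(self.buffer) >= BLOCK_SIZE:
            self.namenode_calls += 1              # allocate block, get datanodes
            self.buffer = self.buffer[BLOCK_SIZE:]  # flush block to datanodes

    def close(self):
        if self.buffer:          # leftover data is flushed on close
            self.namenode_calls += 1
            self.buffer = b""

c = TinyClient()
c.write(b"0123456789")   # 10 bytes: one full block flushed, 2 bytes buffered
print(c.namenode_calls)  # 1
c.close()                # remaining bytes flushed on close
print(c.namenode_calls)  # 2
```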

Creating replicas
As mentioned above, when the client writes a file to HDFS, it first writes the data to a local temporary file. Assume the HDFS replication factor is set to 3. When the cached file reaches one HDFS block size, the client retrieves a list of datanodes from the namenode; this list names the datanodes that will host the replicas of the block. The client flushes the data to the first datanode in the list. That datanode receives the data in small portions (4 KB), writes each portion to its local disk, and forwards it to the second datanode in the list, which performs the same operation. Each datanode in the pipeline can thus receive data from the previous node and forward it to the next at the same time.
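The pipeline above can be sketched as follows. This is a simplified simulation, not Hadoop code: datanodes are represented by in-memory buffers, and acknowledgements and failure handling are omitted:

```python
# Minimal sketch of the replication pipeline: a block is streamed in 4 KB
# packets; each datanode writes a packet and forwards it down the chain.

PACKET = 4096  # 4 KB, as described in the text

def pipeline_write(block: bytes, datanodes: list) -> None:
    # Each datanode is a bytearray standing in for its local disk.
    for offset in range(0, len(block), PACKET):
        packet = block[offset:offset + PACKET]
        for disk in datanodes:   # receive, persist, forward to the next node
            disk.extend(packet)

nodes = [bytearray(), bytearray(), bytearray()]  # replication factor 3
data = bytes(10000)                              # a bit under 2.5 packets
pipeline_write(data, nodes)
print(all(bytes(d) == data for d in nodes))      # True: 3 identical replicas
```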
