A First Look at HDFS: Principles and Architecture

Source: Internet
Author: User
Tags: hdfs dfs, hadoop fs

Catalogue
    1. What is HDFS?
    2. Advantages and disadvantages of HDFS
    3. HDFS architecture
    4. HDFS read and write flow
    5. HDFS commands
    6. HDFS related parameters
1. What is HDFS?

HDFS (Hadoop Distributed File System) is the core subproject of the Hadoop project. First, it is a file system for storing files, which locates files through a directory tree. Second, it is distributed: many servers work together to implement its functions, and each server in the cluster plays its own role.

2. Advantages and disadvantages of HDFS

Choosing HDFS to store data has the following advantages:

1. High fault tolerance
  • Data is automatically saved in multiple copies; fault tolerance is improved by adding replicas.
  • If a replica is lost, it can be recovered automatically by HDFS's internal mechanism, without user intervention.
2. Suitable for batch processing
  • Computation is moved to the data rather than the data to the computation.
  • The data location is exposed to the computing framework.
3. Suitable for big data processing
  • Handles data at the GB, TB, and even PB scale.
  • Handles millions of files or more.
  • Handles clusters on the order of 10K nodes.
4. Streaming file access
  • Write once, read many times. Once a file is written it cannot be modified, only appended.
  • This guarantees data consistency.
5. Can be built on cheap machines
  • Reliability is improved through the multi-replica mechanism.
  • Fault tolerance and recovery mechanisms are provided; for example, if one replica is lost, it can be recovered from the other replicas.

HDFS is also unsuitable for some scenarios:

1. Low-latency data access
  • Storing and retrieving data with millisecond latency is not something HDFS can do.
  • It is suited to high-throughput scenarios where a large amount of data is written at once; it does not work for low-latency access, such as reading data within milliseconds.
2. Small file storage
  • Storing a large number of small files (here, files smaller than the HDFS block size, 64 MB by default in older versions) consumes a large amount of NameNode memory for file, directory, and block metadata. This is undesirable because NameNode memory is always limited.
  • The seek time for small files exceeds the read time, which violates the design goal of HDFS.
3. Concurrent writes and random file modification
  • A file can have only one writer; multiple threads are not allowed to write to it at the same time.
  • Only appending data (append) is supported; random modification of files is not supported.
3. HDFS architecture

HDFS uses a master/slave architecture to store data. It consists of four parts: the HDFS Client, the NameNode, the DataNode, and the Secondary NameNode. The four components are introduced separately below.

1. Client: the client.
  • File splitting. When uploading a file to HDFS, the Client splits the file into blocks and then stores them.
  • Interacts with the NameNode to obtain the location information of the file (see the client sketch after this list).
  • Interacts with DataNodes to read or write data.
  • Provides commands to manage HDFS, such as starting or shutting down HDFS.
  • Can access HDFS through a number of commands.
2. NameNode: the master, a supervisor and manager.
  • Manages the HDFS namespace.
  • Manages data block (block) mapping information.
  • Configures the replica policy.
  • Handles client read and write requests.
3. DataNode: a slave. The NameNode issues commands, and the DataNode performs the actual operations.
  • Stores the actual data blocks.
  • Performs read/write operations on data blocks.
4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately replace the NameNode and provide service.
  • Assists the NameNode and shares part of its workload.
  • Periodically merges the fsimage and edits files and pushes the result to the NameNode.
  • In emergencies, can assist in recovering the NameNode.
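
A minimal sketch of such client interaction with the NameNode, using the Java FileSystem API, is shown below; the NameNode address and file path are placeholders, not values from the original text.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // The client asks the NameNode for the file's metadata ...
            FileStatus status = fs.getFileStatus(new Path("/aaa/jdk.tar.gz")); // hypothetical path
            // ... and for the DataNodes that hold each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block); // offset, length, and hosts of one block
            }
            fs.close();
        }
    }
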
4. HDFS read and write flow

4.1. HDFS block size

Files in HDFS are physically split into blocks. The block size can be specified by the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.

HDFS blocks are larger than disk blocks in order to minimize addressing overhead. If the block is large enough, the time to transfer the data from disk is significantly greater than the time needed to locate the start of the block. Thus the time to transfer a file consisting of multiple blocks is dominated by the disk transfer rate.

If the addressing time is about 10 ms and the transfer rate is 100 MB/s, then to keep the addressing time at only 1% of the transfer time, the block size should be set to about 100 MB. The default block size is 128 MB.

Block size: 10 ms × 100 × 100 MB/s = 1 s × 100 MB/s = 100 MB
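
A quick way to confirm the block size a cluster will use for new files is the Java FileSystem API; a minimal sketch under that assumption is shown below (the NameNode address is a placeholder).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // The effective dfs.blocksize for new files under the given path, in bytes.
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");
            fs.close();
        }
    }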

4.2. HDFS write data flow

1) The client sends a request to the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.

2) The NameNode returns whether the file can be uploaded.

3) The client asks which DataNode servers the first block should be uploaded to.

4) The NameNode returns three DataNode nodes: dn1, dn2, and dn3.

5) The client requests dn1 to upload data; dn1 in turn calls dn2, and dn2 calls dn3, completing the establishment of the communication pipeline.

6) dn1, dn2, and dn3 acknowledge the client step by step back along the pipeline.

7) The client begins to upload the first block to dn1 (first reading the data from disk into a local memory cache), packet by packet. When dn1 receives a packet it passes it on to dn2, and dn2 passes it on to dn3; every packet dn1 sends is placed into a reply queue to wait for acknowledgement.

8) When one block transfer is complete, the client again asks the NameNode for the DataNode servers for the second block (repeat steps 3-7).
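
This is the flow that the standard Java client API drives under the hood; a minimal write sketch under that assumption is shown below (the NameNode address and file path are placeholders, not values from the original text).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // create() asks the NameNode for target DataNodes; the returned stream
            // then writes packets into the DataNode pipeline described in steps 5-8.
            try (FSDataOutputStream out = fs.create(new Path("/aaa/hello.txt"))) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }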

4.3. HDFS read data flow

1) The client sends a request to the NameNode to download a file; the NameNode queries its metadata to find the DataNode addresses where the file's blocks reside.

2) A DataNode server is selected (nearest first, then at random) and the client requests to read the data from it.

3) The DataNode begins transmitting data to the client (it reads the data from disk into a stream and sends it packet by packet, verifying checksums).

4) The client receives the packets, caches them locally first, and then writes them to the destination file.
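
A matching read sketch using the Java FileSystem API is shown below (again, the NameNode address and file path are placeholders).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // open() asks the NameNode for the block locations, then streams the
            // blocks from the chosen DataNodes as described in the steps above.
            try (FSDataInputStream in = fs.open(new Path("/aaa/hello.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }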

5. HDFS commands

1) Basic syntax

bin/hadoop fs <specific command>

2) Common commands in practice

(1) -help: output the usage of a command

bin/hdfs dfs -help rm

(2) -ls: display directory information

hadoop fs -ls /

(3) -mkdir: create a directory on HDFS

hadoop fs -mkdir -p /aaa/bbb/cc/dd

(4) -moveFromLocal: cut and paste from local to HDFS

hadoop fs -moveFromLocal /home/hadoop/a.txt /aaa/bbb/cc/dd

(5) -moveToLocal: cut and paste from HDFS to local (not yet implemented)

hadoop fs -help moveToLocal

-moveToLocal <src> <localdst>:

Not implemented yet

(6) -appendToFile: append a file to the end of a file that already exists

hadoop fs -appendToFile ./hello.txt /hello.txt

(7) -cat: display file contents

(8) -tail: display the end of a file

hadoop fs -tail /weblog/access_log.1

(9) -chgrp, -chmod, -chown: used the same way as in the Linux file system, to modify file ownership and permissions

hadoop fs -chmod 666 /hello.txt

hadoop fs -chown someuser:somegrp /hello.txt

(10) -copyFromLocal: copy files from the local file system to an HDFS path

hadoop fs -copyFromLocal ./jdk.tar.gz /aaa/

(11) -copyToLocal: copy from HDFS to local

hadoop fs -copyToLocal /user/hello.txt ./hello.txt

(12) -cp: copy from one HDFS path to another HDFS path

hadoop fs -cp /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

(13) -mv: move files within an HDFS directory

hadoop fs -mv /aaa/jdk.tar.gz /

(14) -get: equivalent to copyToLocal; download files from HDFS to local

hadoop fs -get /user/hello.txt ./

(15) -getmerge: merge and download multiple files, e.g. the HDFS directory /aaa/ contains multiple files: log.1, log.2, log.3, ...

hadoop fs -getmerge /aaa/log.* ./log.sum

(16) -put: equivalent to copyFromLocal

hadoop fs -put /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2

(17) -rm: delete files or folders

hadoop fs -rm -r /aaa/bbb/

(18) -rmdir: delete an empty directory

hadoop fs -rmdir /aaa/bbb/ccc

(19) -df: show free space information for the file system

hadoop fs -df -h /

(20) -du: show size information for a folder

hadoop fs -du -s -h /user/data/wcinput

188.5 M /user/data/wcinput

hadoop fs -du -h /user/data/wcinput

188.5 M /user/data/wcinput/hadoop-2.7.2.tar.gz

97 /user/data/wcinput/wc.input

(21) -count: count the number of file nodes in a specified directory

hadoop fs -count /aaa/

hadoop fs -count /user/data/wcinput

1 2 197657784 /user/data/wcinput

(the output columns are the directory count, the total number of contained files, the total size in bytes, and the path)

(22) -setrep: set the replication factor of a file in HDFS

hadoop fs -setrep 3 /aaa/jdk.tar.gz

The replication factor set here is only recorded in the NameNode metadata; whether there are really that many replicas also depends on the number of DataNodes. If there are only 3 devices, there can be at most 3 replicas; only when the number of nodes increases to 10 can the replication factor reach 10.

6. HDFS related parameters

Each entry below lists the parameter name(s), the default value, the configuration file it belongs to, and a description.

1. dfs.block.size, dfs.blocksize (default: 134217728, hdfs-site.xml): The default block size for new HDFS files, in bytes. Note that this value is also used as the HBase region server HLog block size.

2. dfs.replication (default: 3, hdfs-site.xml): The number of replicas of each HDFS file data block.

3. dfs.webhdfs.enabled (default: TRUE, hdfs-site.xml): Enables the WebHDFS interface on port 50070.

4. dfs.permissions (default: TRUE, hdfs-site.xml): Enables HDFS file permission checking.

5. dfs.datanode.failed.volumes.tolerated (default: 0, hdfs-site.xml): The maximum number of failed drives a DataNode tolerates before shutting down; the default of 0 means the DataNode shuts down as soon as 1 drive fails.

6. dfs.data.dir, dfs.datanode.data.dir (default: xxx,xxx, hdfs-site.xml): DataNode data storage paths; multiple disks can be listed, separated by commas.

7. dfs.name.dir, dfs.namenode.name.dir (default: xxx,xxx, hdfs-site.xml): NameNode local metadata storage directories; multiple disks can be listed, separated by commas.

8. fs.trash.interval (default: 1, core-site.xml): The trash checkpoint interval, in minutes. To disable the trash feature, set it to 0.

9. dfs.safemode.min.datanodes (default: 0, hdfs-site.xml): The number of DataNodes that must be alive before the NameNode exits safemode. A value less than or equal to 0 means the number of live DataNodes is not taken into account when deciding whether to remain in safemode during startup. A value greater than the number of DataNodes in the cluster makes safemode permanent.

10. dfs.client.read.shortcircuit (default: TRUE, hdfs-site.xml): Enables HDFS short-circuit reads, which allow the client to read HDFS file blocks directly from the local DataNode. This improves performance for clients co-located with the data.

11. dfs.datanode.handler.count (default: 3, hdfs-site.xml): The number of DataNode server threads. The default is 3; larger clusters can increase it appropriately, e.g. to 8.

12. dfs.datanode.max.xcievers, dfs.datanode.max.transfer.threads (default: 256, hdfs-site.xml): The maximum number of threads a DataNode uses to transfer data in and out, i.e. the maximum number of file transfer threads per DataNode.

13. dfs.balance.bandwidthPerSec, dfs.datanode.balance.bandwidthPerSec (default: 1048576, hdfs-site.xml): The maximum bandwidth each DataNode can use for balancing, in bytes per second.

Some of the parameters above have two names; the first is the old 1.x name and the second is the new 2.x name.
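
These settings normally live in hdfs-site.xml or core-site.xml, but a client can also override some of them per application. A minimal sketch using the Java Configuration API is shown below; the values chosen are arbitrary examples, not recommendations from the original text.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HdfsConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side overrides of the defaults listed above (arbitrary example values).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB blocks for new files
            conf.setInt("dfs.replication", 2);                 // 2 replicas instead of 3
            // Hypothetical NameNode address; adjust for your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            System.out.println("Effective replication: " + conf.get("dfs.replication"));
            fs.close();
        }
    }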

The above information was collected from the internet; if there is any issue, please let us know.
