Catalogue
- What is HDFS?
- Advantages and disadvantages of HDFS
- HDFS architecture
- HDFS read and write processes
- HDFS commands
- HDFS parameters
1. What is HDFS?
HDFS (Hadoop Distributed File System) is the core subproject of the Hadoop project. First, it is a file system: it stores files and locates them through a directory tree. Second, it is distributed: many servers federate to implement its functions, and each server in the cluster plays its own role.
2. Advantages and Disadvantages of HDFS
Storing data in HDFS has the following advantages:

1. High fault tolerance
   - Data is automatically saved in multiple replicas, and fault tolerance is improved by adding replicas.
   - A lost replica is recovered automatically by HDFS's internal mechanisms, without user intervention.
2. Suitable for batch processing
   - Computation is moved to the data rather than the data to the computation.
   - Data locations are exposed to the computing framework.
3. Suitable for big data processing
   - It handles data at GB, TB, and even PB scale.
   - It handles file counts in the millions and beyond.
   - It scales to clusters of around 10K nodes.
4. Streaming file access
   - Write once, read many times: once a file is written it cannot be modified, only appended to.
   - This guarantees data consistency.
5. Can be built on cheap machines
   - Reliability is improved through the multi-replica mechanism.
   - Fault tolerance and recovery mechanisms are provided; for example, a lost replica can be recovered from the other replicas.
HDFS is not suitable for the following scenarios:

1. Low-latency data access
   - Millisecond-level data access is not possible.
   - HDFS is built for high-throughput scenarios where large volumes of data are written at once; it does not work well in low-latency situations such as millisecond-level reads.
2. Small file storage
   - Storing a large number of small files (here, files smaller than the HDFS block size, 64 MB by default in older versions) consumes large amounts of NameNode memory for file, directory, and block metadata. This is undesirable because NameNode memory is always limited.
   - The seek time for small files exceeds the read time, which violates HDFS's design goals.
3. Concurrent writes and random file modification
   - A file can have only one writer at a time; multiple threads are not allowed to write simultaneously.
   - Only appends are supported; random modification of files is not.
3. HDFS Architecture
HDFS uses a master/slave architecture to store data. It consists of four parts: the HDFS Client, the NameNode, the DataNode, and the Secondary NameNode. We introduce the four components below.
1. Client: the client.
   - Splits files. When uploading a file to HDFS, the Client divides it into blocks and then stores them.
   - Interacts with the NameNode to obtain file location information.
   - Interacts with DataNodes to read or write data.
   - Provides commands to manage HDFS, such as starting or stopping it.
   - Can access HDFS through a number of commands (a Java sketch follows this list).
2. NameNode: the master, a supervisor and manager.
   - Manages the HDFS namespace.
   - Manages data block (block) mapping information.
   - Configures the replica policy.
   - Handles client read and write requests.
3. DataNode: the slave. The NameNode issues commands; DataNodes perform the actual operations.
   - Stores the actual data blocks.
   - Performs read/write operations on data blocks.
4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode dies, it does not immediately take over and serve requests.
   - Assists the NameNode and shares its workload.
   - Periodically merges the fsimage and edits files and pushes the result to the NameNode.
   - In emergencies, it can assist in recovering the NameNode.
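To make the Client role concrete, here is a minimal sketch that uses the Hadoop Java FileSystem API to ask the NameNode for directory metadata. The hdfs://namenode:9000 endpoint and the listed path are assumptions; substitute your cluster's fs.defaultFS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed endpoint; use your cluster's fs.defaultFS instead.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // A metadata operation like listing talks to the NameNode only;
        // actual block data would be served by DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```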
4. HDFS Read and Write Processes
4.1. HDFS Block Size
Files in HDFS are physically divided into chunks (blocks). The block size can be specified via the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x and 64 MB in older versions.
HDFS blocks are larger than disk blocks in order to minimize addressing overhead. If the block is large enough, the time to transfer the data from disk is significantly longer than the time needed to locate the start of the block. The time to transfer a file made up of multiple blocks therefore depends on the disk transfer rate.
If the addressing time is about 10 ms and the transfer rate is 100 MB/s, then keeping the addressing time at only 1% of the transfer time means a transfer time of 10 ms / 1% = 1 s, so we set the block size to about 100 MB (the actual default is 128 MB):
Block size ≈ transfer rate × transfer time = 100 MB/s × 1 s = 100 MB
4.2. HDFS Write Data Flow
1) The client asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
2) The NameNode returns whether the file can be uploaded.
3) The client asks which DataNode servers the first block should be uploaded to.
4) The NameNode returns 3 DataNode nodes: dn1, dn2, and dn3.
5) The client asks dn1 to accept the upload; on receiving the request, dn1 calls dn2, and dn2 in turn calls dn3, completing the communication pipeline.
6) dn1, dn2, and dn3 acknowledge back to the client step by step.
7) The client starts uploading the first block to dn1 (first reading the data from disk into a local memory cache), packet by packet; dn1 passes each packet it receives to dn2, and dn2 passes it to dn3. Every packet dn1 sends is placed in a reply queue to await acknowledgement.
8) When one block finishes transferring, the client again asks the NameNode which servers to upload the next block to (steps 3 through 7 repeat).
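Seen from client code, the whole pipeline above is hidden behind a single output stream. Below is a minimal write sketch using the Hadoop FileSystem API, under the same assumed hdfs://namenode:9000 endpoint; the path and contents are illustrative only.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // create() asks the NameNode for target DataNodes; the packet
        // pipeline of steps 5) through 7) runs inside this stream.
        try (FSDataOutputStream out = fs.create(new Path("/user/data/hello.txt"))) {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}
```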
4.3. HDFS Read Data Flow
1) The client asks the NameNode to download a file; the NameNode queries its metadata to find the DataNode addresses where the file's blocks reside.
2) The client selects a DataNode (nearest first, then at random) and requests to read the data.
3) The DataNode starts transmitting data to the client (reading data from disk into a stream and verifying it packet by packet).
4) The client receives the data packet by packet, caches it locally, and then writes it to the destination file.
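A matching read-side sketch, with the same assumed endpoint: opening the file contacts the NameNode for block locations (step 1), and the returned stream pulls and verifies packets from the chosen DataNodes (steps 2 through 4).

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        try (FSDataInputStream in = fs.open(new Path("/user/data/hello.txt"))) {
            // Copy the file to stdout, 4 KB at a time; checksums are
            // verified as packets arrive from the DataNode.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```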
5. HDFS Commands
1) Basic syntax
bin/hadoop fs [specific command]
2) Common commands
(1) -help: print the help for a command
bin/hdfs dfs -help rm
(2) -ls: display directory information
hadoop fs -ls /
(3) -mkdir: create a directory on HDFS
hadoop fs -mkdir -p /aaa/bbb/cc/dd
(4) -moveFromLocal: cut and paste from local to HDFS
hadoop fs -moveFromLocal /home/hadoop/a.txt /aaa/bbb/cc/dd
(5) -moveToLocal: cut and paste from HDFS to local (not yet implemented)
hadoop fs -help moveToLocal
-moveToLocal <src> <localdst>:
  Not implemented yet
(6) -appendToFile: append a file to the end of a file that already exists
hadoop fs -appendToFile ./hello.txt /hello.txt
(7) -cat: display file contents
(8) -tail: display the end of a file
hadoop fs -tail /weblog/access_log.1
(9) -chgrp, -chmod, -chown: modify file ownership and permissions, with the same usage as in the Linux file system
hadoop fs -chmod 666 /hello.txt
hadoop fs -chown someuser:somegrp /hello.txt
(10) -copyFromLocal: copy files from the local file system to an HDFS path
hadoop fs -copyFromLocal ./jdk.tar.gz /aaa/
(11) -copyToLocal: copy from HDFS to local
hadoop fs -copyToLocal /user/hello.txt ./hello.txt
(12) -cp: copy from one HDFS path to another HDFS path
hadoop fs -cp /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2
(13) -mv: move files within HDFS
hadoop fs -mv /aaa/jdk.tar.gz /
(14) -get: equivalent to -copyToLocal, i.e. download files from HDFS to local
hadoop fs -get /user/hello.txt ./
(15) -getmerge: merge and download multiple files, e.g. when the HDFS directory /aaa/ contains multiple files log.1, log.2, log.3, ...
hadoop fs -getmerge /aaa/log.* ./log.sum
(16) -put: equivalent to -copyFromLocal
hadoop fs -put /aaa/jdk.tar.gz /bbb/jdk.tar.gz.2
(17) -rm: delete files or folders
hadoop fs -rm -r /aaa/bbb/
(18) -rmdir: delete an empty directory
hadoop fs -rmdir /aaa/bbb/ccc
(19) -df: show the free space of the file system
hadoop fs -df -h /
(20) -du: show the size of a folder
hadoop fs -du -s -h /user/data/wcinput
188.5 M  /user/data/wcinput
hadoop fs -du -h /user/data/wcinput
188.5 M  /user/data/wcinput/hadoop-2.7.2.tar.gz
97       /user/data/wcinput/wc.input
(21) -count: count the file nodes under a specified directory
hadoop fs -count /aaa/
hadoop fs -count /user/data/wcinput
1  2  197657784  /user/data/wcinput
(the columns are: number of directories, number of files, total size in bytes, path)
(22) -setrep: set the replication factor of a file in HDFS
hadoop fs -setrep 3 /aaa/jdk.tar.gz
The replication factor set here is only recorded in the NameNode metadata; whether that many replicas actually exist also depends on the number of DataNodes. With only 3 machines there can be at most 3 replicas; only when the number of nodes grows to 10 can the replication factor reach 10. A programmatic equivalent is sketched below.
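As referenced above, here is a sketch of the programmatic equivalent of -setrep via the FileSystem API; the endpoint and path are assumptions, and the semantics match the shell command.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSetrepExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        // Like `hadoop fs -setrep 3`, this only records the desired factor
        // in NameNode metadata; actual replicas depend on live DataNodes.
        fs.setReplication(new Path("/aaa/jdk.tar.gz"), (short) 3);
        fs.close();
    }
}
```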
6. HDFS Parameters

| No | Parameter name | Default value | Config file | Description |
| --- | --- | --- | --- | --- |
| 1 | dfs.block.size / dfs.blocksize | 134217728 | hdfs-site.xml | The default block size, in bytes, for new HDFS files. Note that this value is also used as the HBase region server HLog block size. |
| 2 | dfs.replication | 3 | hdfs-site.xml | The replication factor for HDFS file blocks. |
| 3 | dfs.webhdfs.enabled | true | hdfs-site.xml | Enables the WebHDFS interface on port 50070. |
| 4 | dfs.permissions | true | hdfs-site.xml | Enables HDFS file permission checking. |
| 5 | dfs.datanode.failed.volumes.tolerated | 0 | hdfs-site.xml | The maximum number of failed drives a DataNode tolerates; the default of 0 means the DataNode shuts down as soon as 1 drive fails. |
| 6 | dfs.data.dir / dfs.datanode.data.dir | xxx,xxx | hdfs-site.xml | DataNode data storage paths; multiple disks can be listed, comma-separated. |
| 7 | dfs.name.dir / dfs.namenode.name.dir | xxx,xxx | hdfs-site.xml | NameNode local metadata storage directories; multiple disks can be listed, comma-separated. |
| 8 | fs.trash.interval | 1 | core-site.xml | Trash checkpoint interval, in minutes. Set to 0 to disable the trash feature. |
| 9 | dfs.safemode.min.datanodes | 0 | hdfs-site.xml | The number of DataNodes that must be live before the NameNode exits safemode. A value less than or equal to 0 means the number of live DataNodes is not considered when deciding whether to remain in safemode during startup. A value greater than the number of DataNodes in the cluster makes safemode permanent. |
| 10 | dfs.client.read.shortcircuit | true | hdfs-site.xml | Enables HDFS short-circuit reads, which let a client co-located with the data bypass the DataNode and read file blocks directly from local disk. This improves performance for localized distributed clients. |
| 11 | dfs.datanode.handler.count | 3 | hdfs-site.xml | The number of DataNode server threads. The default is 3; for larger clusters it can be raised, e.g. to 8. |
| 12 | dfs.datanode.max.xcievers / dfs.datanode.max.transfer.threads | 256 | hdfs-site.xml | The maximum number of threads a DataNode uses to transfer data in and out, i.e. its maximum number of file-transfer threads. |
| 13 | dfs.balance.bandwidthPerSec / dfs.datanode.balance.bandwidthPerSec | 1048576 | hdfs-site.xml | The maximum bandwidth each DataNode can use for balancing, in bytes per second. |

Where a parameter has two names, the first is the old 1.x name and the second is the new 2.x name.
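These parameters can also be overridden per client without editing the *-site.xml files: values set explicitly on a Configuration object take precedence for that client. A minimal sketch (the values shown are simply the defaults from the table above):

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsConfExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Explicit set() overrides values loaded from *-site.xml.
        conf.set("dfs.replication", "3");        // replicas per block
        conf.set("dfs.blocksize", "134217728");  // 128 MB, in bytes
        System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
    }
}
```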
The information above was collected from the web; if you object, please let me know.