One. The background of HDFS
As society progresses, the amount of data that needs to be processed keeps growing. When it exceeds what a single operating system can handle, the data has to be spread across disks managed by many operating systems, but that is hard to manage and maintain. There is therefore an urgent need for a system that manages files across multiple machines, and this is how the distributed file system, or DFS (Distributed File System), came about.
So, what is a distributed file system? In short, it is a file system that allows files to be shared across multiple hosts on a network, so that multiple users on multiple machines can share files and storage space. Its most important characteristic is transparency: DFS actually accesses files over the network, but to users and programs it looks just like accessing a local disk. In other words, when you use DFS to access data, you do not feel that the data comes from different remote machines.
Figure 1. A typical DFS example
Two. An in-depth look at how HDFS works
As one of Hadoop's core technologies, HDFS (Hadoop Distributed File System) is the foundation of data storage and management in distributed computing. Its high fault tolerance, high reliability, high scalability, and high throughput provide robust storage for massive amounts of data and make it much easier for applications to process very large data sets.
Figure 2. The Hadoop HDFS logo
When talking about HDFS, Google's GFS has to be mentioned: HDFS is an open-source implementation of GFS, built on the ideas described in Google's GFS paper.
2.1 Design Prerequisites and Objectives
(1) Hardware failure is the norm rather than the exception. (HDFS is designed to run on large numbers of commodity machines, so hardware failures are routine.) Error detection and fast, automatic recovery are therefore core design goals of HDFS.
(2) Streaming data access. (HDFS emphasizes high throughput of data access rather than low latency.)
(3) Large data sets. (Typical files stored in HDFS are gigabytes or even terabytes in size.)
(4) A simple coherency model. (Files follow a write-once, read-many access pattern.)
(5) Moving computation is cheaper than moving data. (For large files, it is cheaper to move the computation close to the data than to move the data to the computation.)
2.2 The architecture of HDFS
HDFS has a master/slave architecture, as shown in Figure 3.
Figure 3. The basic architecture of HDFS
From the end user's point of view, HDFS behaves like a traditional file system: you can use directory paths to create, read, update, and delete files. But because the storage is distributed, HDFS consists of one NameNode and a number of DataNodes. The NameNode manages the file system's metadata, while the DataNodes store the actual data. A client interacts with both: it contacts the NameNode to obtain a file's metadata, and the real I/O goes directly to the DataNodes.
Let's take a look at how HDFS handles read and write operations:
① Read operations
Figure 4. Read operations in HDFS
To read a file, the client first obtains from the NameNode a list of the locations of the file's data blocks, that is, which DataNodes each block is stored on, and then reads the block data directly from those DataNodes. The NameNode does not take part in the actual transfer of file data.
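To make the read path more concrete, Listing 1 is a minimal sketch of a client read using the Hadoop Java API. The NameNode URI hdfs://namenode:9000 and the path /di/test.log are placeholders for illustration, not values taken from a real cluster.

Listing 1. A minimal sketch of an HDFS read (Java)

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in a real cluster this usually comes from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");

        // The FileSystem client asks the NameNode where the file's blocks live...
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/di/test.log"));

        // ...and then streams the block data directly from the DataNodes.
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}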
② Write operations
Figure 5. Write operations in HDFS
The client first sends a write request to the NameNode, which returns information about the DataNodes it manages based on the file size and the block configuration. The client (through its client library) then splits the file into blocks and writes them in sequence to the DataNodes, following the address information returned by the NameNode.
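Listing 2 is the corresponding write sketch, again with a placeholder URI and path. Closing the output stream is what completes the write: by then the client has streamed the file's blocks to the DataNodes chosen by the NameNode.

Listing 2. A minimal sketch of an HDFS write (Java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to allocate the file; the returned stream
        // writes the data block by block to the DataNodes the NameNode picks.
        FSDataOutputStream out = fs.create(new Path("/di/hello.txt"));
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close(); // closing the stream completes the write
        fs.close();
    }
}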
Now let's look at what the NameNode and the DataNode actually do:
(1) NameNode
The NameNode manages the file system's directory structure and the data nodes. It maintains two mappings: one from files and directories to data blocks, and one from data blocks to DataNodes. The first mapping is static: it is persisted on disk and maintained through the fsimage and edits files. The second is dynamic: it is not persisted to disk and is rebuilt automatically each time the cluster starts.
(2) DataNode
The DataNodes are where HDFS actually stores the data. One concept worth introducing here is the block. Suppose a file is 100 GB: starting at byte 0, every 64 MB is cut into one block, and so on, so the file is split into many blocks (100 GB / 64 MB = 1,600 blocks in this example). Each block is 64 MB by default, and the block size can be customized.
(3) Typical deployment
A typical HDFS deployment runs the NameNode on a dedicated machine, with every other machine in the cluster running a DataNode. (A DataNode can also run on the same machine as the NameNode, and a single machine may run multiple DataNodes.) Having only one NameNode per cluster greatly simplifies the system architecture, although a single NameNode is also a single point of failure; this problem is addressed in Hadoop 2.x and later versions.
2.3 Reliability measures in HDFS
HDFS has a fairly complete redundancy, backup, and recovery mechanism that allows large numbers of files to be stored reliably in the cluster.
(1) Redundant backup: HDFS stores each file as a sequence of data blocks; the default block size is 64 MB, and both the block size and the number of replicas per block (3 by default) can be customized (see Listing 3 after this list). When a DataNode starts, it scans its local file system, builds a list of the HDFS blocks it holds and the local files they correspond to, and sends this block report (Blockreport) to the NameNode; the report contains a list of all the blocks on that DataNode.
(2) Replica placement: An HDFS cluster usually runs across multiple racks, and machines on different racks communicate through switches. The replica placement strategy is critical because the bandwidth between nodes within a rack is larger than the bandwidth between nodes on different racks, which affects both the reliability and the performance of HDFS. HDFS uses a rack-aware policy to improve data reliability, availability, and network bandwidth utilization. With the default replication factor of 3, HDFS keeps one replica on a node in the local rack, a second replica on another node in the same rack, and the last replica on a node in a different rack. This strategy reduces data transfer between racks and improves the efficiency of write operations; since rack failures are far rarer than node failures, it does not hurt the reliability or availability of the data.
Figure 6. The replica placement policy
(3) Heartbeat detection: The NameNode periodically receives heartbeats and block reports from every DataNode in the cluster and uses them to validate the block mappings and other file system metadata. Receiving a heartbeat means the DataNode is working properly. If a DataNode stops sending heartbeats, the NameNode marks it as down and no longer sends it any I/O requests.
(4) Safe mode
(5) Data integrity checking
(6) Space reclamation
(7) Metadata disk failure
(8) Snapshots (not yet supported by HDFS at the time of writing)
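As a small illustration of the defaults mentioned in item (1), Listing 3 sets the replication factor and block size from client code. dfs.replication and dfs.block.size are the classic property names (newer releases call the block size key dfs.blocksize); the same keys can equally be set in hdfs-site.xml.

Listing 3. Customizing replication and block size (Java, a sketch)

import org.apache.hadoop.conf.Configuration;

public class HdfsConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                 // 3 copies of every block (the default)
        conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB blocks (the classic default)

        System.out.println("replication = " + conf.getInt("dfs.replication", 0));
        System.out.println("block size  = " + conf.getLong("dfs.block.size", 0) + " bytes");
    }
}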
Three. Common HDFS shell operations
(1) List the files in a directory: hadoop fs -ls <directory path>
View the contents of the HDFS root directory: hadoop fs -ls /
Recursively view the contents of the HDFS root directory: hadoop fs -lsr /
(2) Create a folder in HDFS: hadoop fs -mkdir <folder path>
Create a folder called di under the root directory: hadoop fs -mkdir /di
(3) Upload a file to HDFS: hadoop fs -put <local source path> <destination path>
Upload a log file from the local system to the di folder: hadoop fs -put test.log /di
*PS: Files uploaded through the Hadoop shell are stored as blocks on the DataNodes; from the Linux shell you cannot see the original files, only the blocks. HDFS can therefore be summed up in one sentence: a client's large file is stored as data blocks spread across many nodes.
(4) Download a file from HDFS: hadoop fs -get <HDFS file path> <local path>
Download the test.log you just uploaded to your local Desktop folder: hadoop fs -get /di/test.log /home/hadoop/Desktop
(5) View a file directly in HDFS: hadoop fs -text (or -cat) <file path>
View the test.log file you just uploaded: hadoop fs -text /di/test.log
(6) Delete a file or folder in HDFS: hadoop fs -rm (or -rmr) <path>
Delete the test.log file you just uploaded: hadoop fs -rm /di/test.log
Delete the di folder: hadoop fs -rmr /di
(7) Get help on a command: hadoop fs -help <command>
View the help for the ls command: hadoop fs -help ls
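For readers who prefer the Java API to the shell, Listing 4 sketches rough programmatic equivalents of the commands above; the NameNode URI and the paths are placeholders, and error handling is omitted.

Listing 4. Java FileSystem API equivalents of the shell commands (a sketch)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/di"));                                   // hadoop fs -mkdir /di
        fs.copyFromLocalFile(new Path("test.log"), new Path("/di"));  // hadoop fs -put test.log /di
        for (FileStatus status : fs.listStatus(new Path("/"))) {      // hadoop fs -ls /
            System.out.println(status.getPath());
        }
        fs.copyToLocalFile(new Path("/di/test.log"),
                new Path("/home/hadoop/Desktop"));                    // hadoop fs -get /di/test.log ...
        fs.delete(new Path("/di"), true);                             // hadoop fs -rmr /di
        fs.close();
    }
}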
A little progress every day: an introduction to the basics of Hadoop HDFS.