As one of the core technologies of Hadoop, HDFS (Hadoop Distributed File System) is the foundation of data storage and management in distributed computing. It offers high reliability, scalability, availability, and throughput, which makes it well suited to applications that work with large data sets.
I. Design Premises and Goals
HDFS is an open-source implementation of Google's GFS (Google File System). Its design is based on the following five basic premises and goals:
- Hardware failure is the norm, not the exception. HDFS usually runs on commodity hardware, where hardware failure is an ordinary event, so error detection and fast, automatic recovery are among the most central design goals of HDFS.
- Streaming data access. Applications running on HDFS are mainly batch-processing jobs rather than interactive, user-facing transactions, so data is read primarily as long streaming scans.
- Large data sets. Typical files in HDFS range from gigabytes to terabytes in size.
- Simple consistency model. HDFS applications generally follow a write-once, read-many pattern: once a file has been created, written, and closed, its contents are normally never changed again. This simple consistency model makes high-throughput data access possible.
- Data locality ("moving computation is cheaper than moving data"). HDFS provides interfaces that let an application move its computation to the nodes where the data resides. For the large data sets and files HDFS handles, moving the computation is cheaper than moving the data: it improves bandwidth utilization, increases system throughput, and reduces network congestion.
II. HDFS Architecture
HDFS has a master/slave architecture. An HDFS cluster consists mainly of two types of nodes: a NameNode (NN) and a number of DataNodes (DN).
The NameNode is the master server. It manages the namespace of the HDFS file system, records which data blocks of each file reside on which DataNodes along with their replica information, coordinates client access to files, and records changes to the namespace and to the namespace's properties.
A DataNode is a data storage node; each one manages the storage attached to the physical node on which it runs. Files in HDFS are stored as blocks, with a default block size of 64 MB.
When a client operates on data, it contacts the NameNode only to obtain the physical locations of the relevant DataNodes; the actual reading and writing of data is handled entirely by the DataNodes, without the NameNode's involvement.
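To make this flow concrete, here is a minimal Java client sketch (the NameNode URI and file path are illustrative, not taken from the article): the client asks the NameNode only for metadata and block locations, while the file bytes themselves are streamed to and from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally picked up from core-site.xml.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode allocates blocks, the bytes go to DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello HDFS");
        out.close();

        // Read: the client fetches block locations from the NameNode,
        // then streams the data directly from the DataNodes.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
        fs.close();
    }
}
```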
Because there is only one NameNode in HDFS, it is a single point of failure: if the NameNode goes down, HDFS becomes unavailable and data may be lost. Common mitigations are to run a SecondaryNameNode or to write the NameNode's metadata out to other remote file systems as well.
III. HDFS Reliability Assurance Measures
One of the main design objectives of HDFS is to store data reliably even in the presence of failures, so HDFS has well-developed redundancy, backup, and recovery mechanisms. The number of replicas is set by dfs.replication and defaults to 3.
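As a rough illustration (the property name dfs.replication comes from the text above; class and path names are just examples), the replication factor can be set in the client configuration or changed for an existing file through the Java FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default replication for files created with this configuration.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file (path is illustrative).
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        fs.close();
    }
}
```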
- Redundant backups. Data is written to multiple DataNodes; if some of them fail, the data can still be read from the remaining nodes and re-replicated onto others until the number of replicas again reaches the value configured with dfs.replication.
- Replica placement. HDFS uses a rack-aware policy to improve data reliability, availability, and network bandwidth utilization. With a replication factor of 3, the placement policy is: the first replica is placed on a node in the local rack (when the client runs inside the cluster) or on a random node (when the client runs outside the cluster); the second replica is placed on another node in the local rack; the third replica is placed on a node in a different rack. This policy prevents data loss when an entire rack fails while still taking advantage of the high bandwidth within a rack.
- Heartbeat detection. The NameNode periodically receives heartbeats and block reports from every DataNode in the cluster and uses these reports to validate its block mappings and other file system metadata. If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as down and no longer sends any I/O requests to it. A DataNode outage can also leave blocks under-replicated, which triggers re-replication. Re-replication is generally triggered for one of several reasons: a DataNode becomes unavailable, a replica becomes corrupted, a disk on a DataNode fails, or the replication factor is increased.
- Safe mode. When HDFS starts, it first passes through a safe mode, during which writes to data blocks are not allowed. The NameNode checks the replica counts reported by the DataNodes; blocks that have not reached the minimum replication level are scheduled for replication, and the NameNode leaves safe mode automatically only once the fraction of blocks satisfying the minimum replication level exceeds dfs.safemode.threshold.pct (default 0.999f). Conversely, if more than 1 - 0.999f of the blocks fall below the minimum replication level, HDFS enters safe mode.
- Data integrity checking. HDFS verifies the contents of files with checksums (CRC, cyclic redundancy check). When a data file is written, the checksum of each data block is written to a separate hidden checksum file. When a client fetches the file, it checks whether the checksum of the data obtained from the DataNode matches the checksum in the hidden file; if not, the client assumes the data block is corrupt, fetches the block from another DataNode, and reports the corrupt replica on that DataNode to the NameNode. (A client-side sketch is shown after this list.)
- Recycle bin. Files deleted from HDFS are first moved to a trash folder (/trash) so that data can easily be recovered. Only when a deleted file has stayed in the trash longer than the configured time threshold (by default 6 hours) does HDFS delete its data blocks permanently.
- Image file and transaction log. These two data structures (the namespace image, FsImage, and the edit log, EditLog) are the core metadata structures of HDFS.
- Snapshots. Support for snapshots (point-in-time copies of data) is not covered in detail here.
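To illustrate the checksum mechanism mentioned above from the client side, here is a minimal sketch assuming the standard Java FileSystem API (the path is illustrative): it retrieves a file's checksum and reads the file with checksum verification enabled, so a mismatching replica would cause the client to fall back to another DataNode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt");

        // Ask HDFS for the file's checksum (derived from the per-block checksums).
        FileChecksum sum = fs.getFileChecksum(file);
        System.out.println("checksum: " + sum);

        // Reads verify checksums by default; verification can be toggled explicitly.
        fs.setVerifyChecksum(true);
        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[1024];
        int n = in.read(buf);
        System.out.println("read " + n + " bytes with checksum verification");
        in.close();
        fs.close();
    }
}
```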
IV. HDFS Commands
- bin/hadoop dfs -ls: Similar to the ls command in Linux; lists files.
- bin/hadoop dfs -put src dest: Upload local files to HDFS.
- bin/hadoop dfs -get src dest: Copy HDFS files to the local file system.
- bin/hadoop dfs -rmr path: Delete files or directories recursively.
- bin/hadoop dfs -cat in/*: View the contents of files.
- bin/hadoop dfsadmin -report: View basic HDFS statistics; the same information is also available at http://fs4:50070.
- bin/hadoop dfsadmin -safemode enter/leave/get/wait: Operate on safe mode.
- bin/start-dfs.sh: Run on a newly added node to bring the new HDFS node into the cluster.
- bin/start-balancer.sh: Rebalance data across the cluster (load balancing).
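The same operations can also be performed programmatically through the Java FileSystem API; the sketch below shows rough equivalents of -ls, -put, -get, and -rmr (all paths are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShellEquivalentsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop dfs -ls /user/demo
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath());
        }

        // hadoop dfs -put local.txt /user/demo/local.txt
        fs.copyFromLocalFile(new Path("local.txt"), new Path("/user/demo/local.txt"));

        // hadoop dfs -get /user/demo/local.txt copy.txt
        fs.copyToLocalFile(new Path("/user/demo/local.txt"), new Path("copy.txt"));

        // hadoop dfs -rmr /user/demo/old  (recursive delete)
        fs.delete(new Path("/user/demo/old"), true);

        fs.close();
    }
}
```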