HDFS-related links
http://hadoop.apache.org/docs/current/api/ (Apache Hadoop Main 2.7.1 API)
HTTP://SLAYTANIC.BLOG.51CTO.COM/2057708/1101111/ (hdfs-site.xml configuration item descriptions)
HTTP://ARCHIVE-PRIMARY.CLOUDERA.COM/CM5/INSTALLER/5.4.3/ (cloudera-manager-installer.bin)
HTTP://ARCHIVE-PRIMARY.CLOUDERA.COM/CDH5/ (Cloudera official repository)
Http://www.cloudera.com/content/cloudera/en/home.html (Cloudera official website)
http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html (HDFS shell commands)
Http://www.cnblogs.com/xia520pi/archive/2012/05/28/2520813.html (Reference article)
About HDFS
HDFS (Hadoop Distributed File System) is a core subproject of the Hadoop project and the foundation of data storage management in distributed computing. It was designed around streaming data access and the processing of very large files, and it runs on inexpensive commodity servers. It offers high fault tolerance, reliability, scalability, availability, and throughput, which makes it suitable for storing massive amounts of data and convenient for applications that process large data sets.
HDFS Basic Concepts
1 Data block (block)
The most basic storage unit in HDFS is the data block, which is 64 MB by default (Hadoop 2.x and later default to 128 MB).
As in an ordinary file system, files in HDFS are split into block-sized chunks, and each chunk is stored as a separate block.
Unlike an ordinary file system, a file in HDFS that is smaller than one block does not occupy the whole block's storage space.
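To make the block arithmetic concrete, here is a minimal sketch in Python. It is not HDFS code: the function name `split_into_blocks` is made up for illustration, and it only models the sizes of the blocks a file would occupy, assuming the 64 MB block size described above.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the size in bytes of each block a file of file_size bytes occupies.

    Hypothetical helper for illustration only; it models the rule described
    above, not the real HDFS implementation.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes up as much space as the data it holds.
        sizes.append(remainder)
    return sizes


# A 150 MB file occupies two full blocks and one 22 MB block.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]

# A 1 MB file occupies a single 1 MB block, not a full 64 MB.
print([size // MB for size in split_into_blocks(1 * MB)])  # [1]
```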
2 NameNode and DataNode
The HDFS architecture has two types of nodes: the NameNode, also called the "metadata node", and the DataNode, also called the "data node". They act as the master and the workers respectively: the NameNode coordinates, and the DataNodes carry out the actual storage tasks.
2.1 The NameNode manages the file system namespace
It keeps the metadata for all files and directories in a file system tree.
This information is also persisted on disk in two files: the namespace image (fsimage) and the edit log.
It also tracks which data blocks make up each file and which DataNodes hold those blocks. This location information is not stored on disk; it is collected from the DataNodes when the system starts.
2.2 DataNodes are where the file system actually stores its data
A client or the NameNode can ask a DataNode to write or read data blocks.
Each DataNode periodically reports the list of blocks it stores back to the NameNode.
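The division of labor between the two node types can be sketched as follows. This is a toy model, not the Hadoop API: the class `ToyNameNode`, its methods, and the names `dn1`, `dn2`, and `blk_1` are all assumptions made for illustration. It shows the one idea from the text: the file-to-block mapping belongs to the namespace, while the block-to-DataNode mapping lives only in memory and is rebuilt from block reports.

```python
from collections import defaultdict


class ToyNameNode:
    """Toy model of NameNode bookkeeping (hypothetical, not Hadoop code)."""

    def __init__(self, namespace):
        # File path -> ordered list of block ids. In real HDFS this part of
        # the metadata is persisted in the namespace image and edit log.
        self.namespace = dict(namespace)
        # Block id -> set of DataNode names. Kept in memory only and rebuilt
        # from the reports that DataNodes send.
        self.block_locations = defaultdict(set)

    def receive_block_report(self, datanode, block_ids):
        # Forget what this DataNode reported earlier, then record the new list.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        # For each block of the file, list the DataNodes known to hold it.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


nn = ToyNameNode({"/logs/a.txt": ["blk_1", "blk_2"]})
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2"])
print(nn.locate("/logs/a.txt"))
# [('blk_1', ['dn1']), ('blk_2', ['dn1', 'dn2'])]
```

Before any block report arrives, `locate` returns empty holder lists, which mirrors why the NameNode has to wait for DataNode reports after a restart.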
2.3 Secondary NameNode
The secondary NameNode is not a standby that takes over when the NameNode has problems; it has a different responsibility from the NameNode.
Its main function is to periodically merge the NameNode's namespace image file with the edit log, so that the edit log does not grow too large. The process is described in detail below.
A copy of the merged namespace image file is also kept on the secondary NameNode, so it can be used for recovery when the NameNode fails.
3 Namespace image file and edit log
1) When a file system client performs a write operation, the operation is first recorded in the edit log.
2) The NameNode keeps the file system metadata in memory. After the edit log entry is recorded, the NameNode updates its in-memory data structures.
3) Before each write operation is reported as successful, the edit log is synced to the file system.
4) The fsimage file, also known as the namespace image file, is a checkpoint of the in-memory metadata on disk. It is stored in a serialized format and cannot be modified in place on disk.
5) As in a database, when the NameNode fails, the metadata of the latest checkpoint is loaded from fsimage into memory, and the operations in the edit log are then replayed one by one.
6) The secondary NameNode helps the NameNode checkpoint its in-memory metadata to disk.
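The write path and the recovery path in points 1) to 5) can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not how HDFS is implemented: the class `ToyMetadataStore`, its operations `create` and `delete`, and the use of JSON as the serialized image format are all hypothetical.

```python
import json


class ToyMetadataStore:
    """Toy model of edit log plus fsimage recovery (hypothetical, not Hadoop code)."""

    VALID_OPS = ("create", "delete")

    def __init__(self):
        self.memory = {}      # in-memory namespace: path -> metadata
        self.fsimage = "{}"   # serialized checkpoint, stands in for the file on disk
        self.edit_log = []    # edits recorded since the last checkpoint

    @staticmethod
    def _apply(namespace, op, path, value):
        if op == "create":
            namespace[path] = value
        else:  # "delete"
            namespace.pop(path, None)

    def write(self, op, path, value=None):
        if op not in self.VALID_OPS:
            raise ValueError("unknown operation: %r" % (op,))
        # Record the edit first, then update the in-memory structure.
        self.edit_log.append((op, path, value))
        self._apply(self.memory, op, path, value)

    def checkpoint(self):
        # Serialize the in-memory metadata and start an empty edit log.
        self.fsimage = json.dumps(self.memory, sort_keys=True)
        self.edit_log = []

    def recover(self):
        # Load the latest checkpoint, then replay the logged edits in order.
        namespace = json.loads(self.fsimage)
        for op, path, value in self.edit_log:
            self._apply(namespace, op, path, value)
        self.memory = namespace


store = ToyMetadataStore()
store.write("create", "/a", {"replication": 3})
store.checkpoint()
store.write("create", "/b", {"replication": 2})
store.write("delete", "/a")

store.memory = {}  # simulate a failure that loses the in-memory state
store.recover()
print(store.memory)  # {'/b': {'replication': 2}}
```

The recovered state matches what was in memory before the simulated failure, because the checkpoint covers everything up to `/a` and the edit log covers everything after it.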
The checkpoint process is as follows:
1. The secondary NameNode asks the NameNode to start a new edit log file; subsequent edits are written to the new file.
2. The secondary NameNode fetches the fsimage file and the old edit log from the NameNode using HTTP GET.
3. The secondary NameNode loads the fsimage file into memory, applies the operations in the edit log, and generates a new fsimage file.
4. The secondary NameNode sends the new fsimage file back to the NameNode using HTTP POST.
5. The NameNode replaces the old fsimage file and the old edit log with the new fsimage file and the new edit log (created in step 1), then updates the fstime file to record the time of this checkpoint.
6. As a result, the fsimage file on the NameNode holds the metadata of the most recent checkpoint, and the edit log starts over, so it does not grow very large.
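The six steps above can be sketched as a single-process simulation. Everything here is a hypothetical stand-in: the class `CheckpointingNameNode`, the function `secondary_checkpoint`, and the method names are invented for illustration, plain method calls replace the HTTP GET and POST transfers, and JSON stands in for the real fsimage format.

```python
import json
import time


def apply_edit(namespace, edit):
    """Apply one (op, path, value) edit to a namespace dict."""
    op, path, value = edit
    if op == "create":
        namespace[path] = value
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError("unknown operation: %r" % (op,))


class CheckpointingNameNode:
    """Toy NameNode side of the checkpoint (hypothetical, not Hadoop code)."""

    def __init__(self):
        self.fsimage = "{}"       # serialized namespace image
        self.edits = []           # current edit log
        self.edits_new = None     # new edit log opened during a checkpoint
        self.pending_fsimage = None
        self.fstime = None        # time of the last checkpoint

    def log(self, edit):
        # While a checkpoint is running, edits go to the new log file.
        target = self.edits_new if self.edits_new is not None else self.edits
        target.append(edit)

    def roll_edit_log(self):      # step 1
        self.edits_new = []

    def download(self):           # step 2 (HTTP GET in real HDFS)
        return self.fsimage, list(self.edits)

    def upload(self, fsimage):    # step 4 (HTTP POST in real HDFS)
        self.pending_fsimage = fsimage

    def finalize_checkpoint(self):  # step 5
        self.fsimage = self.pending_fsimage
        self.edits = self.edits_new
        self.edits_new = None
        self.pending_fsimage = None
        self.fstime = time.time()


def secondary_checkpoint(namenode):
    """Toy secondary NameNode: merge the image with the old edit log."""
    namenode.roll_edit_log()                        # step 1
    fsimage, old_edits = namenode.download()        # step 2
    namespace = json.loads(fsimage)                 # step 3: load the image,
    for edit in old_edits:                          # replay the old edits,
        apply_edit(namespace, edit)
    new_fsimage = json.dumps(namespace, sort_keys=True)  # write a new image
    namenode.upload(new_fsimage)                    # step 4
    namenode.finalize_checkpoint()                  # step 5
    return new_fsimage


nn = CheckpointingNameNode()
nn.log(("create", "/a", {"replication": 3}))
nn.log(("create", "/b", {"replication": 3}))
nn.log(("delete", "/a", None))
secondary_checkpoint(nn)
nn.log(("create", "/c", {"replication": 1}))

print(nn.fsimage)  # {"/b": {"replication": 3}}
print(nn.edits)    # [('create', '/c', {'replication': 1})]
```

After the checkpoint, the image already reflects the first three edits, so the edit log only contains the edit made afterwards. This is the effect described in step 6.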
Big Data: Learning about HDFS