Hadoop consists of two main parts: HDFS and the MapReduce engine. At the bottom is HDFS, which stores files across all storage nodes in the Hadoop cluster. Above HDFS sits the MapReduce engine, which consists of JobTrackers and TaskTrackers.
I. Basic Concepts of HDFS
1. Data Block
The basic storage unit in HDFS is the data block, which defaults to 64 MB. A data block can be understood much like a block in an ordinary file system. Unlike an ordinary file system, however, if a file in HDFS is smaller than one data block, it does not occupy the full block's storage space.
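For example, the block size a cluster will use for new files can be queried through the Java API. This is a minimal sketch, assuming a NameNode at the placeholder address hdfs://localhost:9000:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Default block size used for new files under the given path.
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");
        fs.close();
    }
}
```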
2. Metadata nodes and data nodes
The metadata node (NameNode) manages the file system namespace; it stores the metadata for all files and directories in a file system tree. Data nodes (DataNodes) store the actual file data. The secondary metadata node (SecondaryNameNode) is not, as its name might suggest, a standby for the metadata node; its main function is to periodically merge the namespace image file of the metadata node with the edit log, preventing the edit log from growing too large.
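A directory listing illustrates this division of labor: the request in the sketch below is answered entirely from the metadata node's in-memory namespace, without contacting any data node. The NameNode address is again a placeholder:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // listStatus() is served from the metadata node's namespace;
        // file sizes, replication, and permissions are all metadata.
        for (FileStatus st : fs.listStatus(new Path("/"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    st.getPath(), st.getLen(), st.getReplication());
        }
        fs.close();
    }
}
```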
3. Data Flow in HDFS
Reading a file
The client opens the file with FileSystem's open() function. DistributedFileSystem contacts the metadata node via RPC and obtains the file's data block information; for each block, the metadata node returns the addresses of the data nodes holding that block. DistributedFileSystem returns an FSDataInputStream to the client for reading the data. The client calls the stream's read() function to begin reading. DFSInputStream connects to the closest data node holding the first block of the file. When a block has been streamed from the data node to the client, DFSInputStream closes the connection to that data node and connects to the closest data node holding the next block of the file. When the client has finished reading, it calls FSDataInputStream's close() function.
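The read path above corresponds to the standard Hadoop Java API usage below. This is a minimal sketch; the NameNode address and file path are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address and file path; adjust for your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        FSDataInputStream in = null;
        try {
            // open() triggers the RPC to the metadata node described above.
            in = fs.open(new Path("/user/test/input.txt"));
            // Reading from the stream pulls blocks from the nearest DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in); // corresponds to the close() step above
            fs.close();
        }
    }
}
```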
The whole process:
[Figure: HDFS read process — http://image.evget.com/images/article/2015/hdfs01.png]
Writing a file
The client calls create() to create the file, and DistributedFileSystem creates a new file in the file system namespace with an RPC call to the metadata node. The metadata node first verifies that the file does not already exist and that the client has permission to create it, and then creates the new file. DistributedFileSystem returns a DFSOutputStream, which the client uses to write data. As the client writes, DFSOutputStream splits the data into blocks and writes them to the data queue. The data queue is read by the DataStreamer, which asks the metadata node to allocate data nodes for storing the blocks (each block has 3 replicas by default). The allocated data nodes form a pipeline. The DataStreamer writes each block to the first data node in the pipeline; the first data node forwards it to the second, and the second forwards it to the third. DFSOutputStream also keeps an ack queue of blocks that have been sent out, waiting for the data nodes in the pipeline to confirm that the data was written successfully. If a data node fails during the write, the pipeline is closed and the blocks in the ack queue are placed back at the front of the data queue.
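The write path corresponds to the following sketch, again with a placeholder NameNode address and target path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address and target path; adjust for your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // create() performs the RPC to the metadata node described above;
        // it fails if the file already exists or the client lacks permission.
        FSDataOutputStream out = fs.create(new Path("/user/test/output.txt"));
        try {
            // Data written here is queued and pushed through the DataNode pipeline.
            out.writeUTF("hello HDFS");
        } finally {
            out.close(); // flushes remaining data and waits for acks
            fs.close();
        }
    }
}
```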
The whole process:
[Figure: HDFS write process — http://image.evget.com/images/article/2015/hdfs02.png]
II. Advantages and Disadvantages of HDFS
2.1 Advantages of HDFS
1) Handling very large files
Very large files here typically means files ranging from hundreds of megabytes up to hundreds of terabytes in size. In real-world applications, HDFS is already used to store and manage petabytes of data.
2) Streaming data access
HDFS is designed around efficiently serving "write once, read many" workloads. Once a dataset is generated by a data source, it is copied and distributed to different storage nodes and then serves a variety of data analysis requests. In most cases an analysis task involves a large portion of the dataset, so for HDFS it is more efficient to read through the entire dataset than to seek out a single record.
3) Running on clusters of cheap commodity hardware
Hadoop places low demands on hardware: it only needs to run on clusters of low-cost commodity machines, not expensive, highly reliable servers. Cheap commodity machines also mean a high probability of node failure in a large cluster, which requires the design of HDFS to take the reliability, safety, and high availability of data into account.
2.2 Disadvantages of HDFS
1) Not suitable for low-latency data access
If you need to serve application requests with low latency, HDFS is not a good choice. HDFS is designed for the analysis of large datasets and is optimized primarily for high data throughput, which may come at the cost of high latency.
Improvement strategies: For applications with low-latency requirements, HBase is currently a better choice. As a top-level data management project built on HDFS, it makes up for this deficiency as much as possible; its performance is greatly improved, and its slogan is "goes real time". Using caching or a multi-master design can also reduce the data request pressure from clients and lower latency. Finally, one could modify HDFS internals, but that means trading off high throughput against low latency; HDFS is not a universal silver bullet.
2) Unable to efficiently store large numbers of small files
Because the NameNode keeps the file system's metadata in memory, the number of files a file system can hold is limited by the size of the NameNode's memory. As a rule of thumb, each file, directory, and block occupies about 150 bytes; so 1 million files, each occupying one block, amount to roughly 2 million objects and need at least 300 MB of memory. Millions of files are still feasible today, but scaling to billions is beyond what current hardware can deliver. Another problem is that the number of map tasks is determined by the number of splits, so using MapReduce to process a large number of small files produces too many map tasks, and the thread-management overhead increases job time. For example, to process 10000 MB of data: if each split is 1 MB, there will be 10,000 map tasks and considerable thread overhead; if each split is 100 MB, there will be only 100 map tasks, each map task does much more work, and the thread-management overhead drops accordingly.
Improvement strategies: there are several ways to make HDFS handle small files better.
Archive small files using SequenceFile, MapFile, HAR (Hadoop Archives), or similar mechanisms. The principle of this approach is to manage small files by archiving them; HBase builds on this idea. With this method, retrieving the content of an original small file requires knowing its mapping into the archive file (see the SequenceFile sketch after this list).
Scale out horizontally: since a single Hadoop cluster can manage only a limited number of small files, put several Hadoop clusters behind a virtual server to form one large Hadoop cluster. Google has done something similar.
Use a multi-master design, whose benefit here is obvious. The GFS II under development is also planned as a distributed multi-master design that supports master failover and changes the block size to 1 MB, deliberately tuned for handling small files.
Alibaba's DFS design is also a multi-master design. It separates the storage and the management of the metadata mapping, and consists of multiple metadata storage nodes plus a single query master node.
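As a rough sketch of the SequenceFile approach mentioned above: many small local files can be packed into one HDFS SequenceFile, with the file name as the key and the raw bytes as the value. All paths here are hypothetical placeholders, and retrieving an original file later requires scanning the archive or keeping an index of keys.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical archive location on HDFS.
        Path archive = new Path("/user/test/smallfiles.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(archive),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Hypothetical local directory of small files:
            // key = file name, value = raw file bytes.
            File[] files = new File("/tmp/smallfiles").listFiles();
            if (files == null) return; // directory missing or unreadable
            for (File f : files) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}
```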
3) No support for multiple writers or arbitrary file modification
A file in HDFS has only one writer at a time, and writes can only happen at the end of the file; that is, only append operations are possible. HDFS currently supports neither multiple users writing to the same file nor modifying it at arbitrary positions within the file.
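This append-only model is visible in the Java API: FileSystem offers append() but no call for writing at an arbitrary offset. A minimal sketch, assuming append is enabled on the cluster and using placeholder paths:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address and existing file; append must be enabled.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // append() is the only way to add data to an existing file;
        // there is no API for writing at an arbitrary offset.
        FSDataOutputStream out = fs.append(new Path("/user/test/log.txt"));
        try {
            out.writeBytes("one more record\n");
        } finally {
            out.close();
            fs.close();
        }
    }
}
```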