Data-Intensive Text Processing with MapReduce, Chapter 2: MapReduce Basics (3)

Table of contents for these notes: http://www.cnblogs.com/mdyang/archive/2011/06/29/data-intensive-text-prcessing-with-mapreduce-contents.html
2.5 Distributed File System HDFS

Traditional large-scale data processing from the perspective of data placement

The discussion so far has focused on processing, but without data there is nothing to process. In traditional cluster architectures (such as HPC), computing and storage are two separate components. Although implementations vary from system to system, the overall idea is the same: compute nodes read data from storage nodes, process it, and write back the results.

As data volumes grow, so does the demand for computing power, and as computing power increases, the link bandwidth between storage nodes and compute nodes gradually becomes the bottleneck of overall system performance. One way to solve this problem is to buy faster network equipment and increase the bandwidth. This is usually very expensive: as transmission rates go up, the price of network equipment grows faster than linearly, and a device that is 10 times faster often costs well over 10 times as much. The other approach is to avoid separating computation and storage in the first place.

The distributed file system (DFS) underlying MapReduce takes the latter approach. Google's own MapReduce implementation runs on the Google File System (GFS), while the open-source Hadoop uses the Hadoop Distributed File System (HDFS). MapReduce can work without a DFS, but it then loses a major advantage: the co-location of computation and storage.

GFS and HDFS

Distributed file systems were nothing new when MapReduce was born, but the DFSs used by MapReduce (such as GFS and HDFS) improve on earlier work to make them better suited to large-scale data processing. The core ideas of these DFSs are data partitioning and replication. Partitioning data into blocks is not a new concept either, but the blocks are much larger than those of a local disk file system (usually 64 MB by default, versus a few KB locally). The DFS adopts a master-slave architecture: the master maintains the file metadata (the directory structure, the file-to-block mapping, block locations, and access control information), while the slaves store the actual file data. In GFS the master is called the GFS master and the slaves are called chunkservers; in HDFS they are called the namenode and the datanodes, respectively. Since this book uses Hadoop as its reference implementation, HDFS is used as the DFS. The HDFS architecture, similar to that of GFS, is shown in Figure 2.5.

 

Figure 2.5 HDFS Architecture

Reading files from HDFS

To read a file in HDFS, a client first contacts the namenode to determine where the file's data is stored. In response, the namenode returns, for each block of the file, the block ID and the datanodes that hold a copy of that block. The client then contacts the appropriate datanodes directly to fetch the block data. Each block is stored on the datanode as an ordinary file on the local disk, so HDFS runs on top of a standard operating system such as Linux. Note that the actual file data flows only between the client and the datanodes; only metadata passes between the client and the namenode.
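As an illustration, a minimal Java sketch of this read path using the standard Hadoop FileSystem API might look like the following; the file path and the cluster configuration are hypothetical, and error handling is omitted:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings are picked up from core-site.xml / hdfs-site.xml
        // on the classpath; fs.defaultFS points at the namenode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the stream consults the namenode for block locations;
        // the bytes themselves are then streamed from the datanodes.
        Path path = new Path("/data/example.txt");   // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```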

Data replication

By default, HDFS keeps three replicas of each data block (the replication factor is configurable; a small sketch follows this list). This serves two purposes:

1) It ensures the reliability, availability, and performance of HDFS.

In a large cluster, the three replicas of a block are usually stored on machines in different racks. This ensures that the data remains available after a single-node failure or even a network outage that takes an entire rack offline.

2) It improves data locality, so large amounts of data do not have to be shipped across nodes.
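As a minimal sketch (the file path is hypothetical), the replication factor can be set for newly created files through the dfs.replication property, or changed for an existing file through the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client;
        // 3 is also the HDFS default.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // The replication factor of an existing file can also be changed
        // per file; the namenode then creates the extra copies
        // (or removes surplus ones) in the background.
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
    }
}
```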

System status monitoring

The namenode periodically checks the replicas of every data block. If a block has too few replicas (usually because of a node failure: suppose block B is stored on nodes N1, N2, and N3, and N1 fails, so the namenode can only reach the copies on N2 and N3), the namenode creates additional copies until the target count is restored (for example, by copying B to N4 so that it again has three replicas). If a block has too many replicas (for example, N1 recovers from its failure and B now has four copies), the surplus copies are deleted.
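The decision logic can be summarized with the following illustrative sketch. This is not Hadoop's actual implementation; the BlockInfo type and the scheduleCopy/scheduleDelete helpers are hypothetical placeholders:

```java
// Illustrative only: a simplified view of the namenode's replica check.
// BlockInfo, scheduleCopy and scheduleDelete are hypothetical helpers.
public class ReplicaMonitorSketch {

    static final int TARGET_REPLICAS = 3;

    static void checkBlock(BlockInfo block) {
        int live = block.liveReplicaCount();   // replicas on reachable datanodes
        if (live < TARGET_REPLICAS) {
            // Too few copies (e.g. a datanode failed): schedule re-replication
            // from a surviving replica onto another datanode.
            scheduleCopy(block, TARGET_REPLICAS - live);
        } else if (live > TARGET_REPLICAS) {
            // Too many copies (e.g. a failed datanode came back): delete extras.
            scheduleDelete(block, live - TARGET_REPLICAS);
        }
    }

    // Hypothetical placeholders for the actual scheduling machinery.
    interface BlockInfo { int liveReplicaCount(); }
    static void scheduleCopy(BlockInfo block, int copies) { /* ... */ }
    static void scheduleDelete(BlockInfo block, int copies) { /* ... */ }
}
```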

Writing files to HDFS

Creating a file in HDFS proceeds in the following steps (a minimal client-side sketch follows the list):

1) The client sends a request to the namenode, which checks permissions and whether a file with the same name already exists;

2) If the checks succeed, the namenode allocates a data block on a datanode;

3) The client is directed to that data block and performs the write;

4) The data block is replicated to the required number of copies, which are distributed to other nodes for storage.
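As an illustration of the client side of this flow (the path is again hypothetical), a file can be written through the same FileSystem API:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the namenode to allocate the file and its first block;
        // the bytes are then streamed to the datanodes, which replicate them.
        Path path = new Path("/data/output.txt");   // hypothetical path
        try (BufferedWriter writer =
                 new BufferedWriter(new OutputStreamWriter(fs.create(path)))) {
            writer.write("hello hdfs");
            writer.newLine();
        }
    }
}
```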

Files in HDFS are immutable: once a file has been written, its contents cannot be changed (neither overwritten in place nor appended to). At the time of writing, the Apache Foundation plans to add an append feature to HDFS; GFS already supports it.

Namenode summary

In short, namenode in HDFS has the following responsibilities:

1) Namespace management: metadata, the file/directory structure, the file-to-block mapping, block locations, and access control;

2) Coordinating file operations:

(1) directing the client to the correct data blocks during reads;

(2) allocating data blocks during writes and directing the client where to write;

Note that the namenode only directs reads and writes toward the datanodes; the actual file data flows only between the client and the datanodes.

(3) After a file is deleted, its disk space is not released immediately; it is reclaimed lazily by HDFS's garbage collector (GC);

3) Maintaining the overall health of the file system:

(1) The namenode uses heartbeat messages to monitor whether datanodes are alive. If the number of replicas of a block drops because of a node failure, the namenode re-replicates the block so that every block keeps enough copies; conversely, it deletes surplus copies;

(2) Load balancing. After the system has been running for a while, the data distribution may become skewed. The namenode moves data blocks so that the distribution is rebalanced and the devices are utilized as evenly as possible.

The relationship between DFS and MapReduce

Both HDFS and GFS were designed specifically for their corresponding MapReduce frameworks, so some of their design features exist to meet the needs of the MapReduce computing environment. Understanding these features helps in designing more efficient MapReduce programs:

1) The file system typically holds a modest number of large files; how "modest" depends on the actual deployment scale. HDFS encourages a small number of large files (files of several GB are common) and discourages a large number of small files, for two reasons:

(1) The namenode keeps all metadata in memory, so the size of the file table it can support is limited. A large number of small files produces a large amount of metadata; storing the same data as one large file split into blocks is therefore more efficient than storing it as many small files.

(2) Hadoop mappers read data one file at a time; currently a mapper cannot process multiple files at once, so at least one mapper is needed per input file. When the input consists of many small files, many mappers must be started, each with very little work to do, so a large share of the system's resources is wasted on scheduling, starting, and tearing down mappers, which lowers efficiency. It also increases network traffic at the end of the map phase (as mentioned earlier, the shuffle after the map phase requires up to M × N transfers).

2) HDFS targets large-scale batch data processing: massive, mostly sequential reads and writes. In this scenario, stable high bandwidth matters more than low latency, which matches the way MapReduce works: batch processing over large data sets. Accordingly, neither HDFS nor GFS provides a caching mechanism for general computing tasks;

3) Neither HDFS nor GFS supports the POSIX access interface, which simplifies the system design. The cost is that some data-management work is pushed up into the application. This makes sense, however: MapReduce is a model for designing large-scale data processing programs, and moving some of the data handling into the program makes the design more flexible and lets the program work with the data more effectively.

4) The DFS design assumes that users cooperate (that is, all operations are trusted). The earliest GFS paper did not discuss file system security at all, and HDFS likewise assumes that only authorized users access the cluster. The permission checks in HDFS exist only to prevent accidental data corruption and can easily be bypassed.

5) The system is built from a large number of commodity PCs. In such an environment, machine failures are the norm rather than the exception, so HDFS includes the self-monitoring and recovery mechanisms described above to keep the system running.

(The original here also touches on the CAP theorem; it is not translated for now, and the relevant literature deserves careful study.)

The single-master design of GFS and HDFS has a well-known weakness: if the master fails, the entire system is paralyzed. However, the load on the master is relatively light: only metadata passes through it, and metadata is small, so the master is unlikely to become a performance bottleneck or to go down from overload. In practice, thanks to the monitoring mechanisms around the namenode, the impact of the namenode as a single point of failure is less serious than one might imagine; the mean time between failures (MTBF) can reach several months. The Hadoop community is also aware of the problem and provides countermeasures, such as a backup namenode that keeps a copy of the primary namenode's data and stands ready to take over when the primary can no longer serve requests. As products built on Hadoop are deployed and evolve, further countermeasures will no doubt be developed by the community.
