Hadoop HDFS and NameNode Single Point of Failure Solutions


Hadoop's HDFS

Copyright Notice: This is an original article by Yunshuxueyuan.
If you reprint it, please credit the source: http://www.cnblogs.com/sxt-zkys/
QQ Technology Group: 299142667

HDFS Introduction

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, based on a paper about GFS published by Google.

What is a distributed file system

A distributed file system is one in which the physical storage resources managed by the file system are not necessarily attached directly to the local node, but are reached over a computer network. Distributed file systems are designed around the client/server model.

[Advantages]

Support for very large files: "very large" here means files of hundreds of MB, hundreds of GB, or even several TB.

Detection of and quick response to hardware failure: in a clustered environment, hardware failure is routine. With thousands of servers connected together, failure rates are high, so fault detection and automatic recovery are a design goal of HDFS.

Streaming data access: applications access datasets as streams. What matters is data throughput, not access latency.

Simplified consistency model: most HDFS files are written once and read many times. Once a file has been created, written, and closed, it generally does not need to be modified. This simple consistency model helps improve throughput.

[Cons]

Low-latency data access: applications that interact with users need responses within milliseconds or seconds. Because Hadoop is optimized for high data throughput, it sacrifices access latency, so it is not suitable for low-latency workloads.

Large numbers of small files: HDFS supports very large files, whose data is distributed across DataNodes while the metadata is kept on the NameNode. The NameNode's memory size therefore limits the number of files the HDFS file system can hold. Even though servers today have large memories, a huge number of small files will degrade NameNode performance.
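As a rough illustration of why small files hurt: a commonly cited rule of thumb is that each file, directory, and block costs on the order of 150 bytes of NameNode heap. The sketch below uses that assumption (it is an approximation, not an exact figure) to compare a million small files with the same data stored as a few large files.

```java
public class NameNodeMemoryEstimate {
    // Rule-of-thumb cost of one file, directory, or block object in
    // NameNode heap (~150 bytes). An approximation, not an exact figure.
    static final long BYTES_PER_OBJECT = 150;

    /** Rough NameNode heap needed to track the given numbers of files and blocks. */
    static long estimateHeapBytes(long files, long blocks) {
        return (files + blocks) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million small files, one block each: ~300 MB of NameNode heap.
        System.out.println(estimateHeapBytes(1_000_000, 1_000_000));
        // Roughly the same data as eight 128 MB files (one block each): ~2 KB.
        System.out.println(estimateHeapBytes(8, 8));
    }
}
```

The data volume is the same in both cases; only the object count on the NameNode differs.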

Multiple writers and file modification: an HDFS file can be written only once; concurrent writers and in-place modification are not supported. It is this restriction that allows data throughput to be high.

No strong transactions: HDFS does not provide strong transactional support of the kind found in relational databases.

[HDFS structure]

NameNode: the manager of the distributed file system. It is responsible for the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system metadata in memory, mainly the file information, the blocks belonging to each file, and the DataNodes on which each block resides.

SecondaryNameNode: merges the fsimage and edits files and sends the result back to the NameNode.

DataNode: the basic unit of file storage. It stores blocks in its local file system, keeps the blocks' metadata, and periodically reports all of its blocks to the NameNode.

Client: the application that needs to access files in the distributed file system.

fsimage: the metadata image file (the file system directory tree).

edits: the metadata operation log (a record of modification operations on the file system).
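The SecondaryNameNode's merge of fsimage and edits can be sketched as replaying the operation log over the directory tree. The toy model below is illustrative only: a set of paths stands in for the directory tree, and `CREATE`/`DELETE` strings stand in for edit records (the real formats are binary).

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CheckpointSketch {
    /**
     * Toy checkpoint merge: the fsimage is modeled as a set of paths
     * (the directory tree), and edits as "CREATE /path" / "DELETE /path"
     * records. Replaying the edits over the fsimage yields the new
     * fsimage, which is what the SecondaryNameNode's merge produces.
     */
    static Set<String> merge(Set<String> fsimage, List<String> edits) {
        Set<String> merged = new TreeSet<>(fsimage);
        for (String op : edits) {
            String[] parts = op.split(" ", 2);
            if (parts[0].equals("CREATE")) merged.add(parts[1]);
            else if (parts[0].equals("DELETE")) merged.remove(parts[1]);
        }
        return merged;
    }
}
```

After the merge, the edits log can be truncated, which is why checkpointing keeps it from growing without bound.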

How the NameNode, DataNodes, and clients communicate:

Between the client and the NameNode: RPC.

Between DataNodes and the NameNode: RPC.

Between the client and DataNodes: a plain socket connection.

Process of a client reading data from HDFS

1. The client opens the file it wants to read by calling the open() method of the FileSystem object.

2. DistributedFileSystem calls the NameNode via RPC to determine the locations of the file's first blocks. [Note 1]

3. The client calls the read() method on the input stream.

4. DFSInputStream [Note 2], which has stored the DataNode addresses for the file's first blocks, connects to the nearest DataNode. By repeatedly calling read() on the stream, data is transferred from the DataNode to the client. [Note 3]

5. When the end of a block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block.

6. The client reads data from the stream in order, with DFSInputStream opening new connections to DataNodes as the client reads through the stream. It also asks the NameNode to retrieve the locations of the next batch of blocks as needed. Once the read is complete, the client calls the close() method on the FSDataInputStream.

[Note 1]: For each block, the NameNode returns the addresses of the DataNodes holding a copy of that block. These DataNodes are sorted by their distance from the client; if the client is itself a DataNode and holds a copy of the requested block, it reads the data from the local DataNode.

[Note 2]: The DistributedFileSystem class returns an FSDataInputStream object from which the client reads the data. FSDataInputStream in turn wraps a DFSInputStream object, which manages the I/O with the DataNodes and the NameNode.

[Note 3]: If DFSInputStream encounters an error while communicating with a DataNode, it tries to read the data from the next nearest DataNode holding the block. It also remembers which DataNode failed, so that later blocks are not pointlessly requested from that node again. DFSInputStream additionally verifies checksums to confirm that the data sent from a DataNode is intact. If a corrupted block is found, it notifies the NameNode before DFSInputStream attempts to read a copy of the block from another DataNode.
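The replica selection and failure memory described in Notes 1 and 3 can be sketched as follows. This is an illustrative simulation, not Hadoop's actual DFSInputStream code; the DataNode class and its distance values are hypothetical.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReadNodeSelection {
    /** Hypothetical replica holder: a DataNode address plus its network distance from the client. */
    static class DataNode {
        final String addr;
        final int distance;
        DataNode(String addr, int distance) { this.addr = addr; this.distance = distance; }
    }

    // DataNodes the client has already seen fail; DFSInputStream keeps
    // similar state so it does not retry a dead node for later blocks.
    private final Set<String> deadNodes = new HashSet<>();

    /** Pick the nearest replica holder that is not known to be dead. */
    DataNode chooseReplica(List<DataNode> replicas) {
        return replicas.stream()
                .filter(dn -> !deadNodes.contains(dn.addr))
                .min(Comparator.comparingInt((DataNode dn) -> dn.distance))
                .orElseThrow(() -> new IllegalStateException("no live replica"));
    }

    /** Remember a failed DataNode so later block reads avoid it. */
    void markDead(DataNode dn) { deadNodes.add(dn.addr); }
}
```

A distance of 0 models the "client is itself a DataNode" case from Note 1, where the read is served locally.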

Process of a client writing data to HDFS

1. The client invokes the create() method of the DistributedFileSystem object, creating a file output stream.

2. DistributedFileSystem makes an RPC call to the NameNode to create a new file in the file system's namespace.

3. The NameNode performs various checks to ensure that the file does not already exist and that the client has permission to create it. If the checks pass, the NameNode records the new file; otherwise, file creation fails and an IOException is thrown to the client. DistributedFileSystem returns an FSDataOutputStream object to the client, and the client can begin writing data.

4. DFSOutputStream splits the data into packets and writes them to an internal data queue. The DataStreamer consumes this queue; it asks the NameNode to allocate suitable new blocks by picking a list of DataNodes on which to store the replicas. This set of DataNodes forms a pipeline. Assuming the replication factor is 3, there are 3 nodes in the pipeline: the DataStreamer streams each packet to the first DataNode in the pipeline, which stores the packet and forwards it to the second DataNode; the second DataNode likewise stores the packet and forwards it to the third.

5. DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged by the DataNodes (the ack queue). A packet is removed from the ack queue only once acknowledgments from all DataNodes in the pipeline have been received. [Note 1]

6. After the client finishes writing data, it calls the close() method on the stream.

7. This writes all remaining packets to the DataNode pipeline and, after waiting for acknowledgments, contacts the NameNode to signal that the file is complete.

[Note 1]: If a DataNode fails while data is being written: 1. The pipeline is closed, and any packets in the ack queue are added back to the front of the data queue, so that DataNodes downstream of the failed node do not miss any packets. 2. The current block on the remaining healthy DataNodes is given a new identity, which is passed to the NameNode so that the failed DataNode can delete its partial block when it recovers. 3. The failed DataNode is removed from the pipeline, and the rest of the block's data is written to the two healthy DataNodes in the pipeline. The NameNode notices that the block is under-replicated and arranges for a new replica to be created on another node.
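The interplay between the data queue and the ack queue, including the requeueing on pipeline failure described in [Note 1], can be sketched like this. It is a simplified model, not the real DFSOutputStream; packets are plain strings here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PacketQueues {
    // Packets waiting to be sent down the pipeline.
    final Deque<String> dataQueue = new ArrayDeque<>();
    // Packets sent but not yet acknowledged by every DataNode in the pipeline.
    final Deque<String> ackQueue = new ArrayDeque<>();

    void enqueue(String packet) { dataQueue.addLast(packet); }

    /** DataStreamer step: take the next packet and stream it down the pipeline. */
    String send() {
        String p = dataQueue.removeFirst();
        ackQueue.addLast(p);
        return p;
    }

    /** All DataNodes in the pipeline acknowledged the oldest outstanding packet. */
    void ackReceived() { ackQueue.removeFirst(); }

    /** Pipeline failure: unacknowledged packets go back to the front of the data queue, in order. */
    void onPipelineFailure() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.removeLast());
        }
    }
}
```

After a failure, the next send() re-transmits the oldest unacknowledged packet first, so no downstream DataNode misses data.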

NameNode single point of failure solutions in Hadoop

The Hadoop 1.0 core consists of two branches, MapReduce and HDFS, and both share the same design flaw: a single point of failure. The two core services, the MapReduce JobTracker and the HDFS NameNode, are each single points of failure. This article addresses only the NameNode single point of failure in HDFS.

[Question]

HDFS, a distributed storage system modeled on Google's GFS, consists of two services, the NameNode and the DataNode. The NameNode stores the metadata (fsimage) and the operation log (edits); because it is unique, the availability of the entire storage system is determined by its availability. A client must contact the NameNode before any HDFS read or write, and can proceed only after obtaining the metadata from it. Once the NameNode fails, the entire storage system becomes unusable.

[Solutions]

Hadoop officially provides the Quorum Journal Manager (QJM) for high availability. In an HA configuration, the edit log no longer resides on the NameNode but in a shared storage location made up of several JournalNodes, typically 3 (a small JN cluster). Each JN is dedicated to storing the edit log from the NameNode, and the edit log is written by the active NameNode.

There are 2 NameNode nodes, only one of which is active; the other is standby. Only the active node serves HDFS reads and writes, and only the active NameNode may write the edit log to the JNs. The standby NameNode is responsible only for copying edit-log data from the JN cluster to local storage. In addition, every DataNode reports its status (heartbeats and block information) to both NameNode nodes.

The two NameNode nodes (one active, one standby) stay in communication with the group of 3 JNs: the active NameNode writes the edit log to the JN cluster, while the standby NameNode watches the JN edit log and pulls it down to the standby node (taking over the SecondaryNameNode's job). Together with each node's own fsimage file, this keeps the metadata on the two NameNodes in sync. Once the active node becomes unavailable, the standby continues providing service. The architecture comes in a manual mode and an automatic mode: in manual mode, the administrator triggers the active/standby switch by command, which is typically useful during service upgrades; automatic mode reduces operational cost but carries a potential danger. The architectures of these two modes are described below.

[Manual mode]

Simulation Process:

1. Prepare 3 servers to run the JournalNode process (these can be DataNode servers), and 2 NameNode servers to run the NameNode process (configured identically); there is no limit on the number of DataNode nodes.

2. Start the JournalNode process on each of the 3 JN servers, and the DataNode process on each DataNode server.

3. Synchronize the metadata between the 2 NameNodes. In practice: copy the metadata from the first NameNode to the other, start the first NameNode process, then bootstrap the other NameNode as a standby.

4. Initialize the first NameNode's edit log onto the JN nodes, so that the standby node can pull data from them.

5. Start the standby NameNode so that the fsimage files can be synchronized.

6. Simulate a failure: manually transfer the active role from the failed NameNode to the other NameNode.

[Automatic mode]

Simulation Process:

Automatic mode adds ZKFC (DFSZKFailoverController) and a ZooKeeper cluster to the manual-mode setup.

ZKFC is mainly responsible for: health monitoring, session management, and leader election.

The ZooKeeper cluster is mainly responsible for: service state synchronization.

Steps 1-6 are the same as in manual mode.

7. Prepare 3 hosts and install ZooKeeper on them; these 3 hosts form a small ZooKeeper cluster.

8. Start the QuorumPeerMain process on each node of the ZK cluster.

9. Log in to one of the NameNodes and initialize the HA state in ZooKeeper.

10. Simulate a failure: stop the active NameNode process; the pre-configured ZooKeeper setup will automatically switch the standby node to active, and service continues.
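The automatic failover above hinges on a ZooKeeper lock: whichever NameNode's ZKFC grabs it becomes active, and when the active node's ZooKeeper session expires (for example, because its process died), the lock is released for the standby to take. Below is a minimal in-memory stand-in for that lock; the real mechanism is an ephemeral znode in ZooKeeper, which is not shown here.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ActiveLock {
    // Toy stand-in for the ephemeral ZooKeeper znode that ZKFC uses:
    // whichever NameNode creates it first becomes active.
    private final AtomicReference<String> holder = new AtomicReference<>(null);

    /** Try to become active; succeeds only if no other node holds the lock. */
    boolean tryBecomeActive(String nn) {
        return holder.compareAndSet(null, nn) || nn.equals(holder.get());
    }

    /** The active node's session expires (process died): the lock is released. */
    void sessionExpired(String nn) {
        holder.compareAndSet(nn, null);
    }

    String active() { return holder.get(); }
}
```

In the real system, the session expiry is detected by ZooKeeper itself, which is what makes the failover automatic rather than administrator-driven.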

Split-brain

Split-brain occurs when, during an active/standby switch, the switch is incomplete or otherwise faulty, causing clients and slaves to mistakenly believe there are two active masters, eventually leaving the whole cluster in a chaotic state. Split-brain is usually solved with a fencing (isolation) mechanism.

Shared-storage fencing: ensure that only one master can write data to the shared storage; QJM is used to implement this fencing.

The Quorum Journal Manager is based on Paxos (a message-passing consensus algorithm); Paxos solves the problem of agreeing on a value in a distributed environment.

[principle]

A. After initialization, the active NameNode writes edit-log entries to the JNs; each entry has a number. A write is considered successful as long as a majority of the JNs (more than half) return success.

B. The standby periodically reads batches of edit-log entries from the JNs and applies them to its in-memory fsimage.

C. Every time the NameNode writes the edit log, it passes a number, the epoch, to the JNs. Each JN compares it with the epoch it has saved: if the incoming epoch is greater than or equal to its own, the write is allowed and the JN updates its epoch to the newer value; otherwise, the operation is refused. When a standby switches to active, it increments the epoch by 1, which prevents the previous NameNode from writing its log to the JNs.

Client Fencing: Ensure that only one master can respond to client requests.

[principle]

A layer is wrapped around the RPC layer: the client connects to a NameNode through a FailoverProxyProvider, which retries on failure. From the client's point of view, after several failed connection attempts to one NameNode, a delay is added before it tries to connect to the new NameNode. The retry interval and the number of retries are configurable on the client.
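The retry-based failover can be sketched as a loop that alternates between the configured NameNodes up to a retry limit. The names and the `isUp` check below are hypothetical and the inter-attempt delay is omitted; this is not the real FailoverProxyProvider API.

```java
import java.util.List;
import java.util.function.Predicate;

public class FailoverRetry {
    /**
     * Toy failover proxy: tries the NameNodes in turn, switching to the
     * next one after each failed attempt, up to maxAttempts total tries.
     * The real client also sleeps between attempts (elided here).
     */
    static String connect(List<String> nameNodes, Predicate<String> isUp, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String nn = nameNodes.get(attempt % nameNodes.size());
            if (isUp.test(nn)) return nn; // connected to the active NameNode
        }
        throw new IllegalStateException("no NameNode reachable");
    }
}
```

Because only the active NameNode answers, the loop naturally settles on whichever node currently holds the active role.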

Slave fencing: ensure that only one master can send commands to the slaves (the DataNodes).

[principle]

A. Whenever a NameNode changes state, it sends its new state together with a sequence number to the DataNodes.

B. Each DataNode keeps track of this sequence number while it runs. On failover, the new NameNode's reply to the DataNode's heartbeat carries the active state and a larger sequence number; on receiving it, the DataNode regards that NameNode as the new active.

C. If the original active NameNode recovers (for example, after a long GC pause), its heartbeat replies to the DataNodes still carry the active state but the original sequence number, so the DataNodes reject that NameNode's commands.
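The sequence-number check in steps A-C can be modeled in a few lines: the DataNode simply rejects any "active" claim carrying a serial number older than the newest one it has seen. This is an illustrative sketch, not DataNode source code.

```java
public class SlaveFencing {
    /** The DataNode remembers the largest serial number seen from an active NameNode. */
    static class DataNode {
        long activeSerial = 0;

        /** Heartbeat reply from a NameNode claiming to be active with the given serial. */
        boolean acceptAsActive(long serial) {
            if (serial < activeSerial) return false; // stale claimant: reject its commands
            activeSerial = serial;                   // newer (or equal) claim wins
            return true;
        }
    }
}
```

The recovered old active in step C is exactly the `serial < activeSerial` case: its number predates the one issued at failover, so it is fenced off.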

Finally, I would like to thank the teachers who helped me during my studies.
