Chapter 6: HDFS Overview

6.1.2 HDFS Architecture
HDFS uses a master-slave architecture with four roles:
- NameNode: the file system manager, responsible for the namespace, cluster configuration, and data block replication.
- DataNode: the basic unit of file storage; it saves file contents and the checksum information of data blocks, and performs the low-level block I/O operations.
- Client: communicates with the NameNode and DataNodes to access the HDFS file system and operate on files.
- SecondaryNameNode: a helper daemon, described below.
1. Data blocks
A Linux ext3 block defaults to 4,096 bytes; an HDFS block defaults to 64 MB, and the default number of replicas is 3. Splitting files into blocks brings several benefits: a file can be stored across different disks (and so can be larger than any single disk), the storage subsystem is simplified, and fault tolerance and data replication become easier.
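These defaults can be overridden per cluster. A typical hdfs-site.xml fragment might look like the following (the property names below are the Hadoop 1.x keys; newer versions use dfs.blocksize, and the values shown are illustrative):

```xml
<!-- hdfs-site.xml: block size and replication (illustrative values) -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB, the default -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default number of replicas -->
  </property>
</configuration>
```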
2. NameNode and SecondaryNameNode
The NameNode maintains the directory tree of the entire file system, the metadata of files and directories, and the block index of each file. This information is persisted to the local file system in two forms: the file system image (fsimage, which stores the namespace at a particular moment) and the edit log (edits, which records the changes made since that moment).
At run time, the client obtains this metadata from the NameNode and then interacts with DataNodes to read and write file data. The NameNode also tracks the overall operational state of HDFS, such as used and free space and the state of each DataNode. The SecondaryNameNode is a daemon that periodically merges the namespace image with the edit log. It does not receive or record any real-time changes to HDFS; instead, at intervals set by the cluster configuration, it fetches the current namespace image and edit log from the NameNode, merges them into a new namespace image, and sends that image back to the NameNode, which replaces its original namespace image and empties the edit log.
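The fsimage/edit-log design can be illustrated with a toy model (plain Python, not Hadoop code): the image is a snapshot dictionary, the edit log is a list of operations, and a checkpoint replays the log onto the snapshot and then empties it.

```python
# Toy model of the fsimage + edit log design (illustrative, not Hadoop code).
# fsimage: snapshot of the namespace at some moment.
# edits:   operations recorded since that snapshot.

def apply_edit(namespace, edit):
    """Replay one logged operation onto the namespace snapshot."""
    op, path = edit
    if op == "mkdir":
        namespace[path] = {"type": "dir"}
    elif op == "delete":
        namespace.pop(path, None)
    return namespace

def checkpoint(fsimage, edits):
    """Merge: replay the edit log onto the image, then empty the log."""
    new_image = dict(fsimage)
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image, []          # new fsimage, emptied edit log

fsimage = {"/": {"type": "dir"}}
edits = [("mkdir", "/user"), ("mkdir", "/tmp"), ("delete", "/tmp")]
fsimage, edits = checkpoint(fsimage, edits)
# After the merge, the image reflects all edits and the log is empty.
```

The point of the split is the same as in HDFS: appending to a log is cheap, so every change is durable immediately, while the expensive full-image rewrite happens only at checkpoint time.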
The NameNode is a single point of failure; HDFS HA addresses this (discussed below).
3. DataNode
The DataNode daemon writes HDFS blocks to actual files in the Linux local file system, and reads blocks of data back from those files. When a client performs an operation on file contents, the NameNode tells the client which DataNode holds each data block; the client then communicates directly with the DataNodes to process the local files corresponding to those blocks. DataNodes also communicate with one another to replicate data and guarantee redundancy.
As slave nodes, DataNodes report to the NameNode continuously. At initialization, a DataNode tells the NameNode which data blocks it currently stores; afterwards it keeps the NameNode updated with local changes, and it accepts instructions from the NameNode to create, move, and delete data blocks on its local disks.
How Hadoop splits text input is up to the code that processes it, but the default MapReduce handler splits by byte offset: a reader starts at its split point, skips forward to the first newline, and treats that as the formal start of its portion; when it reaches the end of its split, it keeps reading up to the next newline before formally ending. This guarantees that each split consists of whole lines. Any tool outside Hadoop that uses the same splitting scheme can likewise process a text file directly in parallel.
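The newline-adjustment rule above can be sketched in a few lines of Python (a simplified model of what an input-format reader does, not Hadoop's actual code):

```python
def read_split(data: bytes, start: int, end: int) -> bytes:
    """Read the lines belonging to the byte range [start, end).

    A reader that does not start at offset 0 skips forward past the
    first newline (that partial line belongs to the previous split);
    at `end` it keeps reading until the next newline, so the last
    line is kept whole.
    """
    if start > 0:
        nl = data.find(b"\n", start)
        start = len(data) if nl == -1 else nl + 1
    nl = data.find(b"\n", end)
    stop = len(data) if nl == -1 else nl + 1
    return data[start:stop] if start < stop else b""

text = b"alpha\nbravo\ncharlie\ndelta\n"
# Two splits that cut 'bravo' in the middle -- each line still ends up
# in exactly one split, and nothing is lost or duplicated.
first = read_split(text, 0, 8)
second = read_split(text, 8, len(text))
```

Because both readers apply the same rule, the line straddling the split boundary is read exactly once, by the split in which it begins.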


4. Client
The client is the means by which users interact with HDFS; it includes the command line, the Java API, the Thrift interface, and so on. DistributedFileSystem extends org.apache.hadoop.fs.FileSystem and implements the operations on HDFS files and directories. DFSDataInputStream and DFSDataOutputStream implement FSDataInputStream and FSDataOutputStream respectively, providing the input and output streams for reading and writing HDFS. FileStatus exposes the state of a file.
6.1.3 HDFS Source Code Structure
The HDFS source code lives under the org.apache.hadoop.hdfs package.
1. Basic packages. hdfs.security.token.block and hdfs.security.token.delegation integrate with the Hadoop security framework, which is built on the Kerberos standard.
2. HDFS entity packages. hdfs.server.common contains functionality shared by the NameNode and DataNodes, such as system upgrades and storage-space information. hdfs.protocol provides the interfaces through which the various HDFS entities interact over IPC. hdfs.server.namenode, hdfs.server.datanode, and hdfs contain the implementations of the NameNode, DataNode, and client respectively. hdfs.server.namenode.metrics and hdfs.server.datanode.metrics collect metric data on the NameNode and DataNodes, including counts of events in the NameNode and DataNode processes.
3. Application packages. hdfs.tools and hdfs.server.balancer provide the tool for querying HDFS status information (dfsadmin), the file checking tool (fsck), and the implementation of the HDFS balancer (Balancer).
6.2 Interfaces based on remote procedure calls
The HDFS architecture includes three main roles: the NameNode, DataNodes, and clients. They communicate through two main kinds of interfaces: the Hadoop remote procedure call (IPC) interface, and streaming interfaces based on TCP or HTTP.
The IPC interfaces between the HDFS nodes fall into two main categories:
(1) Client-related interfaces, defined in the org.apache.hadoop.hdfs.protocol package: ClientProtocol between the client and the NameNode, and ClientDatanodeProtocol between the client and DataNodes.
(2) Interfaces between servers: DatanodeProtocol between DataNodes and the NameNode, InterDatanodeProtocol between DataNodes, and NamenodeProtocol between the SecondaryNameNode or the HDFS balancer and the NameNode.
6.2.1 Client-related interfaces
The abstraction of a data block in HDFS is org.apache.hadoop.hdfs.protocol.Block, which contains three member variables, all long integers:
- blockId: the unique identifier of the data block; on disk the block is stored in a file named blk_<id>.
- numBytes: the size of the block's file data.
- generationStamp: the version number (generation timestamp) of the block.
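A minimal Python mirror of those three fields (for illustration only; the real class is Java, in org.apache.hadoop.hdfs.protocol):

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Illustrative stand-in for org.apache.hadoop.hdfs.protocol.Block."""
    block_id: int          # unique identifier of the block
    num_bytes: int         # size of the block's file data
    generation_stamp: int  # version number (generation timestamp)

    @property
    def block_name(self) -> str:
        # On a data node the block is stored in a file named blk_<id>.
        return f"blk_{self.block_id}"

b = Block(block_id=9182736450, num_bytes=64 * 1024 * 1024, generation_stamp=1001)
```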
BlockLocalPathInfo covers the case where the client discovers that the block it wants to read happens to be on the same machine: the client can then read the local file directly, without fetching the block through the DataNode.


ClientProtocol interface functions:
The client requests a new data block through addBlock(), which returns the addresses of DataNodes to write to. If one of those DataNodes crashes for some reason before writing begins and the client cannot contact it, the client notifies the NameNode via abandonBlock() to discard the block, then calls the addBlock() method again to request a new one, passing the failed DataNode's information in the parameters so that the NameNode does not return the crashed node again.
fsync() guarantees that the NameNode's metadata changes for the file are saved to disk; it does not guarantee persistence of the data on the DataNodes.
complete() is used to implement the close() method of the output stream. It communicates only with the NameNode, informing it that the client has finished writing the file.
To handle client crashes, the client periodically calls ClientProtocol.renewLease() to send heartbeat information to the NameNode. If the NameNode has not received a lease renewal from a client for a long time, it considers the client crashed and tries to close the file. If the client recovers from the crash and wants to continue the unfinished write, it calls recoverLease() (with the path of the file whose lease should be recovered); if the method returns true, the file has been closed successfully, and the client can reopen it with append() and continue writing data.
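The lease mechanism can be sketched as a toy table on the NameNode side (plain Python; the names and the 60-second limit below are made up for illustration, not Hadoop's actual values):

```python
# Toy lease table, as the name node might keep it (illustrative only).
LEASE_LIMIT = 60.0  # seconds without renewal before a client is presumed dead

class LeaseTable:
    def __init__(self):
        self.last_renewal = {}  # client id -> time of last renewLease()

    def renew_lease(self, client_id, now):
        """What ClientProtocol.renewLease() amounts to: a heartbeat."""
        self.last_renewal[client_id] = now

    def expired_clients(self, now):
        """Clients the name node would presume crashed, and whose open
        files it would then try to close."""
        return [c for c, t in self.last_renewal.items()
                if now - t > LEASE_LIMIT]

leases = LeaseTable()
leases.renew_lease("client-A", now=0.0)
leases.renew_lease("client-B", now=50.0)
# At t=100, client-A has not renewed for 100s > 60s: presumed crashed.
dead = leases.expired_clients(now=100.0)
```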
If the NameNode crashes: because the NameNode records every namespace change (file creation, append, and so on) in the edit log, it can restore the lease information on the NameNode from the log on restart.
6.2.2 Interfaces Between the Servers in HDFS
1. DatanodeProtocol, used for communication between DataNodes and the NameNode.
When a DataNode initializes, it reports its currently stored data blocks to the NameNode; in the subsequent process it keeps the NameNode updated with local block changes, and it accepts instructions from the NameNode to create, move, or delete data blocks on its local disks. The main calls are:
- Handshake: versionRequest() checks that the NameNode and the DataNode have the same build version.
- Registration: register() provides the DataNode's node identifier and storage system information.
- Block report: blockReport() reports all the data blocks the node manages, helping the NameNode establish the mapping between data blocks and DataNodes.
- Heartbeat: sendHeartbeat(), besides identifying the node, carries information about its current operation; the NameNode's reply carries its commands as a DatanodeCommand array.
Data blocks are stored on DataNodes and can become corrupted for various reasons, so HDFS uses a cyclic redundancy check (CRC) for error detection. Checksums are verified in three situations: when a DataNode receives data to store, when a client reads data from a DataNode, and when a DataNode periodically scans its own data blocks. Verification errors are reported to the NameNode via reportBadBlocks().
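The checksum idea can be demonstrated with Python's zlib.crc32 (a simplified illustration; HDFS stores CRC32 checksums per chunk of block data, 512 bytes by default):

```python
import zlib

def checksum(chunk: bytes) -> int:
    """CRC32 of one data chunk, as stored alongside the block data."""
    return zlib.crc32(chunk)

# Writer side: store the data together with its checksum.
data = b"some block contents"
stored_crc = checksum(data)

# Reader side: recompute and compare. A mismatch means the replica is
# corrupt and (in HDFS) would be reported via reportBadBlocks().
assert checksum(data) == stored_crc             # intact replica verifies

corrupted = b"some block c0ntents"
is_corrupt = checksum(corrupted) != stored_crc  # a flipped byte is detected
```

On a checksum failure the client does not give up: as section 6.4.2 describes, it reports the bad replica and reads the same range from another DataNode.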
2. InterDatanodeProtocol
Provides the methods used for data block recovery between DataNodes.
3. NamenodeProtocol
It provides getBlocks(): the balancer can obtain the list of data blocks on a DataNode together with their locations, and based on these return values it can move blocks from that DataNode to other DataNodes, balancing the data blocks across the cluster.
getEditLogSize() returns the size of the edit log on the NameNode. When the edit log reaches a certain size, the SecondaryNameNode notifies the NameNode via the rollEditLog() method to begin the merge process; the NameNode then stops using the current edit log and opens a new log file, which makes it convenient for the SecondaryNameNode to fetch, through the HTTP-based streaming interface, the namespace image and the edit log to be merged. rollEditLog() returns a merge checkpoint. When the merge completes, the SecondaryNameNode uploads the new metadata image, finishing one metadata merge.
6.3.1 Non-IPC Interfaces of the DataNode
The DataNode's interface for reading and writing data blocks (stored as Linux local files) is based on TCP rather than IPC, which suits batch transfer of data and improves throughput. Besides block reads and writes, the DataNode provides TCP-based interfaces for block replacement, block copying, and reading block checksum information.
(1) Reading data.
(2) Writing data. The Hadoop file system implements a data-flow pipeline: when the client sends data, the data goes to the first DataNode, which saves it locally while pushing it on to the second DataNode, and so on until the last DataNode in the pipeline. Acknowledgment packets are generated by the last DataNode and travel upstream toward the client; each DataNode along the way passes the acknowledgment upstream only after confirming that its own write succeeded.
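The pipeline can be modeled in a few lines (a toy simulation, not the DataNode transfer protocol):

```python
def pipeline_write(packet: bytes, nodes: list) -> bool:
    """Toy model of the HDFS write pipeline.

    Each node stores the packet locally and pushes it downstream; the
    acknowledgment originates at the last node and travels back
    upstream, each node forwarding it only if its own write succeeded.
    """
    stores = []
    for node in nodes:             # downstream: data flows node by node
        node.append(packet)        # "persist" locally
        stores.append(True)
    ack = True                     # upstream: ack from the last node back
    for ok in reversed(stores):
        ack = ack and ok
    return ack

dn1, dn2, dn3 = [], [], []
acked = pipeline_write(b"packet-0", [dn1, dn2, dn3])
# All three replicas hold the packet, and the client receives one ack.
```

The design point is that the client sends each packet once; replication bandwidth is spent between DataNodes, not at the client.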
6.3.2 Non-IPC Interfaces of the NameNode and SecondaryNameNode
PS: In a service-oriented architecture (SOA), the common HTTP request methods are GET, POST, HEAD, PUT, and DELETE.
In Hadoop 1.x, the NameNode and the SecondaryNameNode communicate over HTTP using the GET method.


Hadoop 2.x solves the NameNode single-point-of-failure problem and no longer uses the SecondaryNameNode. In Hadoop 1.x, the SecondaryNameNode merged fsimage and edits to keep the edits file small, reducing the time needed to restart the NameNode. Since Hadoop 2.x does not use the SecondaryNameNode, how does it merge fsimage and edits? Hadoop 2.x provides an HA mechanism (solving the NameNode single point of failure), implemented by configuring an odd number of JournalNodes (the configuration details are beyond this section). The HA mechanism runs two NameNodes (an active NN and a standby NN) in the same cluster; at any time, only one machine is in the active state and the other is in standby. The active NN serves all client operations in the cluster, while the standby NN is mainly a backup: it maintains enough state to provide fast failover when necessary.
To keep the standby NN's state synchronized with the active NN, that is, to keep the metadata consistent, both communicate with the JournalNode daemons. When the active NN performs any modification of the namespace, the change must be persisted to more than half of the JournalNodes (persisted as an edits log). The standby NN watches for changes in the edits log, reads the edits from the JNs, and applies them to its own namespace. If the active NN fails, the standby NN first makes sure it has read all edits from the JNs and then switches to the active state; reading all edits ensures that the namespace state is fully synchronized with the active NN before failover occurs.




So how does this mechanism merge fsimage and edits? A thread called CheckpointerThread is always running on the standby NameNode. It calls the doWork() function of the StandbyCheckpointer class, which performs a merge operation every Math.min(checkpointCheckPeriod, checkpointPeriod) seconds. The steps are:
(1) After HA is configured, all client update operations are written to the JournalNodes' shared edits directory (set in the cluster configuration).
(2) The active NameNode and the standby NameNode pull edits from the JournalNodes' shared edits directory into their own edits directories.
(3) The StandbyCheckpointer class on the standby NameNode periodically checks whether the merge conditions are met; if so, it merges the fsimage and edits files.
(4) After the merge, the StandbyCheckpointer uploads the merged fsimage to the corresponding directory on the active NameNode.
(5) When the active NameNode receives the latest fsimage file, it cleans out the old fsimage and edits files.
(6) Through the steps above, fsimage and edits are merged, and because of the HA mechanism both the standby and active NameNodes hold the latest fsimage and edits files (in Hadoop 1.x, the fsimage and edits on the SecondaryNameNode were not up to date).
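Step (1) refers to the shared-edits configuration. A typical Hadoop 2.x hdfs-site.xml fragment looks like the following (the host names and cluster name are placeholders; 8485 is the default JournalNode port):

```xml
<!-- hdfs-site.xml: where the active NN writes, and the standby NN reads,
     the shared edits (an odd number of JournalNodes) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
```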
6.4 HDFS Main Flows
6.4.1 Client-to-NameNode File and Directory Operations
Many metadata operations from the client to the NameNode, such as rename and mkdir, involve only interaction between the client and the NameNode, carried over ClientProtocol. When the client invokes the mkdir() method of the HDFS FileSystem instance, that is, of DistributedFileSystem, the DistributedFileSystem object calls the remote method mkdir() on the NameNode via IPC, and the NameNode performs the actual subdirectory creation: it creates a new directory node at the corresponding position in the directory-tree data structure and persists a record of the operation to the log. After the method executes successfully, mkdir() returns true. During the whole process, neither the client nor the NameNode needs to interact with any DataNode.
Adding replicas of files and deleting files on HDFS is slightly more involved. Take deleting an HDFS file as an example: when the NameNode executes the delete() method, it only marks the data blocks involved as needing deletion and records the operation persistently in the log; it does not actively contact the DataNodes that hold those blocks or delete the data immediately. Later, when a DataNode that stores those blocks sends a heartbeat, the NameNode puts a DatanodeCommand in the heartbeat response instructing that DataNode to delete the data.
PS: The data of a deleted file, i.e. the file's data blocks, is therefore removed only some time after the delete operation completes. The NameNode and DataNodes maintain a strictly simple master-slave structure: the NameNode never initiates any IPC call to a DataNode, and any operation a DataNode must perform for the NameNode is returned in the DatanodeCommand array carried in the DataNode's heartbeat response.
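The deferred deletion can be modeled as follows (a toy sketch in Python, not NameNode code; the block map contents are invented):

```python
# Toy model of lazy block deletion (illustrative, not NameNode code).

class ToyNameNode:
    def __init__(self):
        # block id -> set of data nodes holding a replica
        self.block_map = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn1"}}
        self.invalidate = {}  # data node -> blocks it should delete

    def delete_file(self, blocks):
        """delete(): only *mark* the blocks; contact no data node."""
        for b in blocks:
            for dn in self.block_map.pop(b, set()):
                self.invalidate.setdefault(dn, set()).add(b)

    def heartbeat(self, dn):
        """The delete command rides back in the heartbeat response."""
        return [("DELETE", b) for b in sorted(self.invalidate.pop(dn, set()))]

nn = ToyNameNode()
nn.delete_file(["blk_1"])
cmds = nn.heartbeat("dn1")   # dn1 learns of the deletion only now
```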


6.4.2 Client Reads a File
The client opens a file via FileSystem.open(); the concrete HDFS file system, DistributedFileSystem, creates the input stream FSDataInputStream and returns it to the client. For HDFS the concrete input stream is DFSInputStream. The input stream instance calls the remote interface ClientProtocol.getBlockLocations() on the NameNode to determine where the blocks at the beginning of the file are stored; for each block, the NameNode returns the addresses of the DataNodes holding a copy of that block, sorted simply by their distance to the client (using the network topology information).
When the client calls the FSDataInputStream.read() method to read file data, the DFSInputStream object connects, through the DataNode's "read data" streaming interface, to the nearest DataNode holding the first block. The client calls read() repeatedly, and data is returned in packets over the connection between the DataNode and the client. When the end of the block is reached, DFSInputStream closes the connection to that DataNode and obtains the DataNodes holding the next block, from the cached results of getBlockLocations() or, if the locations are not cached, by calling the remote method again; it then finds the best DataNode and again fetches data through the DataNode's read-data interface.
If an error occurs while the client is reading, such as a node outage or a network failure, the client tries the next DataNode holding that block, and it remembers the node that failed so that it makes no futile retries. The reply packets contain not only the data but also its checksum, and the client verifies the consistency of the data; if the checksum is wrong, that block replica is corrupt, so the client reports this to the NameNode and tries to read that part of the file from another replica on a different DataNode.
Having the client contact the NameNode only to retrieve block locations, with the NameNode arranging the DataNode reading order, has the advantage that the data transfer caused by reads is spread across the DataNodes, so HDFS can support a large number of concurrent clients. Meanwhile, the NameNode only handles block-location requests and never serves data itself.
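Replica selection with failover can be sketched like this (toy Python, not the DFSInputStream logic; the node names are invented):

```python
def read_block(replicas, dead_nodes):
    """Try replicas in order of distance; skip nodes already known to
    have failed, remember new failures, and return the first good read."""
    for node, data in replicas:          # already sorted nearest-first
        if node in dead_nodes:
            continue                     # no futile retry
        if data is None:                 # simulate an outage / read error
            dead_nodes.add(node)
            continue
        return data
    raise IOError("all replicas failed")

dead = set()
replicas = [("dn-near", None),           # nearest replica is down
            ("dn-mid",  b"block contents"),
            ("dn-far",  b"block contents")]
got = read_block(replicas, dead)
# The failed node is remembered, so reads of later blocks skip it.
```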
6.4.3 Client Writes a File
The client creates a file by calling the create() method of DistributedFileSystem. DistributedFileSystem creates a DFSOutputStream and, through a remote procedure call, has the NameNode execute a method of the same name, creating a new file in the file system's namespace. The NameNode performs various checks before creating the file; once they pass, it builds the new file entry and records the creation in the edit log (edits). When the remote call returns, DistributedFileSystem wraps the DFSOutputStream object in an FSDataOutputStream instance and returns it to the client.
When the client writes data, because the create() call only produced an empty file, the DFSOutputStream instance first needs to request a data block from the NameNode via addBlock(). When that method executes successfully, it returns a LocatedBlock object containing the ID and generation stamp of the new block; its member variable LocatedBlock.locs provides the information needed for the data-flow pipeline. With this information, DFSOutputStream can contact the DataNodes and establish the pipeline through the write-data interface. Data the client writes into the FSDataOutputStream stream is divided into packets and placed in the DFSOutputStream object's internal queue; the packets are sent into the pipeline, flow through each DataNode on the pipeline, and are persisted. Acknowledgment packets travel upstream along the pipeline back to the client, and when the client receives an acknowledgment it removes the corresponding packet from the internal queue.
After DFSOutputStream finishes writing a data block, the nodes on the pipeline commit the block to the NameNode through the blockReceived() method of the DatanodeProtocol remote interface. If more data is waiting in the output queue, the DFSOutputStream object calls the addBlock() method again to add a new block to the file. After the client finishes writing, it calls close() to close the stream; once every packet in the DFSOutputStream data queue has been acknowledged, the client uses the ClientProtocol.complete() method to notify the NameNode to close the file, completing the file write process.
If a DataNode fails while file data is being written, the following actions are performed. First the data pipeline is closed, and the packets that were sent into the pipeline but have not yet been acknowledged are added back to DFSOutputStream's output queue. The data blocks on the DataNodes that are still working are given a new generation stamp and the NameNode is notified; after the failed DataNode recovers, its copy of the block, which holds only part of the data, will be deleted because its generation stamp does not match the one saved by the NameNode. Then the failed DataNode is removed from the pipeline, the pipeline is re-established, and writing continues normally to the working DataNodes. When the file is closed, the NameNode notices that the block's replica count does not meet the requirement, chooses a new DataNode, and copies the block to it to create a new replica. A DataNode failure affects only the write of the current data block; writes of subsequent blocks are not affected. More than one DataNode may fail during the write of a block; the write is still considered successful as long as the number of DataNodes in the pipeline satisfies the value of the configuration item ${dfs.replication.min} (default 1).




































