Distributed File Systems and a Simple Overview of Related Nodes


Distributed File System

1. As data volumes grow, a single operating system can no longer store everything, so data is spread across disks managed by many machines. Managing and maintaining files scattered across those machines by hand is difficult, so a system is urgently needed to manage files on multiple machines; this need gave rise to the distributed file system.

2. A distributed file system allows files to be shared across multiple hosts on a network, letting users on multiple machines share files and storage space.

3. Transparency: files are actually accessed over the network, but to programs and users it looks just like accessing a local disk.

4. Fault tolerance: even if some nodes in the system go offline, the system as a whole can continue to operate without losing data.

5. There are many distributed file systems; HDFS is one of them. HDFS is suited to write-once, read-many workloads; it does not support concurrent writes, and it is not appropriate for handling large numbers of small files.

HDFS: the Hadoop Distributed File System.

The architecture of HDFS:

Master-slave structure:

Master node (exactly one): NameNode

Slave nodes (there can be many): DataNode

The NameNode is the management node for the entire file system. It maintains the file system's directory tree, the metadata of each file and directory, and the list of data blocks belonging to each file, and it receives the user's operation requests.

Files maintained by the NameNode include:

1. fsimage: the metadata image file, a snapshot of the NameNode's in-memory metadata at a certain point in time.

2. edits: the operation (edit) log file.

3. fstime: records the time of the last checkpoint.

Note:

1. The NameNode keeps metadata in memory so it can serve clients' read requests quickly, but in-memory data is easily lost (for example, on a power failure), so a copy of the metadata must also be kept on disk.

2. When a write request arrives, i.e. a change to the Hadoop file system, the NameNode first writes the edit log and synchronizes it to disk; only after that succeeds does it modify the metadata in memory and return a response to the client. The client then writes the data to the corresponding DataNodes according to that response.
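This "log first, then memory" ordering is a classic write-ahead-log pattern. The toy class below is a minimal sketch of it; the class and method names are mine for illustration and are not Hadoop's actual API:

```python
class MiniNameNode:
    """Toy model of the NameNode's write path: log first, then memory."""

    def __init__(self):
        self.edit_log = []   # stands in for the on-disk edits file
        self.metadata = {}   # in-memory namespace: path -> list of block ids

    def create_file(self, path, blocks):
        # 1. Append the operation to the edit log (conceptually, sync it to
        #    disk) before touching any in-memory state.
        self.edit_log.append(("create", path, blocks))
        # 2. Only after the log write succeeds, update the in-memory metadata.
        self.metadata[path] = blocks
        # 3. Return block information so the client can write to DataNodes.
        return blocks
```

If the process died between steps 1 and 2, replaying the edit log would rebuild the lost in-memory change, which is exactly why the log must be written first.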

3. fsimage is a mirror of the NameNode's metadata, but it is not kept consistent with the in-memory metadata at all times; instead, the contents of the edit log are merged into it at intervals. Because that merge consumes memory and CPU, Hadoop delegates the job of updating the fsimage file to the SecondaryNameNode.

The files above are stored in the local Linux file system, under the directory ${hadoop.tmp.dir}/dfs/name/current.

The DataNode provides storage for the actual file data.

1. File block (block): the most basic storage unit. Starting from offset 0, a file's content is split in order into fixed-size, numbered pieces; each piece is called a block. The default HDFS block size is 64 MB, so a 256 MB file consists of 256/64 = 4 blocks.

2. Unlike an ordinary file system, if a file in HDFS is smaller than one data block, it does not occupy the whole block's storage space; it occupies only space equal to the file's actual size.

3. replication: the number of copies kept of each block, set by the dfs.replication property in the hdfs-site.xml file.

Note: the actual data is stored under the ${hadoop.tmp.dir}/dfs/data/current directory.
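For example, to keep three copies of each block (the Hadoop 1.x default), hdfs-site.xml would contain a fragment like this:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```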
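The block-splitting arithmetic above can be checked in a few lines of Python (64 MB default block size, as stated in the text; the helper function name is mine):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size in this era: 64 MB

def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file occupies, splitting from offset 0 upward."""
    return max(1, math.ceil(file_size / block_size))

print(num_blocks(256 * 1024 * 1024))  # a 256 MB file -> 4 blocks
print(num_blocks(10 * 1024 * 1024))   # a 10 MB file -> 1 block
```

Note that the 10 MB file occupies one block slot in the namespace but only 10 MB of actual disk space, per point 2 above.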

SecondaryNameNode

1. It is a partial HA (high availability) solution, but it does not support hot standby, and it must be configured.

2. Execution process: it downloads the metadata files (fsimage and edits) from the NameNode, merges the two to generate a new fsimage, saves it locally, pushes it to the NameNode, and resets the NameNode's edits log.

3. By default it runs on the same node as the NameNode, but it is recommended to deploy it on a separate node.

4. The SecondaryNameNode's work is triggered either by time or by size; the relevant configuration can be found in the core-default.xml file.

The working schematic of the SecondaryNameNode is as follows:

The SecondaryNameNode's detailed workflow:

1. The SecondaryNameNode notifies the NameNode to roll (switch) the edit log.

2. The SecondaryNameNode fetches the fsimage and edit log from the NameNode over HTTP.

3. The SecondaryNameNode loads the fsimage into memory and then merges in the edit log operations.

4. The SecondaryNameNode sends the newly merged fsimage back to the NameNode.

5. On receiving the new fsimage from the SecondaryNameNode, the NameNode replaces the old fsimage with the new one.
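The five steps above can be modeled in a few lines of Python. This is a toy simulation, not Hadoop code: the fsimage is a dict, the edit log is a list of operations, and the checkpoint is a replay-and-swap.

```python
def replay(fsimage: dict, edits: list) -> dict:
    """Apply logged operations to a copy of the fsimage (step 3)."""
    merged = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
    return merged

def checkpoint(namenode: dict) -> None:
    """Toy SecondaryNameNode checkpoint covering steps 1-5 from the text."""
    edits = namenode["edits"]           # step 1: roll the log...
    namenode["edits"] = []              # ...so new writes go to a fresh one
    fsimage = namenode["fsimage"]       # step 2: fetch fsimage + edits
    new_image = replay(fsimage, edits)  # step 3: merge in memory
    namenode["fsimage"] = new_image     # steps 4-5: push back and replace
```

The key property the real protocol shares with this sketch: the merge runs outside the NameNode's request path, so the NameNode never pays the memory and CPU cost of replaying its own log.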

Trigger points for the SecondaryNameNode's work:

1. fs.checkpoint.period specifies the maximum time interval between two checkpoints; the default is 3,600 seconds, i.e. one hour.

2. fs.checkpoint.size specifies the maximum size of the edit log file; the default is 64 MB. Once this value is exceeded, a SecondaryNameNode checkpoint is forcibly triggered.

Both properties can be found in the core-default.xml file and should be configured in the core-site.xml file.
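A core-site.xml fragment overriding both triggers might look like this (the values shown are the defaults named above, expressed in seconds and bytes):

```xml
<configuration>
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>      <!-- seconds between checkpoints -->
  </property>
  <property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>  <!-- 64 MB edit-log size limit -->
  </property>
</configuration>
```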

Question one: How does HDFS achieve high reliability in Hadoop 1.x?

Answer:

1. Configure the dfs.name.dir property in the hdfs-site.xml file, which is where the NameNode stores its metadata. Multiple directories are separated by commas, and HDFS redundantly replicates the metadata to each of them; these directories are usually on different devices, and nonexistent directories are ignored.

2. Restore the NameNode through the SecondaryNameNode; see http://blog.csdn.net/lzm1340458776/article/details/38820739 for details.

3. Use the third-party tool AvatarNode.
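Point 1 above can be sketched as an hdfs-site.xml fragment like the following; the two directory paths are placeholders, chosen here to suggest two different physical devices:

```xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- comma-separated placeholder directories, ideally on different devices;
         nonexistent entries are ignored -->
    <value>/data1/dfs/name,/data2/dfs/name</value>
  </property>
</configuration>
```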

Question two: What happens when HDFS is formatted?

Answer: The NameNode creates its own directory structure.

Question three: What should we do when, after formatting multiple times, the DataNode fails to start?

Answer: Modify the namespaceID value in ${hadoop.tmp.dir}/dfs/data/current/VERSION so that it is consistent with the namespaceID value in ${hadoop.tmp.dir}/dfs/name/current/VERSION.
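That fix can be scripted; the helper below is a sketch of the idea (the function name is mine, and the VERSION files are assumed to be the plain `key=value` text files HDFS writes):

```python
import re
from pathlib import Path

def sync_namespace_id(name_version: Path, data_version: Path) -> str:
    """Copy the namespaceID from the NameNode's VERSION file into the
    DataNode's VERSION file so the two match again."""
    match = re.search(r"namespaceID=(\d+)", name_version.read_text())
    if match is None:
        raise ValueError("no namespaceID found in NameNode VERSION file")
    ns_id = match.group(1)
    # Rewrite the DataNode's namespaceID line in place.
    patched = re.sub(r"namespaceID=\d+", f"namespaceID={ns_id}",
                     data_version.read_text())
    data_version.write_text(patched)
    return ns_id
```

Always stop the DataNode before editing its VERSION file, and prefer this over reformatting, which would wipe the stored blocks.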

