1. What is a distributed file system?
A distributed file system is a file system that manages storage spread across multiple computers in a network.
2. Why do I need a distributed file system?
The simple reason is that when a dataset outgrows the storage capacity of a single physical machine, it has to be partitioned and stored across several separate machines.
3. Distributed file systems are more complex than traditional file systems
Because a distributed file system sits on top of a network, it inherits all the complexity of network programming, which makes it more complex than an ordinary local file system.
4. Hadoop's file systems
Many people equate HDFS with "the Hadoop file system". In fact, Hadoop provides a general file system abstraction; HDFS is simply Hadoop's flagship implementation of it, and Hadoop can integrate other file systems besides HDFS. This is a good illustration of Hadoop's excellent extensibility.
Concretely, Hadoop defines a Java abstract class, org.apache.hadoop.fs.FileSystem, which describes a file system interface. Any file system that implements this interface can be used as a file system supported by Hadoop. The file systems that currently implement this abstract class are listed in the following table (a short usage sketch follows the table):
| File system | URI scheme | Java implementation (under org.apache.hadoop) | Description |
|---|---|---|---|
| Local | file | fs.LocalFileSystem | A local file system with client-side checksums. The local file system without checksums is implemented by fs.RawLocalFileSystem. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. |
| HFTP | hftp | hdfs.HftpFileSystem | Read-only access to HDFS over HTTP; often used with distcp to copy data between HDFS clusters. |
| HSFTP | hsftp | hdfs.HsftpFileSystem | Read-only access to HDFS over HTTPS. |
| HAR | har | fs.HarFileSystem | A file system layered on another Hadoop file system for archiving files. Hadoop archives are mainly used to reduce NameNode memory usage. |
| KFS | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos file system) is a distributed file system similar to HDFS and Google's GFS, written in C++. |
| FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server. |
| S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3. |
| S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks to work around S3's 5 GB file-size limit. |
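To make the abstraction concrete, here is a minimal sketch (my own illustration, not from the original article) of asking FileSystem for a concrete implementation by URI scheme; the NameNode address used below is hypothetical.

```java
// A minimal sketch: the URI scheme decides which FileSystem implementation you get.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // "file:///" resolves to LocalFileSystem
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println(local.getClass().getName());

        // "hdfs://..." resolves to DistributedFileSystem
        // (assumes a reachable NameNode at this hypothetical address)
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        System.out.println(hdfs.getClass().getName());
    }
}
```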
Finally, I would like to emphasize the distinction: the FileSystem abstract class above lives in Hadoop's common project, which mainly defines a set of distributed file system and general I/O components and interfaces; this layer of Hadoop might more accurately be called Hadoop I/O. HDFS is a separate Hadoop subproject, a distributed file system that implements this file system interface, i.e. an implementation of the Hadoop I/O interface.
Here is another table that should make the common API operations on a Hadoop file system clearer:
| Hadoop FileSystem API | Java equivalent | Linux system call | Description |
|---|---|---|---|
| URL.openStream, FileSystem.open, FileSystem.create, FileSystem.append | URL.openStream | open | Open a file |
| FSDataInputStream.read | InputStream.read | read | Read data from a file |
| FSDataOutputStream.write | OutputStream.write | write | Write data to a file |
| FSDataInputStream.close, FSDataOutputStream.close | InputStream.close, OutputStream.close | close | Close a file |
| FSDataInputStream.seek | RandomAccessFile.seek | lseek | Change the file read/write position |
| FileSystem.getFileStatus, FileSystem.get* | File.get* | stat | Get the attributes of a file/directory |
| FileSystem.set* | File.set* | chmod, etc. | Change the attributes of a file |
| FileSystem.createNewFile | File.createNewFile | create | Create a file |
| FileSystem.delete | File.delete | remove | Delete a file from the file system |
| FileSystem.rename | File.renameTo | rename | Change a file/directory name |
| FileSystem.mkdirs | File.mkdir | mkdir | Create a subdirectory under a given directory |
| FileSystem.delete | File.delete | rmdir | Remove an empty subdirectory from a directory |
| FileSystem.listStatus | File.list | readdir | Read the entries of a directory |
| FileSystem.getWorkingDirectory | (none) | getcwd/getwd | Get the current working directory |
| FileSystem.setWorkingDirectory | (none) | chdir | Change the current working directory |
With this table, the FileSystem API should be much clearer. A short sketch exercising a few of these calls follows.
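This is a minimal sketch, assuming a reachable default file system and using hypothetical paths of my own; it walks through mkdirs, create, getFileStatus, rename, listStatus, and delete from the table above.

```java
// A minimal sketch of common FileSystem operations (hypothetical paths).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/tmp/demo");           // hypothetical directory
        fs.mkdirs(dir);                             // mkdir

        Path file = new Path(dir, "a.txt");
        FSDataOutputStream out = fs.create(file);   // create and open for writing
        out.writeUTF("hello");                      // write
        out.close();                                // close

        FileStatus status = fs.getFileStatus(file); // stat
        System.out.println(status.getLen());

        fs.rename(file, new Path(dir, "b.txt"));    // rename
        for (FileStatus s : fs.listStatus(dir)) {   // readdir
            System.out.println(s.getPath());
        }
        fs.delete(dir, true);                       // remove (recursively)
    }
}
```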
As you can see from the comparison table, Hadoop's FileSystem works with two stream classes, FSDataInputStream and FSDataOutputStream, which play the role of InputStream and OutputStream in Java I/O; in fact they extend java.io.DataInputStream and java.io.DataOutputStream respectively.
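Here is a minimal sketch (hypothetical path again) of the main capability FSDataInputStream adds on top of a plain DataInputStream: seek(), i.e. random access within the file.

```java
// A minimal sketch: FSDataInputStream reads like DataInputStream but also supports seek().
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/tmp/demo/b.txt")); // hypothetical file

        IOUtils.copyBytes(in, System.out, 4096, false); // read the file once
        in.seek(0);                                     // jump back to the start
        IOUtils.copyBytes(in, System.out, 4096, false); // read it again
        in.close();
    }
}
```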
I will not cover Hadoop I/O itself in this article; I may devote a separate article to it later. To give you a first impression, here are two articles from cnblogs that interested readers can look at:
http://www.cnblogs.com/xuqiang/archive/2011/06/03/2042526.html
http://www.cnblogs.com/xia520pi/archive/2012/05/28/2520813.html
5. Data integrity
Data integrity here means techniques for detecting data corruption. Hadoop users naturally want the system to store and process data without losing or corrupting any of it. Although the chance of any single disk or network I/O operation introducing an error into the data it reads or writes is small, once the volume of data approaches the scale Hadoop is designed to handle, the probability that some data gets corrupted becomes quite high. Hadoop therefore provides data integrity checking; the principle is as follows:
Corruption is detected by computing a checksum when data first enters the system, and computing it again whenever the data has passed through an unreliable channel; if the two checksums do not match, the data is considered corrupted. Note that this technique can only detect errors, it cannot repair the data. The commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input data of any size.
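To illustrate the idea, here is a minimal sketch using Java's built-in java.util.zip.CRC32 (my own example; HDFS has its own internal checksum machinery): compute the checksum when the data enters the system, recompute it later, and compare the two values.

```java
// A minimal sketch of checksum-based corruption detection with CRC-32.
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "some data block".getBytes();

        CRC32 crc = new CRC32();
        crc.update(data);
        long checksumAtWrite = crc.getValue();  // stored alongside the data

        // ... the data is stored or transmitted, and may be corrupted ...

        crc.reset();
        crc.update(data);
        long checksumAtRead = crc.getValue();

        if (checksumAtWrite != checksumAtRead) {
            System.out.println("data is corrupted");   // detect only, cannot repair
        } else {
            System.out.println("checksums match");
        }
    }
}
```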
6. Compression and input splits
File compression has two major benefits: it reduces the disk space needed to store files, and it speeds up the transfer of data over the network and to and from disk. Both benefits matter a great deal to Hadoop, which processes massive amounts of data, so it is worth understanding compression in Hadoop. The following table lists the compression formats Hadoop supports:
| Compression format | Tool | Algorithm | File extension | Multiple files | Splittable |
|---|---|---|---|---|---|
| DEFLATE | N/A | DEFLATE | .deflate | No | No |
| gzip | gzip | DEFLATE | .gz | No | No |
| ZIP | zip | DEFLATE | .zip | Yes | Yes, at file boundaries |
| bzip2 | bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | lzop | LZO | .lzo | No | Yes |
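As a write-side illustration, here is a minimal sketch (hypothetical output path, my own example) that compresses a stream with one of the codecs above, GzipCodec from the org.apache.hadoop.io.compress package.

```java
// A minimal sketch: writing gzip-compressed output through a Hadoop compression codec.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        Path out = new Path("/tmp/demo/data.gz");  // hypothetical path
        CompressionOutputStream cout = codec.createOutputStream(fs.create(out));
        cout.write("some text to compress".getBytes());
        cout.finish();                             // flush the compressed trailer
        cout.close();
    }
}
```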
Two metrics matter most when choosing a compression format in Hadoop: compression ratio and compression/decompression speed. The following table shows how some formats perform on these metrics:
| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
|---|---|---|---|---|
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO-best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
When Hadoop works with compression, another important property is whether the format supports splitting a file, which is the "input splits" part of the title. Let me explain the splitting problem:
Whether a compression format can be split matters when MapReduce processes the data. Suppose we have a 1 GB compressed file and the HDFS block size is set to 64 MB (I will not explain HDFS blocks in this article; I will cover them when I write about HDFS). The file is then stored as 16 blocks. If the file is used as input to a MapReduce job and the compression format supports splitting, MapReduce generates 16 input splits from those 16 blocks and runs 16 map tasks, each taking one block as its input, which is very efficient. If the compression format does not support splitting, MapReduce can still process the file correctly, but it has to feed all 16 blocks to a single map task; with so few map tasks the work per task becomes much larger and execution efficiency drops significantly.
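On the read side, a minimal sketch (hypothetical input path, my own example) that lets CompressionCodecFactory pick the codec from the file extension and then reads the decompressed data:

```java
// A minimal sketch: choosing a codec by file extension and decompressing the input.
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path in = new Path("/tmp/demo/data.gz");       // hypothetical path
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(in); // chosen from the ".gz" extension

        InputStream stream = (codec == null)
                ? fs.open(in)                          // no codec found: read as-is
                : codec.createInputStream(fs.open(in));
        IOUtils.copyBytes(stream, System.out, 4096, true);
    }
}
```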
My knowledge here is still limited, so I will stop the discussion of compression and input splits at this point. Here is a related article for interested readers:
http://www.cnblogs.com/ggjucheng/archive/2012/04/22/2465580.html
7. Serialization in Hadoop
Let's take a look at two definitions first:
Serialization: converting a structured object into a byte stream so it can be transmitted over the network or stored permanently on disk.
Deserialization: the inverse process of turning a byte stream back into a structured object.
Serialization appears in two main areas of distributed data processing: interprocess communication and persistent storage.
In Hadoop, communication between nodes is implemented with remote procedure calls (RPC). RPC serializes the data into a binary stream and sends it to the remote node, which deserializes the binary stream back into the original data. Serialization for RPC has its own requirements; an RPC serialization format should be:
- Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center;
- Fast: interprocess communication forms the backbone of a distributed system, so the performance overhead of serialization and deserialization must be kept to a minimum;
- Extensible: protocols evolve to meet new requirements, so the format should allow clients and servers to evolve in a controlled way; for example, a node using a newer protocol should still be able to accept messages serialized in the old format;
- Interoperable: clients and servers written in different languages should be able to interact.
Hadoop defines its own serialization format, Writable, which is one of the cores of Hadoop.
Writable is a Java interface; implementing it is what makes a type serializable by Hadoop. For lack of time I will not expand on serialization either (a minimal sketch of a custom Writable appears after the link below). I also recommend an article on Hadoop serialization; it is brief and not comprehensive, but after reading it you will have a first understanding of how Hadoop serialization is implemented. The link is as follows:
http://blog.csdn.net/a15039096218/article/details/7591072
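To make the Writable idea concrete, here is a minimal sketch of a custom Writable type (a hypothetical PointWritable of my own, not something from Hadoop itself); the write()/readFields() pair is what Hadoop calls to serialize and deserialize the object for RPC or for storage.

```java
// A minimal sketch of a custom Writable implementation.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() { }              // Hadoop needs a no-arg constructor

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        x = in.readInt();
        y = in.readInt();
    }
}
```

The no-arg constructor matters: Hadoop typically creates Writable instances reflectively and then fills them in by calling readFields().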