Hadoop Learning Notes: A Brief Analysis of the Hadoop File System


1. What is a distributed file system?

A file system that manages storage across a network of multiple machines is called a distributed file system.

2. Why do we need a distributed file system?

The reason is simple: when the size of a dataset exceeds the storage capacity of a single physical machine, it becomes necessary to partition the data and store it across several separate machines.

3. Distributed file systems are more complex than traditional file systems

Because a distributed file system is built on top of a network, it introduces all the complexity of network programming, which makes it more complex than an ordinary local file system.

4. Hadoop's file systems

Many newcomers simply equate HDFS with Hadoop's file system. In fact, Hadoop provides a general file system abstraction, of which HDFS is only the flagship implementation; besides HDFS, Hadoop can integrate with other file systems as well. This feature is a good illustration of Hadoop's excellent extensibility.

Concretely, Hadoop defines an abstract file system concept through a Java abstract class, org.apache.hadoop.fs.FileSystem, which specifies the file system interface in Hadoop. Any file system that implements this interface can be used as a file system supported by Hadoop. The file systems that currently implement this abstract class are listed in the following table:

File system | URI scheme | Java implementation (under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A file system for a locally attached disk with client-side checksums; a local file system without checksums is implemented by fs.RawLocalFileSystem.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system.
HFTP | hftp | hdfs.HftpFileSystem | Provides read-only access to HDFS over HTTP; often used with distcp to copy data between HDFS clusters.
HSFTP | hsftp | hdfs.HsftpFileSystem | Provides read-only access to HDFS over HTTPS.
HAR | har | fs.HarFileSystem | A file system layered on another Hadoop file system for archiving files; Hadoop archives are mainly used to reduce NameNode memory usage.
KFS | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos file system) is a distributed file system similar to HDFS and Google's GFS, written in C++.
FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3.
S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks (much like HDFS) to work around S3's 5 GB file-size limit.
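Because all of these implementations are obtained through the same FileSystem abstraction, client code can stay the same regardless of which file system a URI points at. Below is a minimal sketch of reading a file through the abstract FileSystem API; the HDFS address and file path are hypothetical placeholders:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
        public static void main(String[] args) throws Exception {
            // The URI scheme (file, hdfs, s3n, ...) selects the concrete
            // implementation from the table above.
            String uri = "hdfs://localhost:9000/user/hadoop/sample.txt"; // hypothetical path
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));                    // returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false); // copy the file contents to stdout
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }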

Finally, I want to emphasize how the file system concept fits into Hadoop. The FileSystem abstract class above lives in Hadoop's common project, which mainly defines a set of distributed file system interfaces together with general I/O components and interfaces, so strictly speaking this layer should be called Hadoop I/O. HDFS is Hadoop's distributed file system subproject; it implements that file system interface, so HDFS is one implementation of the Hadoop I/O interfaces.

To make the relevant API operations of the Hadoop file system clearer, here is a table that compares them with their Java and Linux counterparts:

Hadoop FileSystem API | Java | Linux | Description
URL.openStream, FileSystem.open, FileSystem.create, FileSystem.append | URL.openStream | open | Open a file
FSDataInputStream.read | InputStream.read | read | Read data from a file
FSDataOutputStream.write | OutputStream.write | write | Write data to a file
FSDataInputStream.close, FSDataOutputStream.close | InputStream.close, OutputStream.close | close | Close a file
FSDataInputStream.seek | RandomAccessFile.seek | lseek | Change the file read/write position
FileSystem.getFileStatus, FileSystem.get* | File.get* | stat | Get the attributes of a file/directory
FileSystem.set* | File.set* | chmod, etc. | Change the attributes of a file
FileSystem.createNewFile | File.createNewFile | create | Create a file
FileSystem.delete | File.delete | remove | Delete a file from the file system
FileSystem.rename | File.renameTo | rename | Change a file/directory name
FileSystem.mkdirs | File.mkdir | mkdir | Create a subdirectory under a given directory
FileSystem.delete | File.delete | rmdir | Remove an empty subdirectory from a directory
FileSystem.listStatus | File.list | readdir | Read the entries of a directory
FileSystem.getWorkingDirectory | (none) | getcwd/getwd | Return the current working directory
FileSystem.setWorkingDirectory | (none) | chdir | Change the current working directory

With this comparison, everyone should have a clearer picture of the FileSystem API.

As the comparison table shows, Hadoop's FileSystem API has two classes of its own, FSDataInputStream and FSDataOutputStream, which play the same role as InputStream and OutputStream in standard Java I/O; in fact, these two classes extend java.io.DataInputStream and java.io.DataOutputStream respectively.
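One notable difference from plain Java streams is that FSDataInputStream supports random access through its seek() method. Below is a minimal sketch (a hypothetical example, reusing the FileSystem setup shown earlier) that prints a file twice by seeking back to the start:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemDoubleCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0]; // path of a file in HDFS, passed on the command line
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false); // first pass over the file
                in.seek(0);                                     // jump back to the beginning
                IOUtils.copyBytes(in, System.out, 4096, false); // second pass
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }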

I will not cover Hadoop I/O itself in this article; I may write a dedicated post about my own understanding of it later. To give you a first impression in the meantime, here are two articles from cnblogs that interested readers can look at; the links are as follows:

http://www.cnblogs.com/xuqiang/archive/2011/06/03/2042526.html

http://www.cnblogs.com/xia520pi/archive/2012/05/28/2520813.html

5. Data integrity

Data integrity refers to techniques for detecting data corruption. Hadoop users certainly want the system to store and process data without losing or corrupting any of it. Although the chance that a single I/O operation on disk or over the network introduces an error into the data being read or written is small, once the volume of data a system has to handle approaches the scale Hadoop is designed for, the probability that some data gets corrupted becomes quite high. Hadoop therefore has built-in support for verifying data integrity; the principle is described below:

The usual way to detect corruption is to compute a checksum of the data when it first enters the system, and to compute it again after the data has been transmitted through an unreliable channel; if the two checksums do not match, the data is considered corrupted. Note that this technique can only detect errors, it cannot repair the data. A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which produces a 32-bit integer checksum for input of any size.
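As a small illustration (not Hadoop-specific code, just the JDK's CRC-32 implementation), the sketch below computes a checksum when data enters the system and recomputes it later to detect corruption:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ChecksumDemo {
        public static void main(String[] args) {
            byte[] data = "some bytes entering the system".getBytes(StandardCharsets.UTF_8);

            CRC32 crc = new CRC32();
            crc.update(data);                // checksum computed when the data is first stored
            long stored = crc.getValue();    // always a 32-bit value, whatever the input size

            // ... the data travels through an unreliable channel ...

            CRC32 recomputed = new CRC32();
            recomputed.update(data);         // checksum computed again on the receiving side
            boolean corrupted = recomputed.getValue() != stored;
            System.out.println("corrupted = " + corrupted);
        }
    }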

6. Compression and input splits

File compression has two major benefits: it reduces the disk space needed to store files, and it speeds up transferring data over the network or to and from disk. Both benefits matter a great deal for Hadoop, which handles massive amounts of data, so it is worth understanding compression in Hadoop. The following table lists the compression formats Hadoop supports:

Compression format | Tool | Algorithm | File extension | Multiple files | Splittable
DEFLATE | none | DEFLATE | .deflate | No | No
gzip | gzip | DEFLATE | .gz | No | No
ZIP | zip | DEFLATE | .zip | Yes | Yes (at file boundaries)
bzip2 | bzip2 | bzip2 | .bz2 | No | Yes
LZO | lzop | LZO | .lzo | No | Yes
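These formats are exposed in Hadoop through codec classes in org.apache.hadoop.io.compress. As a minimal sketch (with the codec class name passed on the command line, e.g. org.apache.hadoop.io.compress.GzipCodec), the following program compresses standard input to standard output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamCompressor {
        public static void main(String[] args) throws Exception {
            String codecClassname = args[0]; // e.g. org.apache.hadoop.io.compress.GzipCodec
            Class<?> codecClass = Class.forName(codecClassname);
            Configuration conf = new Configuration();
            CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

            CompressionOutputStream out = codec.createOutputStream(System.out);
            IOUtils.copyBytes(System.in, out, 4096, false); // compress stdin to stdout
            out.finish(); // flush the compressed stream without closing System.out
        }
    }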

Two metrics matter when choosing a compression format in Hadoop: the compression ratio and the compression speed. The following table shows how some of the formats perform on these metrics:

Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed
gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s
bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s
LZO-best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s
LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s

Besides compression itself, an important property in Hadoop is whether a compression format supports splitting the file. Next I will discuss this splitting problem, which is the "input splits" part of the title:

Whether a compression format can be split matters for how MapReduce processes the data. Suppose we have a 1 GB compressed file and the HDFS block size is set to 64 MB (I do not explain HDFS blocks in this article; readers who are unfamiliar with them can look them up first, and I will cover them in detail when I write about HDFS). The file is then stored as 16 blocks. If this file is used as the input of a MapReduce job, MapReduce will generate 16 map tasks from these 16 blocks, with each block serving as the input of one map task, and the job executes very efficiently, provided that the compression format supports splitting. If the format does not support splitting, MapReduce can still process the file correctly, but it has to feed all 16 blocks to a single map task; with so few map tasks each task's workload becomes much larger and execution efficiency drops sharply.
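The check that input formats perform can be sketched as follows. This is a simplified illustration, and it assumes a Hadoop version that provides the org.apache.hadoop.io.compress.SplittableCompressionCodec interface (implemented, for example, by the bzip2 codec):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
        // Returns true if a map task may start reading in the middle of the file.
        public static boolean isSplittable(Configuration conf, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
            if (codec == null) {
                return true; // uncompressed files can always be split
            }
            // only codecs that advertise splittability (e.g. bzip2) allow this
            return codec instanceof SplittableCompressionCodec;
        }
    }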

Since my own knowledge is still limited, I will end the discussion of compression and input splits here; below is a related article that interested readers can look at:

http://www.cnblogs.com/ggjucheng/archive/2012/04/22/2465580.html

7. Serialization in Hadoop

Let's first look at two definitions:

Serialization: the process of converting a structured object into a byte stream so that it can be transmitted over the network or stored permanently on disk.

Deserialization: the inverse process of turning a byte stream back into a structured object.

Serialization shows up in two broad areas of distributed data processing: interprocess communication and persistent storage.

In Hadoop, communication between nodes is implemented with remote procedure calls (RPC). RPC serializes the data into a binary stream and sends it to the remote node, and the remote node deserializes the binary byte stream back into the original data after receiving it. Serialization for RPC has its own requirements; an RPC serialization format should be:

    1. Compact: a compact format makes the best use of network bandwidth, the scarcest resource in a data center;
    2. Fast: interprocess communication forms the skeleton of a distributed system, so the performance overhead of serialization and deserialization must be kept to a minimum;
    3. Extensible: protocols change over time to meet new requirements, so it should be possible to evolve them in a controlled way for both clients and servers, for example so that messages serialized under the old protocol can still be handled;
    4. Interoperable: clients and servers written in different languages should be able to interact with each other.

Hadoop defines its own serialization format, Writable, which is one of the core components of Hadoop. Writable is an interface; a class enables Hadoop serialization by implementing its two methods, write(DataOutput) and readFields(DataInput).
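As a quick illustration, here is a minimal sketch of a hypothetical custom Writable holding a pair of ints; the class name and fields are made up for the example:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // hypothetical example type: a pair of ints serialized via Hadoop's Writable interface
    public class IntPairWritable implements Writable {
        private int first;
        private int second;

        public IntPairWritable() { }                 // Writables need a no-arg constructor

        public IntPairWritable(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);   // serialize the fields to the binary stream
            out.writeInt(second);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();  // deserialize in the same order they were written
            second = in.readInt();
        }
    }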

Because of time constraints I will not expand on serialization here either. I recommend the following article on Hadoop serialization; it is simple and not comprehensive, but after reading it you will have a preliminary understanding of how Hadoop serialization is implemented. The link is as follows:

http://blog.csdn.net/a15039096218/article/details/7591072

Excerpted from http://www.cnblogs.com/sharpxiajun/archive/2013/06/15/3137765.html

