Hadoop Learning Notes: An Analysis of the Hadoop File System


1. What is a distributed file system?

A distributed file system is a file system that manages storage across a network of multiple machines.

2. Why do we need a distributed file system?

The reason is simple: when a data set grows beyond the storage capacity of a single physical machine, it becomes necessary to partition it and store the pieces on a number of separate machines.

3. Distributed file systems are more complex than traditional file systems

Because a distributed file system is built on top of a network, it inherits all the complexity of network programming, which makes it more complex than an ordinary local file system.

4. The Hadoop file system

HDFS is usually thought of as Hadoop's file system, but in fact Hadoop provides a general file system abstraction, and HDFS is only its flagship implementation. Besides HDFS, Hadoop can integrate other file systems, which shows Hadoop's excellent extensibility.

Hadoop defines the concept of an abstract file system. Specifically, it defines a Java abstract class, org.apache.hadoop.fs.FileSystem, which describes the file system interface in Hadoop. Any file system that implements this interface can be used by Hadoop. The following table lists the file systems that currently implement this abstract FileSystem class:

 

| File system | URI scheme | Java implementation (all under org.apache.hadoop) | Description |
| --- | --- | --- | --- |
| Local | file | fs.LocalFileSystem | A file system for a locally connected disk with client-side checksums. Use fs.RawLocalFileSystem for a local file system without checksums. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. |
| HFTP | hftp | hdfs.HftpFileSystem | Read-only access to HDFS over HTTP; often used with distcp to copy data between HDFS clusters. |
| HSFTP | hsftp | hdfs.HsftpFileSystem | Read-only access to HDFS over HTTPS. |
| HAR | har | fs.HarFileSystem | Layered on another Hadoop file system to archive files. Hadoop archives are mainly used to reduce namenode memory usage. |
| KFS | kfs | fs.kfs.KosmosFileSystem | The file system of CloudStore (formerly the Kosmos File System), a distributed file system similar to HDFS and Google's GFS, written in C++. |
| FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server. |
| S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3. |
| S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks to overcome S3's 5 GB file size limit. |
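Because all of these implementations sit behind the same abstract class, client code does not change when the underlying file system changes. The sketch below is a minimal illustration of that idea; the namenode address and the paths are placeholders, not values taken from this article:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The URI scheme selects the implementation from the table above:
        // "hdfs://..." -> hdfs.DistributedFileSystem, "file:///" -> fs.LocalFileSystem.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        // The same call works against either instance, because both implement
        // the abstract org.apache.hadoop.fs.FileSystem class.
        System.out.println(hdfs.exists(new Path("/user")));
        System.out.println(local.exists(new Path("/tmp")));
    }
}
```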

Finally, I want to emphasize that the file system abstraction in Hadoop, such as the FileSystem abstract class above, lives in the Hadoop Common project, which mainly defines a set of distributed file system interfaces and general I/O components. Strictly speaking, this layer should be called Hadoop I/O; HDFS is the distributed file system project that ships with Hadoop and implements this interface, so HDFS is one implementation of the Hadoop I/O interfaces.

Next, here is a table that maps the operations of Hadoop's FileSystem API to their Java and Linux counterparts, so you can see clearly how the APIs correspond:

| Hadoop FileSystem | Java | Linux | Description |
| --- | --- | --- | --- |
| URL.openStream, FileSystem.open, FileSystem.create, FileSystem.append | URL.openStream | open | Open a file |
| FSDataInputStream.read | InputStream.read | read | Read data from a file |
| FSDataOutputStream.write | OutputStream.write | write | Write data to a file |
| FSDataInputStream.close, FSDataOutputStream.close | InputStream.close, OutputStream.close | close | Close a file |
| FSDataInputStream.seek | RandomAccessFile.seek | lseek | Change the file read/write position |
| FileSystem.getFileStatus, FileSystem.get* | File.get* | stat | Get file/directory attributes |
| FileSystem.set* | File.set* | chmod, etc. | Modify file attributes |
| FileSystem.createNewFile | File.createNewFile | create | Create a file |
| FileSystem.delete | File.delete | remove | Delete a file from the file system |
| FileSystem.rename | File.renameTo | rename | Change a file/directory name |
| FileSystem.mkdirs | File.mkdir | mkdir | Create a subdirectory under a given directory |
| FileSystem.delete | File.delete | rmdir | Delete an empty subdirectory |
| FileSystem.listStatus | File.list | readdir | List the entries of a directory |
| FileSystem.getWorkingDirectory | — | getcwd/getwd | Return the current working directory |
| FileSystem.setWorkingDirectory | — | chdir | Change the current working directory |

With this table, you should have a much clearer understanding of FileSystem.

You can also see from the table that Hadoop's FileSystem works with two stream classes, FSDataInputStream and FSDataOutputStream, which play the same role as InputStream and OutputStream in Java I/O; in fact, they extend java.io.DataInputStream and java.io.DataOutputStream respectively.
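As a small hedged example of these classes in action, the sketch below writes a string with FileSystem.create and reads it back with FileSystem.open, exercising the open/read/write/seek/close rows of the table; the path and the string are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/demo.txt");   // placeholder path

        // FileSystem.create -> FSDataOutputStream (the "write data to a file" row of the table).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hadoop");
        }

        // FileSystem.open -> FSDataInputStream (the "open"/"read" rows of the table).
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
            in.seek(0);                          // FSDataInputStream.seek, like lseek
            System.out.println(in.readUTF());    // read the same content again
        }
    }
}
```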

As for Hadoop I/O, I will not introduce it today; I may write a dedicated article about my own understanding later. To give you a first impression, I found two articles in my blog. If you are interested, you can follow the links below:

Http://www.cnblogs.com/xuqiang/archive/2011/06/03/2042526.html

Http://www.cnblogs.com/xia520pi/archive/2012/05/28/2520813.html

5. Data Integrity

Data integrity refers to techniques for checking whether data has been corrupted. Hadoop users certainly want the system to store and process data without any loss or corruption. Although each individual I/O operation on a disk or network is unlikely to introduce errors into the data being read or written, when a system processes data at the volumes Hadoop is designed for, the probability of data corruption becomes significant. Hadoop therefore provides a data integrity verification feature, whose principle is as follows:

The way to detect corrupted data is to compute a checksum when the data first enters the system, and to compute it again after the data has passed through an unreliable channel; comparing the two shows whether the data was damaged. If the two checksums do not match, the data is considered corrupted. Note that this technique can only detect errors, not repair them. A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which produces a 32-bit integer checksum for input of any size.
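To make the checksum idea concrete, here is a small sketch built on the standard java.util.zip.CRC32 class; it is not Hadoop's own checksum code, but it uses the same CRC-32 principle of computing on write and re-checking on read:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC-32 checksum over a block of bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();   // a 32-bit checksum, returned as a long
    }

    public static void main(String[] args) {
        byte[] original = "some block of data".getBytes(StandardCharsets.UTF_8);
        long storedChecksum = checksum(original);   // computed when the data enters the system

        // ... later the data is read back, possibly after crossing an unreliable channel ...
        byte[] readBack = original.clone();

        // Recompute and compare: a mismatch means the block was corrupted in transit or on disk.
        if (checksum(readBack) != storedChecksum) {
            System.out.println("Data corruption detected");
        } else {
            System.out.println("Checksums match");
        }
    }
}
```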

6. Compression and input splits

File compression has two benefits: it reduces the disk space needed to store files, and it speeds up data transfer over the network and to or from disk. For Hadoop, which processes massive amounts of data, both benefits are very important, so it is worth understanding Hadoop's compression support. The following table lists the compression formats supported by Hadoop:

| Compression format | Tool | Algorithm | File extension | Multiple files | Splittable |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | None | DEFLATE | .deflate | No | No |
| gzip | gzip | DEFLATE | .gz | No | No |
| ZIP | zip | DEFLATE | .zip | Yes | Yes, at file boundaries |
| bzip2 | bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | lzop | LZO | .lzo | No | Yes |

Two metrics matter for compression in Hadoop: compression ratio and compression speed. The following table shows how some of the formats perform in these respects:

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO-best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
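To show how one of these codecs is used in practice, here is a rough sketch that compresses a file with Hadoop's gzip codec (org.apache.hadoop.io.compress); the input path is a placeholder, and the same pattern applies to the other codecs in the tables above:

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the gzip codec, which produces the ".gz" format from the table.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        Path input = new Path("/tmp/input.txt");                                   // placeholder
        Path output = new Path("/tmp/input.txt" + codec.getDefaultExtension());    // appends ".gz"

        // Stream the raw bytes through the compressor into the output file.
        try (InputStream in = fs.open(input);
             CompressionOutputStream out = codec.createOutputStream(fs.create(output))) {
            IOUtils.copyBytes(in, out, conf);
        }
    }
}
```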


Within Hadoop's compression support, whether a format supports splitting is also very important. Next I will discuss the splitting problem, that is, the input splits mentioned in my title:

Whether a compression format can be split matters for how MapReduce processes the data. For example, suppose we have a file that compresses to 1 GB. If the HDFS block size is set to 64 MB (I have not yet explained HDFS blocks in this article; if you are unfamiliar with them, look them up first, and I will cover them in detail when I write about HDFS), the file will be stored as 16 blocks. If this file is used as input to a MapReduce job, MapReduce will create 16 map tasks for the 16 blocks, with each block serving as the input of one map task, so the job runs efficiently. The premise, however, is that the compression format must support splitting. If it does not, MapReduce can still process the file correctly, but it has to feed all 16 blocks to a single map task. With fewer map tasks the job granularity becomes coarser, and execution efficiency drops significantly.
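As a rough illustration of how code can check splittability, the sketch below uses Hadoop's CompressionCodecFactory to look up a codec by file extension and tests whether it implements the SplittableCompressionCodec interface; the file names are placeholders, and the exact set of codec classes depends on your Hadoop version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());

        for (String name : new String[] {"logs.gz", "logs.bz2", "logs.txt"}) {   // placeholder names
            CompressionCodec codec = factory.getCodec(new Path(name));
            if (codec == null) {
                // No codec matched the extension: treated as uncompressed, so splittable.
                System.out.println(name + ": uncompressed, splittable");
            } else if (codec instanceof SplittableCompressionCodec) {
                // e.g. bzip2: each split can be decompressed independently.
                System.out.println(name + ": compressed and splittable");
            } else {
                // e.g. gzip: the whole file has to go to a single map task.
                System.out.println(name + ": compressed but not splittable");
            }
        }
    }
}
```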

As my knowledge is still limited, I will stop here on compression and input splits. Below is a related article; interested readers can follow the link:

Http://www.cnblogs.com/ggjucheng/archive/2012/04/22/2465580.html

7. Hadoop serialization

Let's first look at two definitions:

Serialization: the process of converting a structured object into a byte stream so that it can be transmitted over the network or written to disk for permanent storage.

Deserialization: the inverse process of converting a byte stream back into a structured object.

Serialization appears in two main areas of large-scale distributed data processing: interprocess communication and permanent storage.

In Hadoop, communication between nodes is implemented with remote procedure calls (RPC). RPC serializes data into a binary stream and sends it to the remote node, and the remote node deserializes the binary stream back into the original data. Serialization for RPC has the following desired properties:

    1. Compact: a compact format makes the best use of network bandwidth, the scarcest resource in a data center.
    2. Fast: interprocess communication forms the backbone of a distributed system, so the performance overhead of serialization and deserialization must be kept as small as possible.
    3. Extensible: protocols change over time to meet new requirements, so it should be possible to evolve the protocol in a controlled way for clients and servers; for example, servers should still accept packets in the old format from older clients.
    4. Interoperable: clients and servers written in different languages should be able to interact.

Hadoop has its own serialization format, Writable, which is one of the core parts of Hadoop.

Writable is an interface; to implement Hadoop serialization, a class must implement this interface. Due to limited time I will not go deeper into serialization here, but I recommend an article that describes Hadoop serialization. Although it is simple and not comprehensive, after reading it you will have a preliminary understanding of how Hadoop serialization is actually implemented. The link is as follows:

Http://blog.csdn.net/a15039096218/article/details/7591072
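To give a feel for the interface before you read further, here is a minimal custom Writable sketch; the class and field names are made up for illustration, and all it has to do is implement the two methods of org.apache.hadoop.io.Writable:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A hypothetical record with two fields, used only to illustrate the Writable contract.
public class PageViewWritable implements Writable {
    private String url;
    private long views;

    // Serialize the fields to a binary stream (used for RPC and for storage).
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(views);
    }

    // Deserialize the fields in exactly the order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        views = in.readLong();
    }
}
```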

 

 

 
