1. What is a distributed file system?
A file system that manages storage spread across multiple computers in a network is called a distributed file system.
2. Why do I need a distributed file system?
The simple reason is that when the size of a dataset exceeds the storage capacity of a single physical machine, it becomes necessary to partition the data and store it across several separate machines.
3. Distributed systems are more complex than traditional file systems
Because a distributed file system is built on top of a network, it inherits all the complexity of network programming, which makes it more complex than an ordinary local file system.
4. Hadoop's file systems
Many readers equate HDFS with "the Hadoop file system". In fact, Hadoop provides a general file system abstraction; HDFS is simply Hadoop's flagship implementation, and Hadoop can integrate other file systems besides HDFS. This is a good illustration of Hadoop's extensibility.
Hadoop defines an abstract notion of a file system, concretely the Java abstract class org.apache.hadoop.fs.FileSystem, which specifies the file system interface in Hadoop. Any file system that implements this interface can be used as a file system supported by Hadoop. The file systems that currently implement this abstract class are listed in the following table:
| File system | URI scheme | Java implementation (under org.apache.hadoop) | Description |
| --- | --- | --- | --- |
| Local | file | fs.LocalFileSystem | A file system for a locally attached disk with client-side checksums. The local file system without checksums is implemented in fs.RawLocalFileSystem. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed file system. |
| HFTP | hftp | hdfs.HftpFileSystem | Provides read-only access to HDFS over HTTP; often used with distcp to copy data between HDFS clusters. |
| HSFTP | hsftp | hdfs.HsftpFileSystem | Provides read-only access to HDFS over HTTPS. |
| HAR | har | fs.HarFileSystem | Layered on another Hadoop file system for archiving files; Hadoop archives are mainly used to reduce NameNode memory usage. |
| KFS | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos file system) is a distributed file system similar to HDFS and Google's GFS, written in C++. |
| FTP | ftp | fs.ftp.FTPFileSystem | A file system backed by an FTP server. |
| S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A file system backed by Amazon S3. |
| S3 (block-based) | s3 | fs.s3.S3FileSystem | A file system backed by Amazon S3 that stores files in blocks (much like HDFS) to get around S3's 5 GB file-size limit. |
Finally, I want to emphasize that Hadoop has an abstract file system concept, represented by the FileSystem class above. It lives in the Hadoop Common project and mainly defines a set of interfaces for distributed file systems and general I/O components; strictly speaking, this layer should be called Hadoop I/O. HDFS is the Hadoop subproject that implements this file system interface, so HDFS is one implementation of the Hadoop I/O interface.
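To make the abstraction concrete, here is a minimal sketch (not from the original article) of how client code obtains a concrete implementation through the abstract FileSystem class; the URI and port in the comment are only examples:

```java
// Minimal sketch: the URI scheme ("hdfs://", "file://", "s3n://", ...) decides
// which implementation from the table above FileSystem.get() returns.
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                 // e.g. hdfs://namenode:9000/user/test/a.txt (example only)
        Configuration conf = new Configuration();
        // Returns the matching implementation (DistributedFileSystem, LocalFileSystem, ...).
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));      // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```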
To make the related API operations on Hadoop's file system clearer, here is a comparison table:
| Hadoop FileSystem | Java equivalent | Linux system call | Description |
| --- | --- | --- | --- |
| URL.openStream, FileSystem.open, FileSystem.create, FileSystem.append | URL.openStream | open | Open a file |
| FSDataInputStream.read | InputStream.read | read | Read data from a file |
| FSDataOutputStream.write | OutputStream.write | write | Write data to a file |
| FSDataInputStream.close, FSDataOutputStream.close | InputStream.close, OutputStream.close | close | Close a file |
| FSDataInputStream.seek | RandomAccessFile.seek | lseek | Change the file read/write position |
| FileSystem.getFileStatus, FileSystem.get* | File.get* | stat | Get the attributes of a file/directory |
| FileSystem.set* | File.set* | chmod, etc. | Change the attributes of a file |
| FileSystem.createNewFile | File.createNewFile | create | Create a file |
| FileSystem.delete | File.delete | remove | Delete a file from the file system |
| FileSystem.rename | File.renameTo | rename | Rename a file/directory |
| FileSystem.mkdirs | File.mkdir | mkdir | Create a subdirectory under a given directory |
| FileSystem.delete | File.delete | rmdir | Remove an empty subdirectory from a directory |
| FileSystem.listStatus | File.list | readdir | List the entries of a directory |
| FileSystem.getWorkingDirectory | – | getcwd/getwd | Return the current working directory |
| FileSystem.setWorkingDirectory | – | chdir | Change the current working directory |
This comparison should make FileSystem clearer.
As the table shows, Hadoop's FileSystem works with two stream classes, FSDataInputStream and FSDataOutputStream, which play the role of InputStream and OutputStream in standard Java I/O; in fact, these two classes extend java.io.DataInputStream and java.io.DataOutputStream respectively.
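As a small illustration of what FSDataInputStream adds over a plain java.io.InputStream, here is a hedged sketch that reads a file twice by seeking back to the start; the URI is again just a placeholder:

```java
// Sketch: FSDataInputStream implements Seekable, so seek() lets you re-read
// a file from an arbitrary offset, which a plain InputStream cannot do.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                 // e.g. hdfs://namenode:9000/user/test/a.txt (example only)
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);   // first pass
            in.seek(0);                                        // jump back to the start
            IOUtils.copyBytes(in, System.out, 4096, false);   // read the file again
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```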
I won't cover Hadoop I/O itself in this article; I may write a dedicated article about my own understanding of it later. To leave you with a clearer impression in the meantime, here are two articles I found on cnblogs that interested readers can study; the links are as follows:
Http://www.cnblogs.com/xuqiang/archive/2011/06/03/2042526.html
Http://www.cnblogs.com/xia520pi/archive/2012/05/28/2520813.html
5. Data integrity
Data integrity refers to the techniques for detecting data corruption. Hadoop users naturally want the system to store and process their data without loss or corruption. Although the chance that any single disk or network I/O operation corrupts the data it reads or writes is small, once the volume of data approaches the scale Hadoop is built to handle, the probability of encountering corrupted data becomes high. Hadoop therefore includes data integrity checking, whose principle I describe below:
The way to detect corruption is to compute a checksum when data first enters the system, and to compute it again after the data has passed through an unreliable channel. If the two checksums do not match, the data is considered corrupted. Note that this technique can only detect errors, not repair them. A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.
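To make the checksum idea concrete, here is a small sketch using the JDK's java.util.zip.CRC32 class; Hadoop's own checksum classes differ in detail, so this only illustrates the principle:

```java
// Concept sketch: compute a CRC-32 checksum when data is "written", recompute
// it after an (here simulated) unreliable channel, and compare the two values.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();               // 32-bit checksum held in a long
    }

    public static void main(String[] args) {
        byte[] original = "hello hadoop".getBytes(StandardCharsets.UTF_8);
        long written = checksum(original);   // computed when data enters the system

        byte[] received = original.clone();
        received[0] ^= 0x01;                 // simulate one corrupted bit in transit
        long read = checksum(received);      // recomputed after the unreliable channel

        System.out.println(written == read ? "data intact" : "data corrupted");
    }
}
```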
6. Compression and input splits
File compression has two major benefits: it reduces the disk space needed to store files, and it speeds up data transfer over the network and to and from disk. Both benefits matter a great deal for Hadoop, which processes massive amounts of data, so understanding compression in Hadoop is worthwhile. The following table lists the compression formats Hadoop supports (a short codec sketch follows the table):
| Compression format | Tool | Algorithm | File extension | Multiple files | Splittable |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | N/A | DEFLATE | .deflate | No | No |
| gzip | gzip | DEFLATE | .gz | No | No |
| ZIP | zip | DEFLATE | .zip | Yes | Yes, at file boundaries |
| bzip2 | bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | lzop | LZO | .lzo | No | Yes |
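As a rough illustration of how these codecs are used programmatically, here is a sketch based on the standard org.apache.hadoop.io.compress API: it compresses standard input to standard output with whichever codec class is named on the command line (the choice of GzipCodec in the usage note below is just an example):

```java
// Sketch: instantiate a compression codec by class name and use it to wrap an
// output stream so everything written to it is compressed on the fly.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];      // e.g. org.apache.hadoop.io.compress.GzipCodec
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Wrap System.out with the codec's compressing stream.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();                         // flush compressed data without closing System.out
    }
}
```

Assuming the Hadoop jars are on the classpath, something like `echo "Text" | java StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip` would round-trip the data through gzip.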
Two metrics matter when evaluating compression in Hadoop: the compression ratio and the compression/decompression speed. The following table shows how several formats perform on these metrics:
| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO-best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
Besides supporting compression, it also matters whether Hadoop can split a compressed file, which is the "input splits" of this article's title. Let me explain the splitting issue:
Whether a compression format can be split matters for MapReduce. Suppose we have a 1 GB compressed file and the HDFS block size is set to 64 MB (this article does not explain HDFS blocks; readers unfamiliar with them can look them up first, and I will cover them in detail when I write about HDFS). The file is then stored as 16 blocks. If the file is used as input to a MapReduce job and the compression format supports splitting, MapReduce generates 16 map tasks, one per block, and execution is very efficient. If the format does not support splitting, MapReduce can still process the file correctly, but it must feed all 16 blocks to a single map task. With far fewer map tasks, each task becomes much larger and loses data locality, so execution efficiency drops dramatically.
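The check that MapReduce input formats perform can be sketched roughly as follows; note that the SplittableCompressionCodec interface only exists in newer Hadoop releases, so treat this as an approximation of the idea rather than the exact code path:

```java
// Rough sketch: look up the codec for a file by its extension and treat the
// file as splittable only if there is no codec or the codec supports splitting.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
    public static boolean isSplittable(Configuration conf, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            return true;        // uncompressed files can always be split
        }
        return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println(isSplittable(conf, new Path("/data/logs.gz")));   // false
        System.out.println(isSplittable(conf, new Path("/data/logs.bz2")));  // true where Bzip2Codec
                                                                             // is splittable
        System.out.println(isSplittable(conf, new Path("/data/logs.txt")));  // true
    }
}
```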
Since my knowledge here is still limited, I will stop the discussion of compression and input splits at this point. Below is a related article that interested readers can look at; the link is as follows:
Http://www.cnblogs.com/ggjucheng/archive/2012/04/22/2465580.html
7. Hadoop serialization
Let's take a look at two definitions first:
serialization: converting a structured object into a byte stream so it can be sent over the network or permanently stored on disk.
deserialization: the inverse process of turning a byte stream back into a structured object.
Serialization appears in two main areas of distributed data processing: interprocess communication and persistent storage.
In Hadoop, nodes communicate with each other via remote procedure calls (RPC). RPC serializes the data into a binary stream and sends it to the remote node, which deserializes the binary stream back into the original data. Serialization for RPC has its own requirements; an RPC serialization format should be:
- Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center;
- Fast: interprocess communication forms the skeleton of a distributed system, so the performance overhead of serialization and deserialization must be kept to a minimum;
- Extensible: protocols change to meet new requirements, so it must be possible to introduce new protocol messages in a controlled way between clients and servers, and the serialization format should still be able to handle them;
- Interoperable: it should support interaction between clients and servers written in different languages.
Hadoop has a serialization format of its own, Writable, which is one of the core parts of Hadoop.
Writable is an interface; implementing it is how a class takes part in Hadoop serialization. For lack of time I will not expand on serialization here either. A minimal sketch follows the link below, and I also recommend an article on Hadoop serialization which, although brief and not comprehensive, gives a preliminary understanding of how it is implemented; the link is as follows:
http://blog.csdn.net/a15039096218/article/details/7591072
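As a quick taste before that article, here is a minimal custom Writable sketch; the class and field names (PageViewWritable, url, views) are made up for illustration:

```java
// Minimal custom Writable: write() serializes the fields to a byte stream,
// readFields() rebuilds the object from the stream in the same field order.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
    private final Text url = new Text();
    private final IntWritable views = new IntWritable();

    public void set(String u, int v) {
        url.set(u);
        views.set(v);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize: structured object -> byte stream.
        url.write(out);
        views.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize: byte stream -> structured object.
        url.readFields(in);
        views.readFields(in);
    }

    @Override
    public String toString() {
        return url + "\t" + views;
    }
}
```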
Excerpted from http://www.cnblogs.com/sharpxiajun/archive/2013/06/15/3137765.html
Hadoop Learning notes: A brief analysis of Hadoop file system