SequenceFile and MapFile in HDFS


Hadoop's HDFS and MapReduce sub-frameworks are designed primarily for large data files. Processing many small files is not only inefficient but also consumes a lot of memory: each small file occupies a block, and the metadata of each block is stored in the NameNode's memory. The solution is to choose a container that organizes these small files for unified storage. HDFS provides two such containers: SequenceFile and MapFile.

I. SequenceFile

A SequenceFile is stored much like a log file. The difference is that each record in a log file is plain text, whereas each record in a SequenceFile is a serialized key-value pair stored as byte sequences.

A new record can be appended to a SequenceFile with the following API:

fileWriter.append(key, value)

As you can see, each record is organized as a key-value pair, provided that the key class and value class support serialization and deserialization.

Hadoop predefines a number of key and value classes that directly or indirectly implement the Writable interface to satisfy this requirement (a minimal custom example is sketched after the list), including:

Text, roughly equivalent to String in Java
IntWritable, equivalent to int in Java
BooleanWritable, equivalent to boolean in Java
...
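For types not covered by the built-in classes, a custom type only needs to implement the Writable interface. Below is a minimal sketch; the class name PointWritable and its fields are hypothetical illustrations, not part of the original article.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value type: a 2D point serialized as two ints
public class PointWritable implements Writable {
    private int x;
    private int y;

    public PointWritable() {}  // no-arg constructor required for deserialization

    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);  // serialize fields in a fixed order
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt(); // deserialize in the same order
        y = in.readInt();
    }
}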

In terms of storage structure, a SequenceFile consists of a header followed by multiple records:

The header mainly contains the key class name, the value class name, the compression algorithm used for storage, user-defined metadata, and other information. It also contains synchronization markers (sync points) used to quickly locate record boundaries.
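As a rough illustration of how these sync markers can be used, SequenceFile.Reader.sync() repositions a reader at the next record boundary past an arbitrary byte offset. The file name and the offset below are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path("seqFile.seq"), conf);
        // Jump to the first sync marker past byte offset 1024 (illustrative offset),
        // then read whole records from that boundary onward
        reader.sync(1024);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        IOUtils.closeStream(reader);
    }
}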

Each record is stored as a key-value pair, and its byte layout can be parsed as: record length, key length, key, and value. The structure of the value part depends on whether the record is compressed.

Data compression saves disk space and speeds up network transfers. SequenceFile supports two forms of compression: record compression and block compression.

In record compression, the value of each record is compressed individually (keys are stored uncompressed).

Block compression organizes a series of records into a single block and compresses them together:

Each block mainly stores: the number of records contained in the block, the set of key lengths of each record, the set of keys, the set of value lengths of each record, and the set of values.

Note: The size of each block can be specified through the io.seqfile.compress.blocksize property.
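A sketch of creating writers for both compression modes via SequenceFile.createWriter(); the file names and the 4 MB block-size value are illustrative assumptions, not from the original article:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CompressionDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Illustrative: set the block-compression buffer size to 4 MB
        conf.setInt("io.seqfile.compress.blocksize", 4 * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Record compression: each value is compressed individually
        SequenceFile.Writer recordWriter = SequenceFile.createWriter(
                fs, conf, new Path("record.seq"), Text.class, Text.class,
                SequenceFile.CompressionType.RECORD);
        recordWriter.append(new Text("key"), new Text("value"));
        IOUtils.closeStream(recordWriter);

        // Block compression: records are buffered and compressed together per block
        SequenceFile.Writer blockWriter = SequenceFile.createWriter(
                fs, conf, new Path("block.seq"), Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        blockWriter.append(new Text("key"), new Text("value"));
        IOUtils.closeStream(blockWriter);
    }
}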

Example: SequenceFile read/write operations

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path seqFile = new Path("seqFile.seq");

        // The Writer inner class performs write operations;
        // here both key and value are of type Text
        SequenceFile.Writer writer =
                new SequenceFile.Writer(fs, conf, seqFile, Text.class, Text.class);
        // Append a record through the writer
        writer.append(new Text("key"), new Text("value"));
        IOUtils.closeStream(writer); // close the write stream

        // The Reader inner class reads SequenceFile files
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqFile, conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key);
            System.out.println(value);
        }
        IOUtils.closeStream(reader); // close the read stream
    }
}
II. MapFile

A MapFile is a sorted SequenceFile. Looking at its directory structure shows that a MapFile consists of two parts: a data file and an index file.

The index serves as the data file's index: it records the key of each indexed record and that record's offset within the data file. When a MapFile is accessed, the index file is loaded into memory, and this mapping lets the reader quickly locate the position of a requested record. Compared with SequenceFile, MapFile retrieval is therefore efficient, but it consumes some memory to hold the index data.
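As a hedged sketch of this index-backed random access, MapFile.Reader.get() looks up a single record by key, returning null when the key is absent; the file name is an illustrative assumption:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookupDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, "mapFile.map", conf);
        // get() uses the in-memory index to seek near the key, then scans to it
        Text value = new Text();
        if (reader.get(new Text("key"), value) != null) {
            System.out.println(value);
        }
        IOUtils.closeStream(reader);
    }
}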

Note that a MapFile does not index every record. By default, one index entry is stored for every 128 records. This interval can be changed with the setIndexInterval() method of MapFile.Writer, or by modifying the io.map.index.interval property.
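Both ways of changing the interval, in one short sketch; the interval value of 32 and the file name are illustrative assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class IndexIntervalDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Option 1: set the property before creating the writer
        conf.setInt("io.map.index.interval", 32);
        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, "mapFile.map", Text.class, Text.class);
        // Option 2: set the interval directly on the writer
        writer.setIndexInterval(32);
        writer.append(new Text("key"), new Text("value"));
        IOUtils.closeStream(writer);
    }
}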

In addition, unlike SequenceFile, a MapFile's key class must implement the WritableComparable interface; that is, keys must be comparable, so that records can be kept in sorted order.
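To make this concrete, here is a minimal hypothetical key class implementing WritableComparable (the class name IdKey and its field are illustrative, mirroring the earlier PointWritable sketch):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: comparable, so a MapFile can keep records sorted by it
public class IdKey implements WritableComparable<IdKey> {
    private long id;

    public IdKey() {}
    public IdKey(long id) { this.id = id; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeLong(id); }

    @Override
    public void readFields(DataInput in) throws IOException { id = in.readLong(); }

    @Override
    public int compareTo(IdKey other) { return Long.compare(id, other.id); }
}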

Example: MapFile read/write operations

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path mapFile = new Path("mapFile.map");

        // The Writer inner class performs write operations;
        // here both key and value are of type Text
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, mapFile.toString(), Text.class, Text.class);
        // Append a record through the writer (keys must be added in sorted order)
        writer.append(new Text("key"), new Text("value"));
        IOUtils.closeStream(writer); // close the write stream

        // The Reader inner class reads MapFile files
        MapFile.Reader reader = new MapFile.Reader(fs, mapFile.toString(), conf);
        Text key = new Text();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key);
            System.out.println(value);
        }
        IOUtils.closeStream(reader); // close the read stream
    }
}

Note: Although MapFile and SequenceFile can solve the storage problem of small files in HDFS, they also have some limitations, such as:
1. The files do not support rewriting: you cannot append records to an existing SequenceFile (or MapFile).
2. While the write stream is open, no read stream can be constructed; that is, a file cannot be read while it is being written.
