In addition to the "ordinary" file, HDFS introduces a number of specialized file types (such as SequenceFile, MapFile, SetFile, ArrayFile, and BloomMapFile) that provide richer functionality and typically simplify data processing.
SequenceFile provides a persistent data structure for binary key/value pairs. All instances of the key must be of the same Java class, and likewise for the value, although their sizes can differ. Like other Hadoop files, SequenceFiles are append-only.
When you use an ordinary (text or binary) file to hold key/value pairs (a typical data structure used by MapReduce), the data store knows nothing about the layout of keys and values, and that layout must be implemented in a reader built on top of generic storage. Using SequenceFile provides a storage mechanism with native support for key/value structures, which simplifies the implementation of this data layout.
SequenceFile has three available formats: uncompressed, record-compressed, and block-compressed. The first two are stored in a record-based format (as shown in Figure 2-2), while the third uses a block-based format (as shown in Figure 2-3).
For a sequence file, the choice of format determines the size of the file on disk. Block-compressed files are usually the smallest, while uncompressed files are the largest.
In Figure 2-2 and Figure 2-3, the header contains general information about the SequenceFile, as shown in Table 2-1.
Table 2-1 SequenceFile Header
Version: A 4-byte array containing the three characters SEQ followed by the sequence file version number (4 or 6). Version 6 is currently used; version 4 is supported for backward compatibility.
Key class: The class name of the key, validated against the key class name provided by the reader.
Value class: The class name of the value, validated against the value class name provided by the reader.
Compression: The key/value compression flag.
Block compression: The block compression flag.
Compression codec: The CompressionCodec class. This class is used only if the key/value or block compression flag is true; otherwise, the value is ignored.
Metadata: An optional list of key/value pairs that can be used to add user-specific information to the file.
Sync: A sync identifier.
Note:
Sync is a special marker used to enable faster seeking within a SequenceFile. Sync markers also play a special role in the MapReduce implementation: data splits can be made only at sync boundaries.
As shown in Table 2-2, each record contains the actual key and value data, along with their lengths.
Table 2-2 Record Layout
Record length: The length (in bytes) of the entire record.
Key length: The length (in bytes) of the key.
Key: A byte array containing the record's key.
Value: A byte array containing the record's value.
In the block-based format, the header and sync serve the same purpose as in the record-based SequenceFile format. The actual data is contained in blocks, as shown in Table 2-3.
Table 2-3 Block Layout
Keys lengths length: All the keys in a given block are saved together. This field specifies the compressed size (in bytes) of the keys-lengths block.
Keys lengths: A byte array containing the compressed keys-lengths block.
Keys length: The compressed size (in bytes) of the keys.
Keys: A byte array containing the compressed keys for the block.
Values lengths length: All the values in a given block are saved together. This field specifies the compressed size (in bytes) of the values-lengths block.
Values lengths: A byte array containing the compressed values-lengths block.
Values length: The compressed size (in bytes) of the values.
Values: A byte array containing the compressed values for the block.
All formats use the same header, which contains information that lets the reader identify the file. The header (see Table 2-1) contains the class names of the keys and values (used by the reader to instantiate those classes), the version number, and the compression information. If compression is enabled, the Compression Codec class name field is added to the header.
SequenceFile metadata is a set of key/value text pairs that can contain additional information about the SequenceFile for use by the file reader/writer.
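As an illustration (not from the original text), adding user-specific metadata amounts to populating a SequenceFile.Metadata instance with Text key/value pairs before passing it to the writer; the metadata keys shown here are hypothetical:
// Sketch: attach user-defined metadata to a SequenceFile header
// (SequenceFile.Metadata and Text are in org.apache.hadoop.io).
SequenceFile.Metadata metadata = new SequenceFile.Metadata();
metadata.set(new Text("createdBy"), new Text("ingest-job"));   // hypothetical entries
metadata.set(new Text("schemaVersion"), new Text("1"));
// The instance is then passed to the SequenceFile.Writer constructor
// (as in Code Listing 2-3) and written into the file header.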
The implementation of a write operation is very similar for the uncompressed format and the record-compressed format. Each call to the append() method adds a record to the SequenceFile containing the length of the entire record (the length of the key plus the length of the value), the length of the key, and the raw data of the key and value. The difference between the compressed and uncompressed versions is whether the raw data is compressed with the specified codec.
The block compression format achieves a higher compression ratio. Data is not written until a threshold (the block size) is reached, at which point all keys are compressed together. The values, as well as the lists of key and value lengths, are also compressed.
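As a hedged sketch of how this looks in code (the codec choice and the reuse of fs, conf, and path from Code Listing 2-3 are assumptions, not part of the text), a block-compressed SequenceFile can be created through one of the SequenceFile.createWriter() factory overloads:
// Sketch: create a block-compressed SequenceFile; DefaultCodec is only an example codec
// (org.apache.hadoop.io.compress.DefaultCodec).
SequenceFile.Writer blockWriter = SequenceFile.createWriter(fs, conf, path,
    Key.class, Value.class,
    SequenceFile.CompressionType.BLOCK, new DefaultCodec());
blockWriter.append(key, value);      // records are buffered and flushed block by block
IOUtils.closeStream(blockWriter);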
Hadoop provides a special reader (SequenceFile.Reader) and writer (SequenceFile.Writer) for SequenceFiles. Code Listing 2-3 shows a small code snippet that uses SequenceFile.Writer.
Code Listing 2-3: Using SequenceFile.Writer
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("fileName");
SequenceFile.Writer sequenceWriter = new SequenceFile.Writer(fs, conf, path,
    Key.class, Value.class, fs.getConf().getInt("io.file.buffer.size", 4096),
    fs.getDefaultReplication(), 1073741824, null, new Metadata());
.......................................................................
sequenceWriter.append(bytesWritable, bytesWritable);
..........................................................
IOUtils.closeStream(sequenceWriter);
A simplified SequenceFile.Writer constructor (SequenceFile.Writer(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass)) requires only the file system, the Hadoop configuration, the path (file location), and the class definitions of the key and value. The constructor used in the previous example supports specifying additional file parameters, including the following:
int bufferSize: if not defined, the default buffer size (4096) is used.
short replication: the default replication is used.
long blockSize: the value 1073741824 (1024MB) is used.
Progressable progress: none is used.
SequenceFile.Metadata metadata: an empty Metadata class is used.
Once the writer is created, it can be used to add key/value records to the file.
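For completeness, the following is a minimal, hedged sketch of reading the records back with SequenceFile.Reader (it reuses fs, conf, and path from Code Listing 2-3 and is not part of the original listing):
// Sketch: iterate over all records in a SequenceFile.
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
try {
    // Instantiate key/value objects of the classes recorded in the file header.
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        // process the key/value pair here
    }
} finally {
    IOUtils.closeStream(reader);
}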
One limitation of SequenceFile is that you cannot seek to a record based on its key. Other Hadoop file types (MapFile, SetFile, ArrayFile, and BloomMapFile) overcome this limitation by adding a key-based index on top of SequenceFile. As shown in Figure 2-4, a MapFile is actually not a file but a directory containing two files: a data (sequence) file containing all the keys and values in the map, and a smaller index file containing a subset of the keys. You create a MapFile by adding entries to it in order. A MapFile uses its index to search for and retrieve the file's contents efficiently.
The index file contains a key and a LongWritable object that holds the starting byte position of the record corresponding to that key. The index file does not contain all the keys, only a subset of them. You can set the index interval using the writer's setIndexInterval() method. The index is read entirely into memory, so for a large map the index interval must be set so that the index file is small enough to be loaded fully into memory.
As with SequenceFile, Hadoop provides a special reader (MapFile.Reader) and writer (MapFile.Writer) for MapFiles.
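The following hedged sketch (directory name, key/value types, and index interval are illustrative assumptions) shows the basic MapFile.Writer and MapFile.Reader pattern; note that keys must be appended in sorted order:
// Sketch: write a MapFile (keys must be added in sorted order), then look up a key.
MapFile.Writer mapWriter = new MapFile.Writer(conf, fs, "myMap", Text.class, Text.class);
mapWriter.setIndexInterval(128);              // index every 128th key
mapWriter.append(new Text("a"), new Text("value-a"));
mapWriter.append(new Text("b"), new Text("value-b"));
mapWriter.close();

MapFile.Reader mapReader = new MapFile.Reader(fs, "myMap", conf);
Text value = new Text();
mapReader.get(new Text("b"), value);          // positions via the in-memory index, then scans the data file
mapReader.close();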
SetFile and ArrayFile are MapFile variants that implement specific key/value types. SetFile is a MapFile representing a set of keys with no values (the value is represented by a NullWritable instance). ArrayFile handles key/value pairs whose keys are consecutive long values. It maintains an internal counter, incremented on every append call, whose value is used as the key.
These two file types are useful for saving keys rather than values.
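A short, hedged sketch of both variants (names and compression type are illustrative assumptions):
// Sketch: SetFile stores only keys; ArrayFile generates consecutive long keys automatically.
SetFile.Writer setWriter = new SetFile.Writer(conf, fs, "mySet",
    Text.class, SequenceFile.CompressionType.NONE);
setWriter.append(new Text("memberKey"));       // the value side is a NullWritable internally
setWriter.close();

ArrayFile.Writer arrayWriter = new ArrayFile.Writer(conf, fs, "myArray", Text.class);
arrayWriter.append(new Text("first value"));   // stored under key 0
arrayWriter.append(new Text("second value"));  // stored under key 1
arrayWriter.close();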
Bloom Filter
A Bloom filter is a space-efficient, probabilistic data structure used to test whether an element is a member of a set. The result of the test is that the element either is definitely not in the set or may be in the set.
The underlying data structure of a Bloom filter is a bit vector. The likelihood of false positives depends on the size of the element set and the size of the bit vector.
Although false positives are possible, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or lists. Most of these data structures store at least the entries themselves, which can require anywhere from a small number of bits (for small integers) to an arbitrary number of bits, as for strings (tries are a special case, because storage can be shared between elements with the same prefix).
Part of the Bloom filter's advantage stems from its compactness (inherited from arrays), and part from its probabilistic nature.
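To make the idea concrete, here is a minimal toy sketch of a Bloom filter built on a plain bit vector (an illustration of the concept, not Hadoop's implementation; the hashing scheme and sizes are arbitrary):
// Toy Bloom filter: a bit vector plus k hash-derived positions (illustration only).
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the vector
    private final int hashCount;  // number of hash functions (k)

    public ToyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from the element's hash code.
    private int position(Object element, int i) {
        int h = element.hashCode() ^ (i * 0x9E3779B9);
        return Math.floorMod(h, size);
    }

    public void add(Object element) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(element, i));
        }
    }

    // false means "definitely not in the set"; true means "possibly in the set".
    public boolean mightContain(Object element) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }
}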
Finally, BloomMapFile extends the MapFile implementation by adding a dynamic Bloom filter (see the sidebar, "Bloom Filter"), which provides a fast membership test for keys. It also offers a fast version of the key search operation, especially for sparse MapFiles. The writer's append() operation updates the DynamicBloomFilter, which is serialized when the writer is closed. The filter is loaded into memory when a reader is created. The reader's get() operation first checks the filter for key membership, and if the key is not present, it immediately returns null without doing any further I/O.
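A hedged sketch of the behavior just described (directory name and types are illustrative; the classes are BloomMapFile.Writer and BloomMapFile.Reader from org.apache.hadoop.io):
// Sketch: BloomMapFile lookups consult the in-memory Bloom filter before any disk seek.
BloomMapFile.Writer bloomWriter = new BloomMapFile.Writer(conf, fs, "myBloomMap",
    Text.class, Text.class);
bloomWriter.append(new Text("present"), new Text("value"));  // append() also updates the filter
bloomWriter.close();                                         // the dynamic Bloom filter is serialized here

BloomMapFile.Reader bloomReader = new BloomMapFile.Reader(fs, "myBloomMap", conf);
Text value = new Text();
Writable hit  = bloomReader.get(new Text("present"), value); // filter says "maybe", so the data file is searched
Writable miss = bloomReader.get(new Text("absent"), value);  // filter says "no": returns null with no further I/O
bloomReader.close();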
Data compression
An important factor to consider when storing data in HDFS files is data compression, which shifts the computational load in data processing from I/O to the CPU. Several publications provide systematic evaluations of the compute versus I/O trade-offs of using compression in MapReduce implementations, and their results show that the benefits of data compression depend on the type of processing job. For applications dominated by large read operations (where I/O is the bottleneck), such as text data processing, compression saves 35 to 60 percent of the performance overhead. For compute-intensive (CPU-bound) applications, on the other hand, the performance gain from data compression is negligible.
This does not mean that data compression has no benefit for such applications. Hadoop cluster resources are shared, so a reduction in one application's I/O load improves the performance of the other applications that use that I/O.
Does this mean that data compression should always be used? The answer is no. For example, if you are using text files or custom binary input files, compression may be undesirable because compressed files cannot be split (you will learn more about this in Chapter 3). On the other hand, for SequenceFiles and their derived file types, compression is always desirable. Finally, it always makes sense to compress the intermediate files used for shuffle and sort (you will learn more about this in Chapter 3).
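As an illustration only, enabling compression of the intermediate map output usually comes down to a couple of configuration properties; the property names below are the classic (pre-YARN) ones and vary by Hadoop version, so treat this as a sketch:
// Sketch: enable compression of intermediate map output (classic Hadoop property names).
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
    DefaultCodec.class, CompressionCodec.class);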
Keep in mind that the results of data compression depend largely on the type of data being compressed and on the compression algorithm used.