Hadoop RCFile storage format (source analysis, code example)

Source: Internet
Author: User
Tags: hadoop, mapreduce

RCFile (Record Columnar File) is a SequenceFile-like key/value-pair data file. Keywords: Record, Columnar, Key, Value.

Where does RCFile's advantage lie, and which scenarios is it meant for? To build some intuition, let's look at an example. Suppose we have a Hive table of 9 rows and 3 columns stored as a plain TextFile, and we need to count how many times the value of the second column (col2) equals "row5_col2". We would usually write SQL like:

select count(*) from table where col2 = 'row5_col2'

This Hive SQL is translated into a corresponding MapReduce program for execution. Although we only need the table's second column to get the result, the TextFile storage format forces us to read the entire table into the computation. We can use a compression mechanism to optimize the storage and reduce the amount of data read, but the effect is usually not significant, and we still end up reading a lot of useless data (col1, col3).

How would RCFile store the data of this table? At a high level it takes three steps:

(1) Horizontal partitioning: the data blocks produced by the horizontal partitioning are called row splits, or records.

(2) Vertical partitioning: each row split (record) is then partitioned vertically by "column".

(3) Column storage: RCFile stores data in units of records. Within a record it first stores all the data of the first column, then all the data of the second column, and so on column by column, before moving on to the next record.

A record actually consists of a key and a value. The key holds the record's metadata, such as the number of columns, the length of each column's data, and the length of each column value within each column; the value holds the record's column data. The key is effectively the record's index, which makes it easy to read or filter only some columns inside the record.

By turning "row" storage into "column" storage, RCFile gathers similar data together with high probability, so compression works better.

To master the storage format of a data file, it is necessary to understand how the data is written; reading is simply the reverse of writing. RCFile provides a Writer class and a Reader class for writing and reading respectively; this article only discusses the Writer implementation.

Source Code Analysis

Generally speaking, the whole process of writing an RCFile can be divided into three steps:

(1) build an RCFile.Writer instance -- Writer(...)
(2) write data through the RCFile.Writer instance -- append
(3) close the RCFile.Writer instance -- close

We will analyze the corresponding source code following these three steps.

1. Writer

The Writer constructor does three things:

(1) Initialize some variables:

a. RECORD_INTERVAL: how many "rows" of data form a row split (record);
b. columnNumber: how many "columns" the current RCFile stores;
c. Metadata: the Metadata instance holds a single property, "hive.io.rcfile.column.number", whose value is columnNumber; it is serialized into the RCFile header;
d. columnsBufferSize: an upper threshold on the amount of cached data; once it is exceeded, the cached rows are turned into a row split (record).

(2) Construct some data structures:

a. columnValuePlainLength: holds the size of the raw data of each column within one row split (record);
b. columnBuffers: holds the raw data of one row split (record);
c. key: holds the metadata of one row split (record);
d. plainTotalColumnLength: holds, per column, the size of the raw data in the whole RCFile;
e. comprTotalColumnLength: holds, per column, the size of the raw data after compression in the whole RCFile.

(3) Initialize the file output stream and write the file header:

a. initialize the RCFile output stream (FSDataOutputStream); useNewMagic defaults to true, and this article assumes that default;
b. initializeFileHeader: i. write the magic; ii. write the current RCFile version number (different versions of RCFile have different formats);
c. writeFileHeader: i. write whether compression is used (this article assumes compression is enabled); ii. write the class name of the compression codec (CompressionCodec); iii. serialize the Metadata instance;
d. finalizeFileHeader: write a "sync" marker, which denotes the end of the RCFile header.

From this we can conclude that the RCFile header has the following structure:
version             3 bytes of magic header "RCF", followed by 1 byte of actual version number
compression         a boolean which specifies whether compression is turned on for the keys/values in this file
compression codec   the CompressionCodec class used for compression of keys and/or values (written only if compression is enabled)
metadata            metadata for this file
sync                a sync marker to denote the end of the header
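The following is a minimal, illustrative Java sketch of this header layout. It is not the Hive source: the path, version value, codec name, and the use of Text.writeString and a simple count-plus-pairs encoding for the metadata are assumptions made for the example.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.security.SecureRandom;

import org.apache.hadoop.io.Text;

// Illustrative only: writes the header fields from the table above, in order.
public class RcFileHeaderSketch {
  public static void main(String[] args) throws Exception {
    byte[] sync = new byte[16];                  // the real writer derives this from a hash of a UID
    new SecureRandom().nextBytes(sync);

    DataOutputStream out =
        new DataOutputStream(new FileOutputStream("/tmp/rcfile-header.demo")); // hypothetical path
    out.write(new byte[] {'R', 'C', 'F'});       // 3-byte magic
    out.writeByte(1);                            // version number
    out.writeBoolean(true);                      // compression flag
    Text.writeString(out, "org.apache.hadoop.io.compress.DefaultCodec"); // codec class name (only when compressed)
    out.writeInt(1);                             // metadata: number of key/value pairs
    Text.writeString(out, "hive.io.rcfile.column.number");               // the single metadata property ...
    Text.writeString(out, "3");                  // ... holding columnNumber
    out.write(sync);                             // 16-byte sync marks the end of the header
    out.close();
  }
}
```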
2. Append

RCFile.Writer accepts data through append() in the form of a BytesRefArrayWritable instance; one BytesRefArrayWritable instance represents one "row" of data.

The process of appending a "row" is as follows:

(1) Each "column" value is parsed out of the "row" (the BytesRefArrayWritable instance val) and cached into the corresponding ColumnBuffer (columnBuffers[i]); if the row contains fewer columns than columnNumber, the missing columns are filled with an empty value (BytesRefWritable.ZeroBytesRefWritable).

As we can see, RCFile still appends data "row" by "row"; the row-to-column conversion happens internally. The converted column data (columnNumber columns) is cached in per-column buffers, i.e. each column has its own ColumnBuffer; this prepares the data for the later "column storage".

A quick introduction to ColumnBuffer: its job is to cache "column data". It contains two instance variables which, as their names suggest, do the actual caching: columnValBuffer caches the "column values", and valLenBuffer caches the respective lengths of those values. Both are NonSyncDataOutputBuffer instances.

From the code, NonSyncDataOutputBuffer is built on an in-memory byte array (buf) and extends DataOutputStream, which lets us operate on the data in "stream" form.

When valLenBuffer caches the lengths of the "column values", it uses a trick to save storage space: if the lengths to store are "1, 1, 1, 2", that is, four integers of which the first three are identical, it stores "1, ~2, 2" instead, where "~2" means "repeat the preceding integer 1 two more times". When the data is highly repetitive this saves a lot of space (see the sketch at the end of this subsection).

(2) After a "row" has been converted into multiple "columns" and cached in the respective buffers, two checks are made: does the cached "column" data (i.e. all the data in columnBuffers) exceed the upper threshold columnsBufferSize? Does the number of cached "rows" exceed the upper threshold RECORD_INTERVAL? If either condition holds, enough data is considered to have been cached, and the buffers are flushed ("spilled") into a row split (record).

These two thresholds (columnsBufferSize, RECORD_INTERVAL) also remind us that they may need tuning for the actual application.

The "spill" is carried out by flushRecords, which is arguably the most complex operation in the whole RCFile write path.
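Here is a small sketch of the length-encoding trick mentioned above. The class and method names are invented for the example; in the Hive source the logic lives inside the ColumnBuffer/valLenBuffer handling.

```java
import java.util.ArrayList;
import java.util.List;

public class ValueLengthRunLength {

  // Encode a list of column-value lengths, e.g. [1, 1, 1, 2] -> [1, ~2, 2],
  // where a negative entry ~n (bitwise complement) means "repeat the previous
  // length n more times".
  static List<Integer> encode(int[] lengths) {
    List<Integer> out = new ArrayList<>();
    int prev = -1;
    int run = 0;
    for (int len : lengths) {
      if (len == prev) {
        run++;                       // same as previous length: extend the run
      } else {
        if (run > 0) out.add(~run);  // flush pending repetitions as ~run
        out.add(len);                // start a new run with the literal length
        prev = len;
        run = 0;
      }
    }
    if (run > 0) out.add(~run);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(encode(new int[] {1, 1, 1, 2}));  // prints [1, -3, 2]; -3 == ~2
  }
}
```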
As mentioned earlier, an RCFile record (row split) consists of a key and a value. The "column" data is already cached in the columnBuffers, so where does the key come from?

The key is in fact the metadata of the row split (record), and can also be understood as the row split's index. It is represented by KeyBuffer:

columnNumber: the number of columns;
numberRows: how many "rows" of data this RCFile record (row split) stores; different records within the same RCFile may store different numbers of rows.

The RCFile record's value is simply the column values from the columnBuffers mentioned above (possibly compressed), and the metadata of these columnBuffers is represented by the following three variables:

eachColumnValueLen: eachColumnValueLen[i] is the total size of the column data cached in columnBuffers[i] as it will be written out (i.e. the compressed size when compression is enabled);
eachColumnUncompressedValueLen: eachColumnUncompressedValueLen[i] is the total size of the raw (uncompressed) column data cached in columnBuffers[i]; without compression the two values are the same;
allCellValLenBuffer: allCellValLenBuffer[i] holds the respective lengths of the individual column values in columnBuffers[i] (note the length-saving trick mentioned above).

After the KeyBuffer is serialized, its structure is as follows:
numberRows                   number_of_rows_in_this_record (vint)
columnValueLen               column_1_ondisk_compressed_length (vint)
columnUncompressedValueLen   column_1_ondisk_uncompressed_length (vint)
                             column_1_row_1_value_plain_length
                             column_1_row_2_value_plain_length
                             ...
columnValueLen               column_2_ondisk_compressed_length (vint)
columnUncompressedValueLen   column_2_ondisk_uncompressed_length (vint)
                             column_2_row_1_value_plain_length
                             column_2_row_2_value_plain_length
                             ...
Why can this metadata serve as an index?

Notice the repeated columnValueLen (columnUncompressedValueLen) entries above: each one holds the total length of one column (column cluster) inside the record's value, and each is followed by the lengths of the individual column values within that column (cluster). If we only need to read the data of column n, we can use the columnValueLen (columnUncompressedValueLen) entries to skip directly over the data of the first n-1 columns in the record value.

The KeyBuffer data is built during the "spill". Let's analyze the concrete logic of flushRecords in detail.

key is the KeyBuffer instance; at this point the metadata records, among other things, how many rows this row split (record) contains.

The next piece of code matters in the compressed scenario: it builds a buffer, valueBuffer, and uses the "decorator" pattern to wrap it in a compressed output stream, so that data written later from the columnBuffers into valueBuffer gets compressed (you will see this process below).

The next step is to process the data in the columnBuffers. Briefly, for one columnBuffers[i] two things are done: (1) if compression is used, the data of columnBuffers[i] is written into valueBuffer through the compressed output stream deflateOut; (2) a number of related variables are maintained.

The code looks long, but for one columnBuffers[i] it boils down to four steps:

(1) if compression is used, write all the data of columnBuffers[i] to deflateOut (and hence, ultimately, to valueBuffer);
(2) record colLen, the length of columnBuffers[i] after compression; if compression is not used, this equals the length of the raw data;
(3) record the metadata of columnBuffers[i]: the compressed/uncompressed length of its data, and the length of each column value in it;
(4) maintain plainTotalColumnLength and comprTotalColumnLength.

Up to this point, all the metadata of one record (row split) has been built, and if compression is enabled the data in the columnBuffers has been compressed and written into valueBuffer. What follows is the "persistence" of the record's key and value.

(1) Write out the key.

i. checkAndWriteSync

Why is this "sync" needed first? Suppose we have a "big" text file that needs to be analyzed with Hadoop MapReduce. Before the job is submitted, MapReduce cuts the file into "splits" according to the split size (say 128 MB), and each map task handles one split (we ignore the case of multiple splits here), i.e. part of the file's data. A text file is stored line by line, so when a map task starts reading at the beginning of a split, how does it locate the start of a record (line)? After all, splits are cut purely by byte size, and a line may well be cut in the middle. This is where a "sync" comes in: it is a marker that lets us identify the start of a record. For a text file, the "sync" is the newline character. So when a map task starts reading at the beginning of a split, it first skips data until it meets a newline and only then starts reading records; and if, after reading a line, the "file cursor" has moved past the end of the split, reading stops.

RCFile needs such a "sync" too. For a text file there is effectively a "sync" per line; RCFile stores data in units of records, but it does not put a "sync" before every record. Instead there is a minimum interval, SYNC_INTERVAL, between two "sync"s:

SYNC_INTERVAL = 100 * (4 + 16)

Before starting to output the next record, the writer computes the offset of the current output position relative to the last "sync"; if it exceeds SYNC_INTERVAL, a "sync" is written.

So what is this "sync"? RCFile's "sync" is a random 16-byte string (generated by hashing a UID; the generation process is not discussed here).

ii. Write the total record length and the key portion length.

iii. Write keyLength and keyBuffer.

Note that the keyLength here differs from the key length in step ii: the key length in ii is the length of the raw KeyBuffer data, while the keyLength in iii is the length of the KeyBuffer data after compression; without compression the two values are the same.

Before this code there is a compression step for the KeyBuffer (if compression is enabled); it is similar to the ColumnBuffer compression and is not covered here.

As can be seen from the code, the record key (KeyBuffer) is preceded by a structure that is effectively the record header:
recordLen           record length in bytes
keyLength           key length in bytes
compressedKeyLen    compressed key length in bytes
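A minimal sketch of the sync check and record header just described might look like the following. The names and bookkeeping are simplified and illustrative, not the Hive source; the -1 escape written before the sync bytes follows the convention described for SequenceFile-style formats.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RecordHeaderSketch {
  static final int SYNC_HASH_SIZE = 16;
  static final int SYNC_INTERVAL = 100 * (4 + SYNC_HASH_SIZE);  // 2000 bytes

  long lastSyncPos = 0;
  final byte[] sync = new byte[SYNC_HASH_SIZE];   // filled from a UID hash in the real writer
  final DataOutputStream out;

  RecordHeaderSketch(DataOutputStream out) { this.out = out; }

  // Write a sync marker whenever SYNC_INTERVAL bytes have passed since the last one.
  void checkAndWriteSync() throws IOException {
    if (out.size() - lastSyncPos >= SYNC_INTERVAL) {
      out.writeInt(-1);          // sync escape marker
      out.write(sync);           // the 16-byte sync itself
      lastSyncPos = out.size();
    }
  }

  // Write the three length fields that precede the key of each record.
  void writeRecordHeader(int recordLen, int keyLength, int compressedKeyLen) throws IOException {
    checkAndWriteSync();
    out.writeInt(recordLen);          // total record length in bytes
    out.writeInt(keyLength);          // uncompressed key length
    out.writeInt(compressedKeyLen);   // compressed key length (== keyLength when not compressed)
  }

  public static void main(String[] args) throws IOException {
    DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
    RecordHeaderSketch w = new RecordHeaderSketch(out);
    w.writeRecordHeader(1234, 200, 120);   // made-up lengths
    System.out.println("bytes written: " + out.size());
  }
}
```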
(2) Write out the value. If compression is enabled, the already-compressed data in valueBuffer is written out directly; if compression is not enabled, the data in each of the columnBuffers is written out one by one (see the sketch after the layout below). The structure of the RCFile record value is simply the column values of each "column" (cluster), as follows:
Column_1_row_1_value
Column_1_row_2_value
...
Column_2_row_1_value
Column_2_row_2_value
...
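As a rough sketch of this write-out (the names are illustrative; the real writer works with its ColumnBuffer instances and compression streams):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ValueWriteSketch {
  // columnBuffers.get(i) holds the plain bytes of column i for this row split;
  // valueBuffer holds the same data after compression (when enabled).
  static void writeValue(DataOutputStream out,
                         boolean compressed,
                         ByteArrayOutputStream valueBuffer,
                         List<byte[]> columnBuffers) throws IOException {
    if (compressed) {
      valueBuffer.writeTo(out);            // one compressed blob for the whole row split
    } else {
      for (byte[] column : columnBuffers) {
        out.write(column);                 // column 1 values, then column 2 values, ...
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(raw);
    List<byte[]> columnBuffers =
        Arrays.asList("row1_col1row2_col1".getBytes(), "row1_col2row2_col2".getBytes());
    writeValue(out, false, new ByteArrayOutputStream(), columnBuffers);
    System.out.println("value bytes written: " + out.size());
  }
}
```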
At this point we have finished outputting one row split (record). Finally, the record state is cleared in preparation for caching and outputting the next row split (record).

3. Close

Closing an RCFile roughly takes two steps: (1) if there is still data in the buffers, call flushRecords to "spill" it; (2) close the file output stream.

Code Example

1. Write

(1) Construct the Writer instance; note that the RCFile's column count must be set in the Hadoop Configuration beforehand through the property hive.io.rcfile.column.number.conf.
(2) Build several rows of data, each row represented by a BytesRefArrayWritable instance.
(3) Append the rows through the writer.
(4) Close the writer.

2. Read

When reading, note that RCFileRecordReader's constructor requires a "split" (FileSplit); if we need to read the whole file, we make the whole file a single split. Once the RCFileRecordReader instance is built, key/value pairs can be iterated with next(), where the key is the row number and the value is the row record. A combined write/read sketch is given below.

What if we only need to read the data of columns 1 and 3? With the column projection set accordingly, three columns are still returned per row, but the data of column 2 comes back as an empty ("null") value.
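Below is a combined write/read sketch along the lines described above. It assumes Hive's RCFile classes (org.apache.hadoop.hive.ql.io.RCFile, RCFileRecordReader, BytesRefArrayWritable); the file path, column count, and row contents are made up for the example, and exact constructor/method signatures can differ between Hive versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.ql.io.RCFileRecordReader;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;

public class RcFileWriteReadExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/demo.rcfile");   // hypothetical path
    int columnNumber = 3;

    // ---- 1. Write ----
    // The column count must be set before the Writer is constructed.
    conf.setInt("hive.io.rcfile.column.number.conf", columnNumber);
    RCFile.Writer writer = new RCFile.Writer(fs, conf, file);

    // Build 9 rows x 3 columns, matching the example at the top of the article.
    for (int row = 1; row <= 9; row++) {
      BytesRefArrayWritable cols = new BytesRefArrayWritable(columnNumber);
      for (int col = 1; col <= columnNumber; col++) {
        byte[] cell = ("row" + row + "_col" + col).getBytes("UTF-8");
        cols.set(col - 1, new BytesRefWritable(cell, 0, cell.length));
      }
      writer.append(cols);   // one append() call == one "row"
    }
    writer.close();

    // ---- 2. Read ----
    // RCFileRecordReader works on a FileSplit; to read the whole file,
    // turn the whole file into a single split.
    long len = fs.getFileStatus(file).getLen();
    FileSplit split = new FileSplit(file, 0, len, (String[]) null);
    RCFileRecordReader<LongWritable, BytesRefArrayWritable> reader =
        new RCFileRecordReader<LongWritable, BytesRefArrayWritable>(conf, split);

    LongWritable rowId = reader.createKey();
    BytesRefArrayWritable value = reader.createValue();
    while (reader.next(rowId, value)) {
      StringBuilder line = new StringBuilder();
      for (int i = 0; i < value.size(); i++) {
        BytesRefWritable cell = value.get(i);
        line.append(new String(cell.getData(), cell.getStart(), cell.getLength(), "UTF-8"));
        if (i < value.size() - 1) {
          line.append('\t');
        }
      }
      System.out.println(rowId + ": " + line);
    }
    reader.close();
  }
}
```

To read only a subset of columns (for example columns 1 and 3 in the scenario above), the column projection must be set on the Configuration before the reader is constructed; in Hive this typically goes through ColumnProjectionUtils, which sets the hive.io.file.readcolumn.ids property (e.g. to "0,2", since the ids are zero-based). The reader then still returns three columns per row, but the data for column 2 comes back as an empty ("null") value, as noted above.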
