Reprinted from: https://my.oschina.net/leejun2005/blog/280896
Hadoop, as an open-source implementation of MapReduce, has always counted among its advantages the ability to parse file formats dynamically at run time and to load data several times faster than an MPP database. At the same time, the MPP database community has long criticized Hadoop on the grounds that its file formats are not purpose-built, so the cost of serialization and deserialization is too high.
1. Introduction to Hadoop file formats
There are several types of file formats prevalent in Hadoop today:
(1) SequenceFile
SequenceFile is a binary file format provided by the Hadoop API that serializes data into files as <key, value> pairs. Internally, it uses Hadoop's standard Writable interface for serialization and deserialization, and it is compatible with MapFile in the Hadoop API. The SequenceFile used in Hive inherits from the Hadoop API's SequenceFile, but its key is empty and the actual data is stored in the value, which avoids sorting during the map phase of a MapReduce run. If you write a SequenceFile with the Java API and want Hive to read it, be sure to store the data in the value field; otherwise you will need to write custom InputFormat and OutputFormat classes to read that SequenceFile.
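For example, here is a minimal sketch (not from the original post; it assumes the older SequenceFile.createWriter signature that matches the CDH4-era code later in this article, and the output path and row contents are made up) of writing a SequenceFile that keeps the key empty and puts the delimited row text in the value, so that Hive can read it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/demo.seq"); // made-up path

        // Empty key, data in the value: the layout Hive expects
        // when reading a SequenceFile-backed table.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, BytesWritable.class, Text.class);
        try {
            writer.append(new BytesWritable(), new Text("1|alice|30"));
            writer.append(new BytesWritable(), new Text("2|bob|25"));
        } finally {
            writer.close();
        }
    }
}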
(2) RCFile
RCFile is a dedicated column-oriented data format introduced by Hive. It follows the design principle of "partition horizontally first, then vertically". During a query it skips the I/O for columns it does not need. Note, however, that when an RCFile is copied from a remote node in the map phase, the whole data block is still copied. Once the block is in the local directory, RCFile does not literally skip the unneeded columns and jump straight to the columns to be read; it achieves this by scanning the header of each row group. Because the header at the level of the whole HDFS block does not record in which row group each column starts and ends, RCFile does not perform as well as SequenceFile when all columns are read.
(Figure: example of row-oriented storage within an HDFS block)
(Figure: example of column-oriented storage within an HDFS block)
(Figure: example of the RCFile layout within an HDFS block)
(3) Avro
Avro is a binary file format designed to support data-intensive applications. Its file format is compact, and Avro provides good serialization and deserialization performance when reading large amounts of data. Avro data files carry their schema with them, so developers do not need to implement their own Writable objects at the API level. Recent Hadoop sub-projects, such as Pig, Hive, Flume, Sqoop and HCatalog, support the Avro data format.
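As a small illustration (not from the original post; it assumes the standard Avro Java library, and the file name and record schema are invented for the example), the sketch below writes an Avro data file whose schema is embedded in the file header, with no custom Writable required:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is stored in the file itself, so readers need no extra code.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Sale\",\"fields\":["
              + "{\"name\":\"item\",\"type\":\"string\"},"
              + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("item", "book");
        rec.put("amount", 12.5);

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("sales.avro")); // schema travels with the data
        writer.append(rec);
        writer.close();
    }
}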
(4) Text Format
In addition to the three binary formats above, text-format data is also frequently encountered in Hadoop, for example TextFile, XML and JSON. Besides consuming more disk space, text typically costs dozens of times more to parse than binary formats, and XML and JSON are even more expensive to parse than TextFile, so storing data in these formats in a production system is strongly discouraged. If these formats must be produced, do the conversion on the client side. Text formats are commonly used for log collection and database import; Hive's default configuration also uses the text format, and it is easy to forget to enable compression, so make sure the right format and compression are used. Another drawback of text is that it carries no types or schema: for data such as sales amounts, profit values or date-time values, the string representations vary in length and may contain negative signs, so MapReduce cannot sort them directly. They therefore often have to be preprocessed into binary formats that carry a schema, which introduces unnecessary preprocessing steps and wastes storage resources.
(5) External format
Hadoop can in fact support any file format, as long as the corresponding RecordWriter and RecordReader are implemented. Database formats are also often stored in Hadoop, for example HBase, MySQL, Cassandra and MongoDB. These are typically used to avoid moving large amounts of data and to allow fast loading. Their serialization and deserialization are handled by the clients of those database formats, and the storage location and data layout of the files are not controlled by Hadoop; their file splits are also not cut according to the HDFS block size (blocksize).
2. Why RCFile is needed
Facebook introduced its data warehouse Hive at the ICDE (IEEE International Conference on Data Engineering). Hive stores massive amounts of data on a Hadoop system and provides database-like mechanisms for storing and processing that data. It uses a SQL-like language to manage and process data automatically: statements are parsed and transformed into Hadoop-based MapReduce tasks, and executing those tasks completes the data processing. (Figure: system architecture of the Hive data warehouse.)
The storage scalability challenge Facebook faces in its data warehouse is unique. It stores more than 300 PB of data in its Hive-based warehouse, with roughly 600 TB of new data arriving every day, and the volume of data stored in the warehouse tripled over the past year. Given this growth trend, storage efficiency is, now and for the foreseeable future, one of the chief concerns of the Facebook data warehouse infrastructure. The paper published by Facebook engineers, "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems", describes an efficient data placement structure, RCFile (Record Columnar File), and its application in Facebook's data warehouse Hive. Compared with the storage structures of traditional databases, RCFile more effectively satisfies the four key requirements of a MapReduce-based data warehouse: fast data loading, fast query processing, highly efficient use of storage space, and strong adaptivity to highly dynamic workload patterns. RCFile is widely used in Hive, Facebook's data analysis system. First, RCFile matches the data loading speed and load adaptability of row storage; second, its read optimization avoids unnecessary column reads when scanning a table, and tests show that in most cases it performs better than the other structures; finally, RCFile compresses data along the column dimension, so it uses storage space more efficiently.
To improve storage space utilization, data produced by Facebook's product-line applications has been stored in the RCFile structure since 2010, and data sets previously stored in row-oriented structures (SequenceFile/TextFile) have also been converted to the RCFile format. Yahoo has likewise integrated RCFile into the Pig data analysis system, and RCFile is being used in another Hadoop-based data management system, Howl (http://wiki.apache.org/pig/Howl). Furthermore, according to exchanges in the Hive development community, RCFile has been successfully integrated into other MapReduce-based data analysis platforms. It is reasonable to believe that RCFile, as a data storage standard, will continue to play an important role in large-scale data analysis in MapReduce environments.
3. Introduction to RCFile
When data in the Facebook data warehouse is loaded into tables, the first storage format used is RCFile (Record Columnar File), which Facebook developed itself. RCFile is a hybrid columnar storage format: it still lets queries proceed row by row while providing the compression efficiency of column storage. Its core idea is to partition a Hive table horizontally into row groups, and then split each group vertically by column, so that the values of each column are stored as contiguous blocks on disk.
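The toy sketch below (purely illustrative, not part of RCFile's actual implementation; the class, data and the tiny row-group size are made up) shows the "horizontal first, then vertical" idea: rows are grouped into row groups, and within each group the values of one column are laid out contiguously:

import java.util.Arrays;
import java.util.List;

public class RowGroupLayoutDemo {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] { "1", "alice", "US" },
                new String[] { "2", "bob",   "CN" },
                new String[] { "3", "carol", "BR" },
                new String[] { "4", "dave",  "US" });
        int rowGroupSize = 2; // rows per group; real RCFile row groups are sized in MB

        for (int start = 0; start < rows.size(); start += rowGroupSize) {
            int end = Math.min(start + rowGroupSize, rows.size());
            System.out.println("row group [" + start + ", " + end + ")");
            int numCols = rows.get(0).length;
            for (int c = 0; c < numCols; c++) {
                // one contiguous run of values per column inside the group
                StringBuilder col = new StringBuilder("  col" + c + ":");
                for (int r = start; r < end; r++) {
                    col.append(' ').append(rows.get(r)[c]);
                }
                System.out.println(col);
            }
        }
    }
}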
Once all columns within a row group have been written to disk, RCFile compresses the data column by column using algorithms such as zlib or LZO. When column data is read, a lazy decompression policy is applied: if a query touches only a subset of a table's columns, RCFile skips decompressing and deserializing the columns that are not needed. In a representative experiment chosen from the Facebook data warehouse, RCFile delivered a compression ratio of 5x.
4. Beyond RCFile: what comes next
As the amount of data stored in the warehouse keeps growing, engineers on the Facebook team began studying techniques for improving compression efficiency. The research focuses on column-level encoding methods such as run-length encoding, dictionary encoding and frame-of-reference encoding, numerical encoding methods that can reduce logical redundancy at the column level before the general-purpose compression step. Facebook has also experimented with new column types (for example, JSON is a widely used format inside Facebook; storing JSON data in a structured way both satisfies the need for efficient queries and reduces the storage redundancy of JSON metadata). Facebook's experiments show that, applied appropriately, column-level encoding can significantly improve RCFile's compression ratio.
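As a toy illustration of what column-level encoding buys (this is not Facebook's implementation; the class and the sample column are made up), the run-length encoder below collapses repeated values of a column into (value, count) pairs before any general-purpose codec such as zlib or LZO is applied:

import java.util.ArrayList;
import java.util.List;

public class RunLengthDemo {

    // Encode one column as (value, run length) pairs.
    static List<String[]> encode(String[] column) {
        List<String[]> runs = new ArrayList<String[]>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j].equals(column[i])) {
                j++;
            }
            runs.add(new String[] { column[i], String.valueOf(j - i) });
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] country = { "US", "US", "US", "CN", "CN", "BR" };
        for (String[] run : encode(country)) {
            System.out.println(run[0] + " x " + run[1]);
        }
        // prints US x 3, CN x 2, BR x 1 -- far fewer tokens than the raw column
    }
}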
At the same time, Hortonworks was experimenting with similar ideas for improving Hive's storage format. Its engineering team designed and implemented ORCFile (including the storage format and the read/write interfaces), which has been a great help to the Facebook data warehouse in designing and implementing its own new storage format.
For an introduction to ORCFile, see: http://yanbohappy.sinaapp.com/?p=478
As for a performance evaluation, I am not yet in a position to run one myself; the figures referenced here come from a speaker's slides at a Hive technical summit.
5. How to generate RCFile files
After all of the above, you presumably understand that RCFile exists mainly to improve the efficiency of Hive queries. So how do you generate files in this format?
(1) Convert directly in Hive by inserting from a TEXTFILE table
For example, assuming the target table http_rctable has been created with STORED AS RCFILE and the source is a TEXTFILE-format table (the source table name was lost in the original; http_text is used here as a placeholder):
INSERT OVERWRITE TABLE http_rctable PARTITION (dt='2013-09-30')
SELECT * FROM http_text WHERE dt='2013-09-30';
(2) Generate it with MapReduce
So far MapReduce does not provide a built-in API for RCFile, although other projects in the Hadoop ecosystem, such as Pig, Hive and HCatalog, already support it; the reason is that, compared with other file formats such as TextFile, RCFile offers no significant advantage in MapReduce's typical application scenarios.
To avoid reinventing the wheel, the MapReduce code below that generates RCFile calls classes from Hive and HCatalog. Note that to test this code, your Hadoop, Hive and HCatalog versions must be consistent; otherwise... you know...
For example, I use hive-0.10.0+198-1.cdh4.4.0, so you should download the matching versions from: http://archive.cloudera.com/cdh4/cdh/4/
PS: The following code has been tested and runs without problems.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hcatalog.rcfile.RCFileMapReduceInputFormat;
import org.apache.hcatalog.rcfile.RCFileMapReduceOutputFormat;

public class TextToRCFile extends Configured implements Tool {

    public static class Map
            extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

        private byte[] fieldData;
        private int numCols;
        private BytesRefArrayWritable bytes;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            numCols = context.getConfiguration().getInt("hive.io.rcfile.column.number.conf", 0);
            bytes = new BytesRefArrayWritable(numCols);
        }

        public void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            bytes.clear();
            String[] cols = line.toString().split("\\|");
            System.out.println("SIZE: " + cols.length);
            for (int i = 0; i < numCols; i++) {
                fieldData = cols[i].getBytes("UTF-8");
                BytesRefWritable cu = null;
                cu = new BytesRefWritable(fieldData, 0, fieldData.length);
                bytes.set(i, cu);
            }
            context.write(NullWritable.get(), bytes);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.out.println("Usage: hadoop jar RCFileLoader.jar <main class> "
                    + "-tableName <tableName> -numCols <numberOfColumns> -input <input path> "
                    + "-output <output path> -rowGroupSize <rowGroupSize> -ioBufferSize <ioBufferSize>");
            System.out.println("For test");
            System.out.println("$HADOOP jar RCFileLoader.jar edu.osu.cse.rsam.rcfile.mapreduce.LoadTable "
                    + "-tableName test1 -numCols 10 -input RCFileLoaderTest/test1 "
                    + "-output RCFileLoaderTest/RCFile_test1");
            System.out.println("$HADOOP jar RCFileLoader.jar edu.osu.cse.rsam.rcfile.mapreduce.LoadTable "
                    + "-tableName test2 -numCols 5 -input RCFileLoaderTest/test2 "
                    + "-output RCFileLoaderTest/RCFile_test2");
            return 2;
        }

        /* For test */
        String tableName = "";
        int numCols = 0;
        String inputPath = "";
        String outputPath = "";
        int rowGroupSize = 16 * 1024 * 1024;
        int ioBufferSize = 128 * 1024;
        for (int i = 0; i < otherArgs.length - 1; i++) {
            if ("-tableName".equals(otherArgs[i])) {
                tableName = otherArgs[i + 1];
            } else if ("-numCols".equals(otherArgs[i])) {
                numCols = Integer.parseInt(otherArgs[i + 1]);
            } else if ("-input".equals(otherArgs[i])) {
                inputPath = otherArgs[i + 1];
            } else if ("-output".equals(otherArgs[i])) {
                outputPath = otherArgs[i + 1];
            } else if ("-rowGroupSize".equals(otherArgs[i])) {
                rowGroupSize = Integer.parseInt(otherArgs[i + 1]);
            } else if ("-ioBufferSize".equals(otherArgs[i])) {
                ioBufferSize = Integer.parseInt(otherArgs[i + 1]);
            }
        }

        conf.setInt("hive.io.rcfile.record.buffer.size", rowGroupSize);
        conf.setInt("io.file.buffer.size", ioBufferSize);

        Job job = new Job(conf, "RCFile loader: loading table " + tableName + " with " + numCols + " columns");
        job.setJarByClass(TextToRCFile.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(BytesRefArrayWritable.class);
        // job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(inputPath));

        job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
        RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);
        RCFileMapReduceOutputFormat.setOutputPath(job, new Path(outputPath));
        RCFileMapReduceOutputFormat.setCompressOutput(job, false);

        System.out.println("Loading table " + tableName + " from " + inputPath
                + " to RCFile located at " + outputPath);
        System.out.println("Number of columns: "
                + job.getConfiguration().get("hive.io.rcfile.column.number.conf"));
        System.out.println("RCFile row group size: "
                + job.getConfiguration().get("hive.io.rcfile.record.buffer.size"));
        System.out.println("IO buffer size: "
                + job.getConfiguration().get("io.file.buffer.size"));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TextToRCFile(), args);
        System.exit(res);
    }
}
6. References
(1) Hadoop file format analysis: http://www.infoq.com/cn/articles/hadoop-file-format
(2) Facebook data warehouse revealed: the RCFile efficient storage structure: http://www.csdn.net/article/2011-04-29/296900
(3) How the Facebook data warehouse scaled to 300 PB: http://yanbohappy.sinaapp.com/?p=478
(4) Hive architecture: http://www.jdon.com/bigdata/hive.html
(5) Hive: the ORC File storage format in detail: http://www.iteblog.com/archives/1014
(6) A generic class for converting plain text into RCFile: https://github.com/ysmart-xx/ysmart/blob/master/javatest/TextToRCFile.java
(7) Writing MapReduce code against Hive file formats: http://hugh-wangp.iteye.com/blog/1405804
(8) A generic class for compressing plain text into RCFile: http://smallboby.iteye.com/blog/1596776
(9) RCFile storage and read operations: http://smallboby.iteye.com/blog/1592531
(10) https://github.com/kevinweil/elephant-bird/blob/master/rcfile/src/main/java/com/twitter/elephantbird/mapreduce/output/RCFileOutputFormat.java
(11) Hive file formats: an introduction to RCFile and its application: http://blog.csdn.net/liuzhoulong/article/details/7909863