File-based data structures
Two file formats:
1. SequenceFile
2. MapFile
SequenceFile
1. A SequenceFile is a flat file designed by Hadoop to store binary <key, value> pairs.
2. A SequenceFile can be used as a container: packing many small files into one SequenceFile lets them be stored and processed efficiently.
3. SequenceFile does not sort records by their keys; its inner class SequenceFile.Writer only provides append functionality.
4. The key and value in a SequenceFile can be any Writable type, including user-defined Writable types.
SequenceFile Compression
1. The internal format of a SequenceFile depends on whether compression is enabled, and if so, whether it is record compression or block compression.
2. Three types:
A. Uncompressed: if compression is not enabled (the default), each record consists of its record length (in bytes), the key length, the key, and the value. Each length field is four bytes.
B. Record compression: the same layout as the uncompressed format, except that the value bytes are compressed with the codec defined in the header. Note that the key is not compressed.
C. Block compression: compresses multiple records at once, so it is more compact than record compression and is generally preferred. Records are collected until the block reaches a minimum size, defined by the io.seqfile.compress.blocksize property (default 1000000 bytes). The block format is: record count, key lengths, keys, value lengths, values.
(Figure: uncompressed and record-compressed record formats)
(Figure: block-compressed format)
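The uncompressed record layout described above can be sketched in plain Java. This is an illustrative simplification, not the real on-disk format (an actual SequenceFile also has a file header, metadata, and sync markers); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of the uncompressed record layout:
// record length (4 bytes), key length (4 bytes), key bytes, value bytes.
public class RecordLayoutSketch {
    static byte[] encodeRecord(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(key.length + value.length); // record length: key + value bytes
            out.writeInt(key.length);                // key length
            out.write(key);                          // key, uncompressed
            out.write(value);                        // value, uncompressed
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);       // cannot happen for an in-memory stream
        }
    }

    public static void main(String[] args) {
        byte[] key = "k1".getBytes(StandardCharsets.UTF_8);               // 2 bytes
        byte[] value = "apache,software".getBytes(StandardCharsets.UTF_8); // 15 bytes
        // 4 (record length) + 4 (key length) + 2 + 15 = 25 bytes total
        System.out.println(encodeRecord(key, value).length);
    }
}
```

Record compression would apply the codec only to the value bytes before writing them; block compression batches many records and compresses keys and values as separate blocks.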
Advantages of the SequenceFile format:
A. Supports data compression based on records or blocks.
B. Splittable, so it can serve as input splits for MapReduce.
C. Simple to modify: the main work is in the business logic, not in the details of the storage format.
Disadvantages of the SequenceFile format:
A merge step is needed to pack the files, and the merged file cannot be viewed directly because it is binary.
Read/write SequenceFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a SequenceFile.Reader for reading
5) Get the keyClass and valueClass and iterate over the records
6) Close the stream
From the org.apache.hadoop.io class SequenceFile: there are three SequenceFile Writers, based on the SequenceFile.CompressionType used to compress key/value pairs:
1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
No compression, record compression, and block compression example
package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;

public class Demo01 {
    final static String uri = "hdfs://liguodong:8020/liguodong";
    final static String[] data = {"Apache,software", "Chinese,good", "James,nba", "Index,pass"};

    public static void main(String[] args) throws IOException {
        // 1) Create a configuration
        Configuration configuration = new Configuration();
        // 2) Get the filesystem
        FileSystem fs = FileSystem.get(URI.create(uri), configuration);
        // 3) Create the file path
        Path path = new Path("/tmp.seq");
        write(fs, configuration, path);
        read(fs, configuration, path);
    }

    public static void write(FileSystem fs, Configuration configuration, Path path) throws IOException {
        IntWritable key = new IntWritable();
        Text value = new Text();

        // 4) Create the SequenceFile.Writer
        // No compression:
        // SequenceFile.Writer writer = SequenceFile.createWriter(
        //         fs, configuration, path, key.getClass(), value.getClass());

        // Record compression:
        @SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, configuration, path, key.getClass(), value.getClass(),
                CompressionType.RECORD, new BZip2Codec());

        // Block compression:
        // SequenceFile.Writer writer = SequenceFile.createWriter(
        //         fs, configuration, path, key.getClass(), value.getClass(),
        //         CompressionType.BLOCK, new BZip2Codec());

        // 5) Append records
        for (int i = 0; i < 30; i++) {
            key.set(100 - i);
            value.set(data[i % data.length]);
            writer.append(key, value);
        }
        // 6) Close the stream
        IOUtils.closeStream(writer);
    }

    public static void read(FileSystem fs, Configuration configuration, Path path) throws IOException {
        // 4) Create the SequenceFile.Reader
        @SuppressWarnings("deprecation")
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, configuration);
        // 5) Get the key and value classes and iterate over the records
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), configuration);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), configuration);
        while (reader.next(key, value)) {
            System.out.println("key = " + key);
            System.out.println("value = " + value);
            System.out.println("position = " + reader.getPosition());
        }
        // 6) Close the stream
        IOUtils.closeStream(reader);
    }
}
Execution result:
key = 100  value = Apache,software  position = 164
key = 99   value = Chinese,good     position = 197
key = 98   value = James,nba        position = 227
key = 97   value = Index,pass       position = 258
key = 96   value = Apache,software  position = 294
key = 95   value = Chinese,good     position = 327
...
key = 72   value = Apache,software  position = 1074
key = 71   value = Chinese,good     position = 1107
MapFile
public class MapFile {
    /** The name of the index file. */
    public static final String INDEX_FILE_NAME = "index";
    /** The name of the data file. */
    public static final String DATA_FILE_NAME = "data";
}
A MapFile is a sorted, indexed SequenceFile that supports lookups by key.
Unlike SequenceFile, the MapFile key must implement the WritableComparable interface, i.e. the keys must be comparable; the value is a Writable type.
The MapFile.fix() method can be used to rebuild the index, converting a SequenceFile into a MapFile.
It has two static member variables:
static final String INDEX_FILE_NAME
static final String DATA_FILE_NAME
Looking at its folder structure, we can see that a MapFile consists of two parts: data and index.
The index is an index of the data file; it records the key of each indexed record and that record's offset in the data file.
When the MapFile is accessed, the index file is loaded into memory, and the index mapping is used to quickly locate the position of the requested record in the data file.
So, compared with SequenceFile, MapFile retrieval is efficient; the disadvantage is that some memory is consumed to store the index data.
Note that the MapFile does not index every record: by default it stores one index entry for every 128 records. This interval can be changed with MapFile.Writer's setIndexInterval() method, or by changing the io.map.index.interval property.
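The sparse-index scheme above can be modeled in plain Java. This is a toy in-memory sketch to illustrate the idea (the class name and structure are hypothetical; a real MapFile stores the index and data as two SequenceFiles on HDFS): only every 128th key is indexed, and a lookup binary-searches the index, then scans forward in the data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of MapFile's sparse index (hypothetical names, in-memory only).
public class SparseIndexSketch {
    static final int INDEX_INTERVAL = 128; // mirrors io.map.index.interval

    long[] keys;                             // sorted keys in the "data" part
    List<Integer> index = new ArrayList<>(); // positions of every 128th record

    SparseIndexSketch(long[] sortedKeys) {
        this.keys = sortedKeys;
        for (int i = 0; i < sortedKeys.length; i += INDEX_INTERVAL) {
            index.add(i); // only 1 of every 128 keys is indexed
        }
    }

    // Lookup: binary search the in-memory index for the last indexed key
    // <= the target, then scan at most INDEX_INTERVAL records forward.
    int get(long key) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[index.get(mid)] <= key) { start = index.get(mid); lo = mid + 1; }
            else hi = mid - 1;
        }
        for (int i = start; i < keys.length && i < start + INDEX_INTERVAL; i++) {
            if (keys[i] == key) return i; // position of the record in "data"
        }
        return -1; // key not present
    }

    public static void main(String[] args) {
        long[] ks = new long[1000];
        for (int i = 0; i < 1000; i++) ks[i] = i * 2L; // even keys 0..1998
        SparseIndexSketch m = new SparseIndexSketch(ks);
        System.out.println(m.index.size()); // 8 index entries for 1000 records
        System.out.println(m.get(500));     // key 500 is record 250
        System.out.println(m.get(501));     // odd key absent -> -1
    }
}
```

This shows the trade-off mentioned above: a larger interval uses less memory for the index but lengthens the sequential scan after the binary search.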
Read/write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a MapFile.Reader for reading
5) Get the keyClass and valueClass and iterate over the records
6) Close the stream
The detailed operations are similar to SequenceFile.
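The steps above can be sketched with the same deprecated Hadoop 1.x-style constructors used in the SequenceFile example. This is a hedged sketch, not tested against a live cluster; the host name, paths, and record contents are placeholders.

```java
package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://liguodong:8020/liguodong"), conf);
        String dir = "/tmp.map"; // a MapFile is a directory holding "data" and "index"

        // Write: keys must be appended in sorted order, or append() throws IOException.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
        IntWritable key = new IntWritable();
        Text value = new Text();
        for (int i = 0; i < 30; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
        }
        IOUtils.closeStream(writer);

        // Read: get() uses the in-memory index to seek directly to a key.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        if (reader.get(new IntWritable(7), value) != null) {
            System.out.println("value = " + value);
        }
        IOUtils.closeStream(reader);
    }
}
```

The sorted-keys requirement is the key difference from the SequenceFile example, where keys were appended in descending order; MapFile.Writer would reject that input.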
Viewing binary files from the command line
hdfs dfs -text /liguodong/tmp.seq