Concept: SequenceFile is a flat storage file consisting of a binary-serialized key/value byte stream, and it can be used as the input/output format of a map/reduce job. During a map/reduce job, the temporary outputs of the map phase are stored using SequenceFile. So, in general, SequenceFiles are raw files generated in the FileSystem to be consumed by map invocations.
1. SequenceFile features: it is an important Hadoop data file type that provides key/value storage. Unlike traditional key/value stores (such as hash tables or B-trees), it is append-only: you cannot overwrite a key that has already been written.
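As a quick illustration of the append-only point, here is a minimal sketch (the path /tmp/append-only.seq and the default local Configuration are my assumptions, not part of the original demo): appending the same key twice simply produces two records, nothing is overwritten.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AppendOnlyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/append-only.seq");   // hypothetical path
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            writer.append(new Text("k1"), new Text("v1"));
            // Appending the same key again does not overwrite "v1";
            // a reader will simply see two records with key "k1".
            writer.append(new Text("k1"), new Text("v2"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}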
2. SequenceFile compression states. There are three:
1. Uncompressed – no compression is applied.
2. Record compressed – the value of each record is compressed (which compression codec was used is recorded in the file header).
3. Block compressed – once the buffered data reaches a certain size, writing pauses and the data is compressed as a whole; the block is built by packing all of the key lengths, keys, value lengths and values together and compressing them in one go.
3. Structure:
3.1 Header data: stores the file's compression-state flags;
3.2 Metadata data: simple attribute/value pairs that carry extra information about the file. Metadata is written when the file is created, so it cannot be changed afterwards;
3.3 The appended key/value pair data;
3.4 Stream storage structure: the stream's header has the following byte layout:
Header:
* A 3-byte magic header "SEQ", followed by one byte giving the actual version number, so the start of the file reads "SEQ4" or "SEQ6". // I've forgotten the details of how this is handled; I'll come back with a fuller explanation
* keyClassName – the class name of the keys
* valueClassName – the class name of the values
* compression – a boolean flag stating whether compression is turned on for the keys/values in this file
* blockCompression – a boolean flag stating whether block compression is turned on for the keys/values in this file
* compression codec – the CompressionCodec class used for compression; for example, I compress with gzip, for which Hadoop provides GzipCodec.
* metadata – this is the metadata we saw above.
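To make the compression states and the metadata concrete, here is a hedged sketch of creating a block-compressed SequenceFile with GzipCodec plus one metadata pair. The path, the metadata key, and the exact createWriter overload (which varies a little across Hadoop versions) are my assumptions, not from the original post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/block-compressed.seq");   // hypothetical path

        // File-level metadata is written once, at creation time, and cannot be changed later.
        SequenceFile.Metadata meta = new SequenceFile.Metadata();
        meta.set(new Text("source"), new Text("demo"));      // hypothetical attribute/value pair

        // CompressionType.BLOCK compresses batches of records; RECORD compresses each value;
        // NONE leaves the data uncompressed. The codec (here GzipCodec) is recorded in the header.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, Text.class,
                CompressionType.BLOCK, codec, null /* progressable */, meta);
        try {
            writer.append(new Text("key"), new Text("value"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}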
4. Extensions:
4.1 MapFile – a key/value lookup structure made up of a data file (/data) and an index file (/index). The data file contains all the key/value pairs to be stored, sorted by key; the index file contains a subset of the keys together with the positions of those keys in the data file.
4.2 SetFile – based on MapFile; it stores only keys, and the value is a fixed, immutable placeholder.
4.3 ArrayFile – also based on MapFile; as with an ordinary array, the key is the record's sequence number.
4.4 BloomMapFile – adds a /bloom file on top of MapFile, containing a binary filter table (a Bloom filter) that is updated each time a write operation completes.
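A minimal MapFile sketch follows, assuming a hypothetical directory /tmp/demo.map; it shows that keys must be appended in sorted order and that lookups go through the index file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/demo.map";   // hypothetical directory; MapFile creates /data and /index inside it

        // Keys must be appended in sorted order, otherwise MapFile.Writer throws an IOException.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        try {
            writer.append(new Text("a"), new Text("1"));
            writer.append(new Text("b"), new Text("2"));
        } finally {
            writer.close();
        }

        // Lookup by key uses the index to seek into the data file.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            Text value = new Text();
            reader.get(new Text("b"), value);
            System.out.println(value);   // prints "2"
        } finally {
            reader.close();
        }
    }
}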
5. Usage: the Writer and Reader objects handle appending to and reading from the file; see the demo at the link below for a full application. On the map side the data is received through SequenceFileInputFormat, and the map's key/value types should be consistent with those stored in the SequenceFile (see the job sketch after the code at the end of this post).
Http://www.linuxidc.com/Linux/2012-04/57840.htm
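For reference, reading a SequenceFile back follows the same Writer/Reader pattern; here is a small sketch (the input path comes from the command line, everything else is standard API):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0];   // path of an existing SequenceFile
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, new Path(uri), conf);
            // The key/value classes are recorded in the file header, so they can be instantiated generically.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}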
1) The first problem I hit: because the job runs on a cluster, the line in the code
String seqFsUrl = "hdfs://localhost:9000/user/mjiang/target-seq/sdfgz.seq"
wrongly uses localhost, so there was a persistent connection failure (Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s).).
So if the program cannot connect to Hadoop even though the cluster itself is fine, consider whether the address in the program is written wrongly.
2) Although the file name (or any other value) is used as the key and the file content is stored as the value in the SequenceFile, when the file is read with SequenceFileAsTextInputFormat the key that arrives is the first line of the file.
I haven't analysed the source code, so I'm not sure why; a minimal job setup using this input format is sketched below.
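For context, here is a hedged old-API job configuration using SequenceFileAsTextInputFormat (the input/output paths are hypothetical). This input format hands the mapper the key and value converted to Text via toString(), which is worth keeping in mind when checking what the key actually contains.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SeqAsTextJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SeqAsTextJob.class);
        // SequenceFileAsTextInputFormat converts each record's key and value to Text
        // by calling toString() on them, so the (identity) mapper receives (Text, Text).
        conf.setInputFormat(SequenceFileAsTextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path("target-seq"));        // hypothetical input directory
        FileOutputFormat.setOutputPath(conf, new Path("seq-as-text-out")); // hypothetical output directory
        JobClient.runJob(conf);   // identity mapper/reducer by default
    }
}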
3) SequenceFile should be able to handle .gz files (I haven't tested this; .gz files cannot be split into blocks when stored, so logically it ought to work).

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Note: the class name deliberately differs from org.apache.hadoop.io.SequenceFile to avoid a name clash.
public class SequeneceFile {

    public static void main(String[] args) throws IOException {

        // String seqFsUrl = "hdfs://localhost:9000/user/mjiang/target-seq/sdfgz.seq";
        String seqFsUrl = "user/mjiang/target-seq/sdfgz.seq";

        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://venus:9000");
        conf.set("hadoop.job.user", "mjiang");
        conf.set("mapred.job.tracker", "venus:9001");

        FileSystem fs = FileSystem.get(URI.create(seqFsUrl), conf);
        Path seqPath = new Path(seqFsUrl);

        Text key = new Text();
        Text value = new Text();

        String filesPath = "/home/mjiang/java/eclipse/hadoop/sequencefile/data/sdfgz/";
        File gzFilesDir = new File(filesPath);
        String[] gzFiles = gzFilesDir.list();
        int filesLen = gzFiles.length;

        SequenceFile.Writer writer = null;
        try {
            // Returns a SequenceFile.Writer instance; it needs the FileSystem, the Configuration
            // and the Path object that the data will be written to.
            writer = SequenceFile.createWriter(fs, conf, seqPath, NullWritable.class, value.getClass());

            for (int i = 0; i < 2; i++) {
                while (filesLen > 0) {
                    File gzFile = new File(filesPath + gzFiles[filesLen - 1]);
                    InputStream in = new BufferedInputStream(new FileInputStream(gzFile));
                    long len = gzFile.length();
                    byte[] buff = new byte[(int) len];
                    if ((len = in.read(buff)) != -1) {
                        value.set(buff);
                        // Append each record to the end of the SequenceFile.Writer instance.
                        writer.append(NullWritable.get(), value);
                    }
                    System.out.println(gzFiles[filesLen - 1]);
                    // key.clear();
                    value.clear();
                    IOUtils.closeStream(in);
                    filesLen--; // !!
                }
                // filesLen = 2;
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
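To tie this back to point 5 above: when the file written by the demo is consumed by a map task through SequenceFileInputFormat, the mapper's input types should match what the writer stored, i.e. NullWritable keys and Text values. A hedged old-API sketch under that assumption (the output path and the mapper's trivial logic are mine, not from the original post):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ReadSeqGzJob {
    // The mapper's input types mirror what the writer above stored: NullWritable keys, Text values.
    public static class GzMapper extends MapReduceBase
            implements Mapper<NullWritable, Text, Text, Text> {
        public void map(NullWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            // Each value holds the whole content of one original .gz file; just report its size here.
            out.collect(new Text("bytes"), new Text(Integer.toString(value.getLength())));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReadSeqGzJob.class);
        conf.setMapperClass(GzMapper.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path("user/mjiang/target-seq"));   // directory of the .seq file from the demo
        FileOutputFormat.setOutputPath(conf, new Path("target-seq-out"));          // hypothetical output path
        JobClient.runJob(conf);
    }
}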