Concept: SequenceFile is a flat storage file consisting of a binary-serialized key/value byte stream, and it can be used as the input/output format of a map/reduce job. During a map/reduce job, the temporary outputs of the map phase are stored using SequenceFile. So, in general, SequenceFiles are raw files generated in the FileSystem to be consumed by map invocations.
1. SequenceFile features: it is an important Hadoop data file type that provides key/value storage. Unlike traditional key/value stores (such as hash tables or B-trees), it is append-only: you cannot overwrite a key that has already been written.
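As a quick illustration of the append-only point, here is a minimal sketch (the path /tmp/append-only.seq and the default local Configuration are my assumptions, not part of the original demo): appending the same key twice simply produces two records, nothing is overwritten.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AppendOnlyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/append-only.seq");   // hypothetical path
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            writer.append(new Text("k1"), new Text("v1"));
            // Appending the same key again does not overwrite "v1";
            // a reader will simply see two records with key "k1".
            writer.append(new Text("k1"), new Text("v2"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}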
2. SequenceFile compression states. There are three:
1. Uncompressed – no compression is applied.
2. Record compressed – the value of each record is compressed (which compression codec was used is recorded in the file header).
3. Block compressed – once the buffered data reaches a certain size, writing pauses and the data is compressed as a whole; the block is built by packing all of the key lengths, keys, value lengths and values together and compressing them in one go.
3. Structure:
3.1 Header data: stores the file's compression-state flags;
3.2 Metadata data: simple attribute/value pairs that carry extra information about the file. Metadata is written when the file is created, so it cannot be changed afterwards;
3.3 The appended key/value pair data;
3.4 Stream storage structure: the stream's header has the following byte layout:
Header:
* A 3-byte magic header "SEQ", followed by one byte giving the actual version number, so the start of the file reads "SEQ4" or "SEQ6". // I've forgotten the details of how this is handled; I'll come back with a fuller explanation
* keyClassName – the class name of the keys
* valueClassName – the class name of the values
* compression – a boolean flag stating whether compression is turned on for the keys/values in this file
* blockCompression – a boolean flag stating whether block compression is turned on for the keys/values in this file
* compression codec – the CompressionCodec class used for compression; for example, I compress with gzip, for which Hadoop provides GzipCodec.
* metadata – this is the metadata we saw above.
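To make the compression states and the metadata concrete, here is a hedged sketch of creating a block-compressed SequenceFile with GzipCodec plus one metadata pair. The path, the metadata key, and the exact createWriter overload (which varies a little across Hadoop versions) are my assumptions, not from the original post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/block-compressed.seq");   // hypothetical path

        // File-level metadata is written once, at creation time, and cannot be changed later.
        SequenceFile.Metadata meta = new SequenceFile.Metadata();
        meta.set(new Text("source"), new Text("demo"));      // hypothetical attribute/value pair

        // CompressionType.BLOCK compresses batches of records; RECORD compresses each value;
        // NONE leaves the data uncompressed. The codec (here GzipCodec) is recorded in the header.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, Text.class,
                CompressionType.BLOCK, codec, null /* progressable */, meta);
        try {
            writer.append(new Text("key"), new Text("value"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}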
4. Extensions:
4.1 MapFile – a key/value lookup structure made up of a data file (/data) and an index file (/index). The data file contains all the key/value pairs to be stored, sorted by key; the index file contains a subset of the keys together with the positions of those keys in the data file.
4.2 SetFile – based on MapFile; it stores only keys, and the value is a fixed, immutable placeholder.
4.3 ArrayFile – also based on MapFile; as with an ordinary array, the key is the record's sequence number.
4.4 BloomMapFile – adds a /bloom file on top of MapFile, containing a binary filter table (a Bloom filter) that is updated each time a write operation completes.
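A minimal MapFile sketch follows, assuming a hypothetical directory /tmp/demo.map; it shows that keys must be appended in sorted order and that lookups go through the index file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/demo.map";   // hypothetical directory; MapFile creates /data and /index inside it

        // Keys must be appended in sorted order, otherwise MapFile.Writer throws an IOException.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        try {
            writer.append(new Text("a"), new Text("1"));
            writer.append(new Text("b"), new Text("2"));
        } finally {
            writer.close();
        }

        // Lookup by key uses the index to seek into the data file.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        try {
            Text value = new Text();
            reader.get(new Text("b"), value);
            System.out.println(value);   // prints "2"
        } finally {
            reader.close();
        }
    }
}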
5. Usage: the Writer and Reader objects handle appending to and reading from the file; see the demo at the link below for a full application. On the map side the data is received through SequenceFileInputFormat, and the map's key/value types should be consistent with those stored in the SequenceFile (see the job sketch after the code at the end of this post).
Http://www.linuxidc.com/Linux/2012-04/57840.htm
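For reference, reading a SequenceFile back follows the same Writer/Reader pattern; here is a small sketch (the input path comes from the command line, everything else is standard API):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0];   // path of an existing SequenceFile
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, new Path(uri), conf);
            // The key/value classes are recorded in the file header, so they can be instantiated generically.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}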
1) The first problem I hit: because the job runs on a cluster, the line in the code
String seqFsUrl = "hdfs://localhost:9000/user/mjiang/target-seq/sdfgz.seq"
wrongly uses localhost, so there was a persistent connection failure (Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s).).
So if the program cannot connect to Hadoop even though the cluster itself is fine, consider whether the address in the program is written wrongly.
2) Although the file name (or any other value) is used as the key and the file content is stored as the value in the SequenceFile, when the file is read with SequenceFileAsTextInputFormat the key that arrives is the first line of the file.
I haven't analysed the source code, so I'm not sure why; a minimal job setup using this input format is sketched below.
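For context, here is a hedged old-API job configuration using SequenceFileAsTextInputFormat (the input/output paths are hypothetical). This input format hands the mapper the key and value converted to Text via toString(), which is worth keeping in mind when checking what the key actually contains.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class SeqAsTextJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SeqAsTextJob.class);
        // SequenceFileAsTextInputFormat converts each record's key and value to Text
        // by calling toString() on them, so the (identity) mapper receives (Text, Text).
        conf.setInputFormat(SequenceFileAsTextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path("target-seq"));        // hypothetical input directory
        FileOutputFormat.setOutputPath(conf, new Path("seq-as-text-out")); // hypothetical output directory
        JobClient.runJob(conf);   // identity mapper/reducer by default
    }
}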
3) SequenceFile should be able to handle .gz files (I haven't tested this; .gz files cannot be split into blocks when stored, so logically it ought to work).

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Note: the class name deliberately differs from org.apache.hadoop.io.SequenceFile to avoid a name clash.
public class SequeneceFile {

    public static void main(String[] args) throws IOException {

        // String seqFsUrl = "hdfs://localhost:9000/user/mjiang/target-seq/sdfgz.seq";
        String seqFsUrl = "user/mjiang/target-seq/sdfgz.seq";

        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://venus:9000");
        conf.set("hadoop.job.user", "mjiang");
        conf.set("mapred.job.tracker", "venus:9001");

        FileSystem fs = FileSystem.get(URI.create(seqFsUrl), conf);
        Path seqPath = new Path(seqFsUrl);

        Text key = new Text();
        Text value = new Text();

        String filesPath = "/home/mjiang/java/eclipse/hadoop/sequencefile/data/sdfgz/";
        File gzFilesDir = new File(filesPath);
        String[] gzFiles = gzFilesDir.list();
        int filesLen = gzFiles.length;

        SequenceFile.Writer writer = null;
        try {
            // Returns a SequenceFile.Writer instance; it needs the FileSystem, the Configuration
            // and the Path object that the data will be written to.
            writer = SequenceFile.createWriter(fs, conf, seqPath, NullWritable.class, value.getClass());

            for (int i = 0; i < 2; i++) {
                while (filesLen > 0) {
                    File gzFile = new File(filesPath + gzFiles[filesLen - 1]);
                    InputStream in = new BufferedInputStream(new FileInputStream(gzFile));
                    long len = gzFile.length();
                    byte[] buff = new byte[(int) len];
                    if ((len = in.read(buff)) != -1) {
                        value.set(buff);
                        // Append each record to the end of the SequenceFile.Writer instance.
                        writer.append(NullWritable.get(), value);
                    }
                    System.out.println(gzFiles[filesLen - 1]);
                    // key.clear();
                    value.clear();
                    IOUtils.closeStream(in);
                    filesLen--; // !!
                }
                // filesLen = 2;
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
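To tie this back to point 5 above: when the file written by the demo is consumed by a map task through SequenceFileInputFormat, the mapper's input types should match what the writer stored, i.e. NullWritable keys and Text values. A hedged old-API sketch under that assumption (the output path and the mapper's trivial logic are mine, not from the original post):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ReadSeqGzJob {
    // The mapper's input types mirror what the writer above stored: NullWritable keys, Text values.
    public static class GzMapper extends MapReduceBase
            implements Mapper<NullWritable, Text, Text, Text> {
        public void map(NullWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            // Each value holds the whole content of one original .gz file; just report its size here.
            out.collect(new Text("bytes"), new Text(Integer.toString(value.getLength())));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReadSeqGzJob.class);
        conf.setMapperClass(GzMapper.class);
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path("user/mjiang/target-seq"));   // directory of the .seq file from the demo
        FileOutputFormat.setOutputPath(conf, new Path("target-seq-out"));          // hypothetical output path
        JobClient.runJob(conf);
    }
}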