File-based data structures
Two file formats:
1. SequenceFile
2. MapFile
SequenceFile
1. A SequenceFile is a flat file designed by Hadoop to store binary <key, value> pairs.
2. A SequenceFile can be used as a container: packing many small files into one SequenceFile lets them be stored and processed efficiently.
3. SequenceFile does not sort records by their keys; its inner class SequenceFile.Writer only provides append functionality.
4. The key and value in a SequenceFile can be any Writable type, including user-defined Writable types.
SequenceFile Compression
1. The internal format of a SequenceFile depends on whether compression is enabled, and if so, whether it is record compression or block compression.
2. Three types:
A. Uncompressed: if compression is not enabled (the default), each record consists of its record length (in bytes), the key length, the key, and the value. Each length field is four bytes.
B. Record compression: the same layout as the uncompressed format, except that the value bytes are compressed with the codec defined in the header. Note that the key is not compressed.
C. Block compression: compresses multiple records at once, so it is more compact than record compression and is generally preferred. Records are collected until the block reaches a minimum size, defined by the io.seqfile.compress.blocksize property (default 1000000 bytes). The block format is: record count, key lengths, keys, value lengths, values.
(Figure: uncompressed and record-compressed record formats)
(Figure: block-compressed format)
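The uncompressed record layout described above can be sketched in plain Java. This is an illustrative simplification, not the real on-disk format (an actual SequenceFile also has a file header, metadata, and sync markers); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of the uncompressed record layout:
// record length (4 bytes), key length (4 bytes), key bytes, value bytes.
public class RecordLayoutSketch {
    static byte[] encodeRecord(byte[] key, byte[] value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(key.length + value.length); // record length: key + value bytes
            out.writeInt(key.length);                // key length
            out.write(key);                          // key, uncompressed
            out.write(value);                        // value, uncompressed
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);       // cannot happen for an in-memory stream
        }
    }

    public static void main(String[] args) {
        byte[] key = "k1".getBytes(StandardCharsets.UTF_8);               // 2 bytes
        byte[] value = "apache,software".getBytes(StandardCharsets.UTF_8); // 15 bytes
        // 4 (record length) + 4 (key length) + 2 + 15 = 25 bytes total
        System.out.println(encodeRecord(key, value).length);
    }
}
```

Record compression would apply the codec only to the value bytes before writing them; block compression batches many records and compresses keys and values as separate blocks.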
Advantages of the SequenceFile format:
A. Supports data compression based on records or blocks.
B. Splittable, so it can serve as input splits for MapReduce.
C. Simple to modify: the main work is in the business logic, not in the details of the storage format.
Disadvantages of the SequenceFile format:
A merge step is needed to pack the files, and the merged file cannot be viewed directly because it is binary.
Read/write SequenceFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a SequenceFile.Reader for reading
5) Get the keyClass and valueClass and iterate over the records
6) Close the stream
From the org.apache.hadoop.io class SequenceFile: there are three SequenceFile Writers, based on the SequenceFile.CompressionType used to compress key/value pairs:
1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
No compression, record compression, and block compression example
package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;

public class Demo01 {
    final static String uri = "hdfs://liguodong:8020/liguodong";
    final static String[] data = {"Apache,software", "Chinese,good", "James,nba", "Index,pass"};

    public static void main(String[] args) throws IOException {
        // 1) Create a configuration
        Configuration configuration = new Configuration();
        // 2) Get the filesystem
        FileSystem fs = FileSystem.get(URI.create(uri), configuration);
        // 3) Create the file path
        Path path = new Path("/tmp.seq");
        write(fs, configuration, path);
        read(fs, configuration, path);
    }

    public static void write(FileSystem fs, Configuration configuration, Path path) throws IOException {
        IntWritable key = new IntWritable();
        Text value = new Text();

        // 4) Create the SequenceFile.Writer
        // No compression:
        // SequenceFile.Writer writer = SequenceFile.createWriter(
        //         fs, configuration, path, key.getClass(), value.getClass());

        // Record compression:
        @SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, configuration, path, key.getClass(), value.getClass(),
                CompressionType.RECORD, new BZip2Codec());

        // Block compression:
        // SequenceFile.Writer writer = SequenceFile.createWriter(
        //         fs, configuration, path, key.getClass(), value.getClass(),
        //         CompressionType.BLOCK, new BZip2Codec());

        // 5) Append records
        for (int i = 0; i < 30; i++) {
            key.set(100 - i);
            value.set(data[i % data.length]);
            writer.append(key, value);
        }
        // 6) Close the stream
        IOUtils.closeStream(writer);
    }

    public static void read(FileSystem fs, Configuration configuration, Path path) throws IOException {
        // 4) Create the SequenceFile.Reader
        @SuppressWarnings("deprecation")
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, configuration);
        // 5) Get the key and value classes and iterate over the records
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), configuration);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), configuration);
        while (reader.next(key, value)) {
            System.out.println("key = " + key);
            System.out.println("value = " + value);
            System.out.println("position = " + reader.getPosition());
        }
        // 6) Close the stream
        IOUtils.closeStream(reader);
    }
}
Execution result:
key = 100  value = Apache,software  position = 164
key = 99   value = Chinese,good     position = 197
key = 98   value = James,nba        position = 227
key = 97   value = Index,pass       position = 258
key = 96   value = Apache,software  position = 294
key = 95   value = Chinese,good     position = 327
...
key = 72   value = Apache,software  position = 1074
key = 71   value = Chinese,good     position = 1107
MapFile
public class MapFile {
    /** The name of the index file. */
    public static final String INDEX_FILE_NAME = "index";
    /** The name of the data file. */
    public static final String DATA_FILE_NAME = "data";
}
A MapFile is a sorted, indexed SequenceFile that supports lookups by key.
Unlike SequenceFile, the MapFile key must implement the WritableComparable interface, i.e. the keys must be comparable; the value is a Writable type.
The MapFile.fix() method can be used to rebuild the index, converting a SequenceFile into a MapFile.
It has two static member variables:
static final String INDEX_FILE_NAME
static final String DATA_FILE_NAME
Looking at its folder structure, we can see that a MapFile consists of two parts: data and index.
The index is an index of the data file; it records the key of each indexed record and that record's offset in the data file.
When the MapFile is accessed, the index file is loaded into memory, and the index mapping is used to quickly locate the position of the requested record in the data file.
So, compared with SequenceFile, MapFile retrieval is efficient; the disadvantage is that some memory is consumed to store the index data.
Note that the MapFile does not index every record: by default it stores one index entry for every 128 records. This interval can be changed with MapFile.Writer's setIndexInterval() method, or by changing the io.map.index.interval property.
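The sparse-index scheme above can be modeled in plain Java. This is a toy in-memory sketch to illustrate the idea (the class name and structure are hypothetical; a real MapFile stores the index and data as two SequenceFiles on HDFS): only every 128th key is indexed, and a lookup binary-searches the index, then scans forward in the data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of MapFile's sparse index (hypothetical names, in-memory only).
public class SparseIndexSketch {
    static final int INDEX_INTERVAL = 128; // mirrors io.map.index.interval

    long[] keys;                             // sorted keys in the "data" part
    List<Integer> index = new ArrayList<>(); // positions of every 128th record

    SparseIndexSketch(long[] sortedKeys) {
        this.keys = sortedKeys;
        for (int i = 0; i < sortedKeys.length; i += INDEX_INTERVAL) {
            index.add(i); // only 1 of every 128 keys is indexed
        }
    }

    // Lookup: binary search the in-memory index for the last indexed key
    // <= the target, then scan at most INDEX_INTERVAL records forward.
    int get(long key) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[index.get(mid)] <= key) { start = index.get(mid); lo = mid + 1; }
            else hi = mid - 1;
        }
        for (int i = start; i < keys.length && i < start + INDEX_INTERVAL; i++) {
            if (keys[i] == key) return i; // position of the record in "data"
        }
        return -1; // key not present
    }

    public static void main(String[] args) {
        long[] ks = new long[1000];
        for (int i = 0; i < 1000; i++) ks[i] = i * 2L; // even keys 0..1998
        SparseIndexSketch m = new SparseIndexSketch(ks);
        System.out.println(m.index.size()); // 8 index entries for 1000 records
        System.out.println(m.get(500));     // key 500 is record 250
        System.out.println(m.get(501));     // odd key absent -> -1
    }
}
```

This shows the trade-off mentioned above: a larger interval uses less memory for the index but lengthens the sequential scan after the binary search.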
Read/write MapFile
Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input path
4) Create a MapFile.Reader for reading
5) Get the keyClass and valueClass and iterate over the records
6) Close the stream
The detailed operations are similar to SequenceFile.
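The steps above can be sketched with the same deprecated Hadoop 1.x-style constructors used in the SequenceFile example. This is a hedged sketch, not tested against a live cluster; the host name, paths, and record contents are placeholders.

```java
package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://liguodong:8020/liguodong"), conf);
        String dir = "/tmp.map"; // a MapFile is a directory holding "data" and "index"

        // Write: keys must be appended in sorted order, or append() throws IOException.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
        IntWritable key = new IntWritable();
        Text value = new Text();
        for (int i = 0; i < 30; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
        }
        IOUtils.closeStream(writer);

        // Read: get() uses the in-memory index to seek directly to a key.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        if (reader.get(new IntWritable(7), value) != null) {
            System.out.println("value = " + value);
        }
        IOUtils.closeStream(reader);
    }
}
```

The sorted-keys requirement is the key difference from the SequenceFile example, where keys were appended in descending order; MapFile.Writer would reject that input.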
Viewing binary files from the command line
hdfs dfs -text /liguodong/tmp.seq