Hadoop file-based data structures and examples

Tags: hdfs, dfs



File-based data structures
Two file formats:
1. SequenceFile
2. MapFile


SequenceFile


1. SequenceFile is a flat file designed by Hadoop to store binary <key, value> pairs.

2. A SequenceFile can be used as a container: packing many small files into one SequenceFile lets them be stored and processed efficiently (see the sketch after this list).

3. SequenceFile records are not sorted by key; the internal Writer class provides append-only writes.

4. The key and value in a SequenceFile can be any Writable type, including custom Writable types.
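
A hedged sketch of point 2: the code below packs every file under a hypothetical /small directory into one SequenceFile, with the file name as key and the raw bytes as value. The namenode address and paths are assumptions, not part of the original article:

package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {

    // Hypothetical namenode address; adjust to your cluster.
    final static String uri = "hdfs://localhost:8020";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Text key = new Text();                     // file name
        BytesWritable value = new BytesWritable(); // file contents

        @SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/packed.seq"), Text.class, BytesWritable.class);
        try {
            // One record per small file: name -> bytes.
            for (FileStatus stat : fs.listStatus(new Path("/small"))) {
                byte[] bytes = new byte[(int) stat.getLen()];
                try (FSDataInputStream in = fs.open(stat.getPath())) {
                    in.readFully(bytes);
                }
                key.set(stat.getPath().getName());
                value.set(bytes, 0, bytes.length);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}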

SequenceFile Compression


1. The internal format of a SequenceFile depends on whether compression is enabled, and if it is, whether it is record compression or block compression.
2. Three types:
A. No compression (the default): each record consists of its record length (in bytes), the key length, the key, and the value. The length fields are four bytes each.
B. Record compression: essentially the same layout as the uncompressed format, except that the value bytes are compressed with the codec defined in the header. Note that keys are not compressed.
C. Block compression: multiple records are compressed at once, so it is more compact than record compression and is generally preferred. Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property (default 1,000,000 bytes). The format is: record count, key length, key, value length, value. (A snippet for tuning the block size follows the figures below.)



[Figure: no-compression and record-compression record layouts]

[Figure: block-compression layout]
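
A minimal sketch of tuning the io.seqfile.compress.blocksize threshold described in type C; the 4,000,000-byte figure is an arbitrary assumption:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Raise the minimum block size from the 1,000,000-byte default so that
        // more records are collected and compressed together per block.
        conf.setInt("io.seqfile.compress.blocksize", 4000000);
        System.out.println(conf.get("io.seqfile.compress.blocksize"));
    }
}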

Benefits of the SequenceFile format:
A. Supports data compression at record or block granularity.
B. Splittable, so it can serve as input splits for MapReduce (see the sketch after the disadvantages below).
C. Simple to modify: your code only needs to change the business logic, not the underlying storage format.

Disadvantages of the SequenceFile format:
Small files must first be merged into a SequenceFile, and the merged file is inconvenient to inspect because it is binary (the command-line section at the end shows one workaround).
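
A hedged sketch of benefit B, pointing a MapReduce job at a SequenceFile input; the job name and output path are assumptions, and no custom mapper or reducer is set:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeqInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "seq-input");
        job.setJarByClass(SeqInputJob.class);
        // Each split of the SequenceFile becomes a map task's input;
        // keys and values are deserialized automatically
        // (IntWritable/Text here, matching the demo below).
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/liguodong/tmp.seq"));
        FileOutputFormat.setOutputPath(job, new Path("/seq-out"));
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}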



Read/write SequenceFile

Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output Path
4) Call SequenceFile.createWriter to get a SequenceFile.Writer object
5) Call SequenceFile.Writer.append to append records to the file
6) Close the stream
Read process:

1) Create a Configuration
2) Get the FileSystem
3) Create the file input Path
4) Create a SequenceFile.Reader for reading
5) Get the key class and value class, and iterate over the records
6) Close the stream






From the org.apache.hadoop.io class SequenceFile Javadoc:
There are three SequenceFile Writers, based on the SequenceFile.CompressionType used to compress key/value pairs:
1. Writer: uncompressed records.
2. RecordCompressWriter: record-compressed files; only values are compressed.
3. BlockCompressWriter: block-compressed files; both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
package sequencefile;

import java.io.IOException;
import java.net.URI;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;

public class Demo01 {

    final static String uri = "hdfs://liguodong:8020/liguodong";
    final static String[] data = {
        "apache,software","chinese,good","james,NBA","index,pass"
    };

    public static void main(String[] args) throws IOException {
        // 1) Create a configuration
        Configuration configuration = new Configuration();
        // 2) Get the filesystem
        FileSystem fs = FileSystem.get(URI.create(uri),configuration);
        // 3) Create the file output path (matches the hdfs dfs -text path at the end)
        Path path = new Path("/liguodong/tmp.seq");

        write(fs,configuration,path);
        read(fs,configuration,path);

    }

    public static void write(FileSystem fs,Configuration configuration,Path path) throws IOException{
        // 4) Get a SequenceFile.Writer; pick one of the three compression types below
        IntWritable key = new IntWritable();
        Text value = new Text();
        //No compression
        /*@SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter
                (fs,configuration,path,key.getClass(),value.getClass());*/
        //Record compression
        @SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter
                (fs,configuration,path,key.getClass(),
                        value.getClass(), CompressionType.RECORD, new BZip2Codec());
        //Block compression
        /*@SuppressWarnings("deprecation")
        SequenceFile.Writer writer = SequenceFile.createWriter
                (fs,configuration,path,key.getClass(),
                value.getClass(),CompressionType.BLOCK,new BZip2Codec());*/

        // 5) Append key/value pairs to the file
        for (int i = 0; i < 30; i++) {
            key.set(100 - i);
            value.set(data[i % data.length]);
            writer.append(key, value);
        }
        // 6) Close the stream
        IOUtils.closeStream(writer);
    }

    public static void read(FileSystem fs,Configuration configuration,Path path) throws IOException {
        // 4) Create a SequenceFile.Reader
        @SuppressWarnings("deprecation")
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path,configuration);
        // 5) Instantiate key and value objects from the reader's key/value classes
        Writable key = (Writable) ReflectionUtils.newInstance
                (reader.getKeyClass(), configuration);
        Writable value = (Writable) ReflectionUtils.newInstance
                (reader.getValueClass(), configuration);

        while(reader.next(key,value)){
            System.out.println("key = "+ key);
            System.out.println("value = "+ value);
            System.out.println("position = "+ reader.getPosition());
        }
        IOUtils.closeStream(reader);
    }
} 


Operation Result:


key = 100
value = apache,software
position = 164
key = 99
value = chinese,good
position = 197
key = 98
value = james,NBA
position = 227
key = 97
value = index,pass
position = 258
key = 96
value = apache,software
position = 294
key = 95
value = chinese,good
position = 327
......
key = 72
value = apache,software
position = 1074
key = 71
value = chinese,good
position = 1107


MapFile


 
public class MapFile {
  /** The name of the index file. */
  public static final String INDEX_FILE_NAME = "index";

  /** The name of the data file. */
  public static final String DATA_FILE_NAME = "data";
}


A MapFile is a sorted, indexed SequenceFile that supports lookups by key.

Unlike SequenceFile, the MapFile key must implement the WritableComparable interface, i.e. keys must be comparable; the value is a Writable type.
You can use the MapFile.fix() method to rebuild the index and convert a SequenceFile into a MapFile (a hedged sketch follows).
It has two static member variables, INDEX_FILE_NAME and DATA_FILE_NAME, shown in the snippet above.
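
A hedged sketch of the MapFile.fix() conversion just described: the SequenceFile is moved into a new directory as its data file, then the index is rebuilt. The paths mirror the demo above but are assumptions; note that fix() expects the data records to already be sorted by key:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class FixDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://liguodong:8020"), conf);

        // A MapFile is a directory holding a "data" file and an "index" file.
        Path map = new Path("/liguodong/map");
        fs.mkdirs(map);
        fs.rename(new Path("/liguodong/tmp.seq"), new Path(map, MapFile.DATA_FILE_NAME));

        // Rebuild the index; returns the number of entries indexed.
        long entries = MapFile.fix(fs, map, IntWritable.class, Text.class, false, conf);
        System.out.println("indexed entries = " + entries);
    }
}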


Looking at its directory structure, a MapFile consists of two parts: data and index.
The index file stores the key of selected records and the byte offset of each such record within the data file.


When a MapFile is accessed, the index file is loaded into memory, and the index mapping is used to quickly locate the offset of the requested record in the data file.
MapFile retrieval is therefore efficient compared with SequenceFile; the disadvantage is that some memory is consumed to store the index.
Note that a MapFile does not index every record: by default it stores one index entry for every 128 records. The interval can be changed via the MapFile.Writer setIndexInterval() method, or via the io.map.index.interval property (see the snippet below).
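
A minimal sketch of both ways to change the interval; the directory and the 256-record interval are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class IndexIntervalDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Option 1: set the property before the writer is created.
        conf.setInt("io.map.index.interval", 256);

        FileSystem fs = FileSystem.get(conf);
        @SuppressWarnings("deprecation")
        MapFile.Writer writer = new MapFile.Writer(conf, fs, "/liguodong/map",
                IntWritable.class, Text.class);
        // Option 2: set it on the writer, before appending any records.
        writer.setIndexInterval(256); // one index entry per 256 records
        writer.close();
    }
}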


Read/write MapFile


Write process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file output Path
4) Create a MapFile.Writer object
5) Call MapFile.Writer.append to append records to the file
6) Close the stream
Read process:
1) Create a Configuration
2) Get the FileSystem
3) Create the file input Path
4) Create a MapFile.Reader for reading
5) Get the key class and value class, and iterate over the records
6) Close the stream

The specific operations are similar to SequenceFile; a minimal sketch follows.
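
A minimal sketch mirroring the SequenceFile demo above; the directory and values are assumptions. Note that MapFile.Writer requires keys to be appended in ascending order, or append() throws an IOException:

package sequencefile;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class Demo02 {

    final static String uri = "hdfs://liguodong:8020/liguodong";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // A MapFile path is a directory containing "data" and "index".
        String dir = "/liguodong/map";

        write(fs, conf, dir);
        read(fs, conf, dir);
    }

    public static void write(FileSystem fs, Configuration conf, String dir) throws IOException {
        IntWritable key = new IntWritable();
        Text value = new Text();
        @SuppressWarnings("deprecation")
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
                IntWritable.class, Text.class);
        // Keys appended in ascending order, as MapFile requires.
        for (int i = 0; i < 30; i++) {
            key.set(i);
            value.set("value-" + i);
            writer.append(key, value);
        }
        IOUtils.closeStream(writer);
    }

    public static void read(FileSystem fs, Configuration conf, String dir) throws IOException {
        @SuppressWarnings("deprecation")
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        // Random access by key, served from the in-memory index.
        Text value = new Text();
        reader.get(new IntWritable(7), value);
        System.out.println("key 7 -> " + value);
        // Or rewind and scan sequentially, like a SequenceFile.
        reader.reset();
        IntWritable key = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key + " = " + value);
        }
        IOUtils.closeStream(reader);
    }
}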


Viewing a binary file from the command line
hdfs dfs -text /liguodong/tmp.seq

The -text option recognizes SequenceFiles and prints each key and value as text, which works around the binary-format drawback noted above.



