Using Hadoop Avro to handle a large number of small files

Source: Internet
Author: User

Storing a large number of small files directly in HDFS has two drawbacks:
1. The Hadoop NameNode keeps the metadata for every file in memory. According to statistics, each file consumes about 600 bytes of NameNode memory, so a large number of small files puts great pressure on the NameNode.
2. If you process small files with Hadoop MapReduce, the number of mappers grows linearly with the number of small files (note: by default, FileInputFormat splits only files that are larger than the HDFS block size). If there are a great many small files, MapReduce spends a lot of time creating and destroying map tasks.
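To make these two costs concrete, here is a rough back-of-the-envelope sketch. It takes the figure of about 600 bytes per file from the text and assumes one map task per small file; the class name, method names, and the file count of ten million are hypothetical values chosen only for illustration.

```java
class SmallFileCost {
    // Figure quoted in the text: roughly 600 bytes of NameNode memory per file.
    static final long BYTES_PER_FILE = 600L;

    // Estimated NameNode memory, in bytes, for the given number of files.
    public static long nameNodeBytes(long fileCount) {
        return fileCount * BYTES_PER_FILE;
    }

    // Simplifying assumption: a file smaller than one block is never split,
    // so every small file gets its own map task.
    public static long mapperCount(long smallFileCount) {
        return smallFileCount;
    }

    public static void main(String[] args) {
        long files = 10_000_000L; // hypothetical number of small files
        double gib = nameNodeBytes(files) / (1024.0 * 1024.0 * 1024.0);
        System.out.println(files + " files need about " + gib + " GiB of NameNode memory");
        System.out.println(files + " files need " + mapperCount(files) + " map tasks");
    }
}
```

Under these assumptions, ten million small files cost about 6 billion bytes of NameNode memory and ten million map tasks, whereas the same data packed into a few large files costs only a few metadata entries.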
To solve the small-files problem, we can pack many small files into one large file. Apache Avro is a language-independent data serialization system. Conceptually, Avro has two parts: the schema and the data (typically binary). Schemas are generally written in JSON. Avro also defines its own data types, as shown in the following tables:

Avro primitive data types

Type      Description                                     Schema
null      The absence of a value                          "null"
boolean   A binary value                                  "boolean"
int       32-bit signed integer                           "int"
long      64-bit signed integer                           "long"
float     32-bit single-precision floating-point number   "float"
double    64-bit double-precision floating-point number   "double"
bytes     Byte array                                      "bytes"
string    Unicode string                                  "string"
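In Avro's Java generic API, a bytes value is represented as a java.nio.ByteBuffer. The sketch below uses only the JDK to show how a byte array is wrapped for writing and how the readable bytes can be copied back out. The helper toByteArray and the class name are illustrative, not part of Avro; copying via remaining() is a defensive choice, because array() returns the whole backing array even when the buffer covers only part of it.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class BytesFieldDemo {
    // Copies only the readable bytes, without moving the buffer's own position.
    public static byte[] toByteArray(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] content = {1, 2, 3, 4};

        // Wrapping does not copy: the buffer shares the array it was given.
        ByteBuffer wrapped = ByteBuffer.wrap(content);
        System.out.println(Arrays.equals(content, toByteArray(wrapped)));

        // A buffer over part of the array: array() would still expose all four bytes.
        ByteBuffer part = ByteBuffer.wrap(content, 1, 2);
        System.out.println(Arrays.toString(toByteArray(part)));
        System.out.println(part.array().length);
    }
}
```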

Avro complex data types

Type: array
Description: An ordered collection of objects. All objects in a particular array must have the same schema.
Schema:
    {
      "type": "array",
      "items": "long"
    }

Type: map
Description: An unordered collection of key-value pairs. Keys must be strings and values may be any type, although within a particular map all values must have the same schema.
Schema:
    {
      "type": "map",
      "values": "string"
    }

Type: record
Description: A collection of named fields of any type.
Schema:
    {
      "type": "record",
      "name": "WeatherRecord",
      "doc": "A weather reading.",
      "fields": [
        {"name": "year", "type": "int"},
        {"name": "temperature", "type": "int"},
        {"name": "stationId", "type": "string"}
      ]
    }

Type: enum
Description: A set of named values.
Schema:
    {
      "type": "enum",
      "name": "Cutlery",
      "doc": "An eating utensil.",
      "symbols": ["KNIFE", "FORK", "SPOON"]
    }

Type: fixed
Description: A fixed number of 8-bit unsigned bytes.
Schema:
    {
      "type": "fixed",
      "name": "Md5Hash",
      "size": 16
    }

Type: union
Description: A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union.
Schema:
    [
      "null",
      "string",
      {"type": "map", "values": "string"}
    ]
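As a minimal sketch of how such schema definitions are used from Java, the following parses a record schema and a union schema with Avro's Schema.Parser and prints their structure. It assumes the Avro Java library is on the classpath; the class name ComplexSchemaDemo and the helper parse are illustrative.

```java
import org.apache.avro.Schema;

class ComplexSchemaDemo {
    // A record schema with three named fields, written as a JSON string.
    public static final String RECORD_JSON =
          "{\"type\": \"record\", \"name\": \"WeatherRecord\", \"doc\": \"A weather reading.\","
        + " \"fields\": ["
        + "{\"name\": \"year\", \"type\": \"int\"},"
        + "{\"name\": \"temperature\", \"type\": \"int\"},"
        + "{\"name\": \"stationId\", \"type\": \"string\"}]}";

    // A union is a JSON array whose elements are schemas.
    public static final String UNION_JSON =
        "[\"null\", \"string\", {\"type\": \"map\", \"values\": \"string\"}]";

    // A fresh parser per call keeps named types from clashing between schemas.
    public static Schema parse(String json) {
        return new Schema.Parser().parse(json);
    }

    public static void main(String[] args) {
        Schema record = parse(RECORD_JSON);
        System.out.println(record.getType() + " " + record.getName());
        for (Schema.Field field : record.getFields()) {
            System.out.println("  " + field.name() + ": " + field.schema().getType());
        }

        Schema union = parse(UNION_JSON);
        for (Schema branch : union.getTypes()) {
            System.out.println("union branch: " + branch.getType());
        }
    }
}
```

Parsing fails with an exception if the JSON is malformed or a required attribute such as a record's name is missing, so this is also a quick way to validate a hand-written schema.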

Using Avro, a program can pack small local files into one large file stored in HDFS, with each local small file becoming one Avro record. The specific program is shown in the following code:

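    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.commons.codec.digest.DigestUtils;
    import org.apache.commons.io.FileUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class Demo {
        public static final String FIELD_FILENAME = "filename";
        public static final String FIELD_CONTENTS = "contents";

        // Each record holds one small file: its path and its raw bytes.
        public static final String SCHEMA_JSON =
              "{\"type\": \"record\", \"name\": \"SmallFilesTest\", "
            + "\"fields\": ["
            + "{\"name\": \"" + FIELD_FILENAME + "\", \"type\": \"string\"},"
            + "{\"name\": \"" + FIELD_CONTENTS + "\", \"type\": \"bytes\"}]}";

        public static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

        public static void writeToAvro(File srcPath, OutputStream outputStream)
                throws IOException {
            DataFileWriter<Object> writer =
                new DataFileWriter<Object>(new GenericDatumWriter<Object>())
                    .setSyncInterval(100);
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(SCHEMA, outputStream);

            // Append every file directly under srcPath (non-recursive) as one record.
            for (Object obj : FileUtils.listFiles(srcPath, null, false)) {
                File file = (File) obj;
                String filename = file.getAbsolutePath();
                byte[] content = FileUtils.readFileToByteArray(file);

                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put(FIELD_FILENAME, filename);
                record.put(FIELD_CONTENTS, ByteBuffer.wrap(content));
                writer.append(record);

                // Print an MD5 digest so the contents can be checked after reading back.
                System.out.println(file.getAbsolutePath() + ": "
                    + DigestUtils.md5Hex(content));
            }

            IOUtils.cleanup(null, writer);
            IOUtils.cleanup(null, outputStream);
        }

        // args[0]: local source directory, args[1]: destination file in HDFS
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            FileSystem hdfs = FileSystem.get(config);

            File sourceDir = new File(args[0]);
            Path destFile = new Path(args[1]);

            OutputStream os = hdfs.create(destFile);
            writeToAvro(sourceDir, os);
        }
    }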

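The following program reads the Avro file back from HDFS and prints each record's file name together with the MD5 digest of its contents:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.commons.codec.digest.DigestUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Reader counterpart of the writer; keep it in a separate file or package,
    // since both classes are named Demo.
    public class Demo {
        private static final String FIELD_FILENAME = "filename";
        private static final String FIELD_CONTENTS = "contents";

        public static void readFromAvro(InputStream is) throws IOException {
            // The schema is stored in the Avro file itself, so none is passed here.
            DataFileStream<Object> reader =
                new DataFileStream<Object>(is, new GenericDatumReader<Object>());

            for (Object o : reader) {
                GenericRecord r = (GenericRecord) o;
                System.out.println(r.get(FIELD_FILENAME) + ": "
                    + DigestUtils.md5Hex(((ByteBuffer) r.get(FIELD_CONTENTS)).array()));
            }

            IOUtils.cleanup(null, reader);
            IOUtils.cleanup(null, is);
        }

        // args[0]: Avro file in HDFS
        public static void main(String... args) throws Exception {
            Configuration config = new Configuration();
            FileSystem hdfs = FileSystem.get(config);

            Path destFile = new Path(args[0]);
            InputStream is = hdfs.open(destFile);
            readFromAvro(is);
        }
    }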
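Both programs print each file's path followed by an MD5 digest of its contents, so comparing the two outputs is a quick way to check that every file survived the round trip. The sketch below reproduces that digest using only the JDK's MessageDigest, which should yield the same lowercase hex string as DigestUtils.md5Hex for the same bytes. The class name Md5Check and the sample input are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Check {
    // Returns the MD5 digest of the data as a 32-character lowercase hex string.
    public static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] restored = original.clone(); // stand-in for bytes read back from the Avro file

        System.out.println(md5Hex(original));
        System.out.println("match: " + md5Hex(original).equals(md5Hex(restored)));
    }
}
```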

