Using Hadoop Avro to handle a large number of small files
Storing a large number of small files directly in HDFS has two main disadvantages:

1. The Hadoop NameNode keeps the metadata of every file in memory. According to statistics, each file consumes roughly 600 bytes of NameNode memory, so ten million small files would occupy about 6 GB of heap for metadata alone; a large number of small files therefore puts great pressure on the NameNode.
2. If the small files are processed with Hadoop MapReduce, the number of mappers grows linearly with the number of files (note: by default, FileInputFormat splits only files larger than an HDFS block, so each small file gets its own mapper). With a very large number of small files, MapReduce spends a great deal of time creating and destroying map tasks.

To solve the small-files problem, we can pack many small files together into one large file. Apache Avro is a language-independent data serialization system. Conceptually, Avro has two parts: a schema and the data itself (typically binary). Schemas are usually described in JSON. Avro also defines its own data types, as shown in the tables below.
Avro primitive data types:

| Type | Description | Schema |
|------|-------------|--------|
| null | The absence of a value | "null" |
| boolean | A binary value | "boolean" |
| int | 32-bit signed integer | "int" |
| long | 64-bit signed integer | "long" |
| float | 32-bit single-precision floating-point number | "float" |
| double | 64-bit double-precision floating-point number | "double" |
| bytes | A sequence of 8-bit unsigned bytes | "bytes" |
| string | A sequence of Unicode characters | "string" |
Avro complex data types:

| Type | Description | Schema example |
|------|-------------|----------------|
| array | An ordered collection of objects. All objects in a particular array must have the same schema. | {"type": "array", "items": "long"} |
| map | An unordered collection of key-value pairs. Keys must be strings and values may be of any type, although within a particular map all values must have the same schema. | {"type": "map", "values": "string"} |
| record | A collection of named fields of any type. | {"type": "record", "name": "WeatherRecord", "doc": "A weather reading.", "fields": [{"name": "year", "type": "int"}, {"name": "temperature", "type": "int"}, {"name": "stationId", "type": "string"}]} |
| enum | A set of named values. | {"type": "enum", "name": "Cutlery", "doc": "An eating utensil.", "symbols": ["KNIFE", "FORK", "SPOON"]} |
| fixed | A fixed number of 8-bit unsigned bytes. | {"type": "fixed", "name": "Md5Hash", "size": 16} |
| union | A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union. | ["null", "string", {"type": "map", "values": "string"}] |
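To make the schema-plus-data split concrete, here is a minimal sketch (the WeatherRecordDemo class name and the weather.avro output file are illustrative, not from the article) that parses the WeatherRecord schema from the table above with Schema.Parser and writes one matching generic record to a local Avro data file:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WeatherRecordDemo {
    public static void main(String[] args) throws IOException {
        // Parse the record schema shown in the table above.
        Schema schema = new Schema.Parser().parse(
            "{\"type\": \"record\", \"name\": \"WeatherRecord\","
          + " \"fields\": ["
          + "  {\"name\": \"year\", \"type\": \"int\"},"
          + "  {\"name\": \"temperature\", \"type\": \"int\"},"
          + "  {\"name\": \"stationId\", \"type\": \"string\"}]}");

        // Build a generic record that conforms to the schema.
        GenericRecord record = new GenericData.Record(schema);
        record.put("year", 2014);
        record.put("temperature", 25);
        record.put("stationId", "s-001");

        // Write it to an Avro data file; the schema is embedded in the file header.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("weather.avro"));
        writer.append(record);
        writer.close();
    }
}
```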
Following this approach, a program can pack many small local files into one large Avro file saved in HDFS, with each local small file becoming one Avro record. The specific program is shown in the following code:
```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Demo {
    public static final String FIELD_FILENAME = "filename";
    public static final String FIELD_CONTENTS = "contents";

    // Each packed record carries the original path and the raw file contents.
    private static final String SCHEMA_JSON =
        "{\"type\": \"record\", \"name\": \"SmallFilesTest\", "
      + "\"fields\": ["
      + "{\"name\": \"" + FIELD_FILENAME + "\", \"type\": \"string\"},"
      + "{\"name\": \"" + FIELD_CONTENTS + "\", \"type\": \"bytes\"}]}";

    public static final Schema SCHEMA = new Schema.Parser().parse(SCHEMA_JSON);

    public static void writeToAvro(File srcPath, OutputStream outputStream) throws IOException {
        DataFileWriter<Object> writer =
            new DataFileWriter<Object>(new GenericDatumWriter<Object>()).setSyncInterval(100);
        writer.setCodec(CodecFactory.snappyCodec());
        writer.create(SCHEMA, outputStream);
        // Pack every file in the source directory (non-recursive), one record per file.
        for (Object obj : FileUtils.listFiles(srcPath, null, false)) {
            File file = (File) obj;
            String filename = file.getAbsolutePath();
            byte[] content = FileUtils.readFileToByteArray(file);
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put(FIELD_FILENAME, filename);
            record.put(FIELD_CONTENTS, ByteBuffer.wrap(content));
            writer.append(record);
            // Print the MD5 of each file so the round trip can be verified on read.
            System.out.println(file.getAbsolutePath() + ": " + DigestUtils.md5Hex(content));
        }
        IOUtils.cleanup(null, writer);
        IOUtils.cleanup(null, outputStream);
    }

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        File sourceDir = new File(args[0]);
        Path destFile = new Path(args[1]);
        OutputStream os = hdfs.create(destFile);
        writeToAvro(sourceDir, os);
    }
}
```
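Packaged with its dependencies into a jar, the writer might be invoked along these lines (the jar name and paths are illustrative): hadoop jar avro-demo.jar Demo /tmp/small-files /user/hadoop/smallfiles.avro. Every file directly under the local directory then becomes one record in a single Avro container file on HDFS, so the NameNode tracks one file instead of many.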
The following program reads the packed records back from the Avro file in HDFS and prints each original file name together with the MD5 digest of its contents:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Demo {
    // Field names must match the schema used by the writer.
    private static final String FIELD_FILENAME = "filename";
    private static final String FIELD_CONTENTS = "contents";

    public static void readFromAvro(InputStream is) throws IOException {
        // The schema is embedded in the Avro file header, so none is passed in here.
        DataFileStream<Object> reader =
            new DataFileStream<Object>(is, new GenericDatumReader<Object>());
        for (Object o : reader) {
            GenericRecord r = (GenericRecord) o;
            System.out.println(r.get(FIELD_FILENAME) + ": "
                + DigestUtils.md5Hex(((ByteBuffer) r.get(FIELD_CONTENTS)).array()));
        }
        IOUtils.cleanup(null, is);
        IOUtils.cleanup(null, reader);
    }

    public static void main(String... args) throws Exception {
        Configuration config = new Configuration();
        FileSystem hdfs = FileSystem.get(config);
        Path destFile = new Path(args[0]);
        InputStream is = hdfs.open(destFile);
        readFromAvro(is);
    }
}
```
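To restore the packed files rather than only verify their digests, each record's payload can be written back out. Below is a minimal sketch under the same schema assumptions (the RestoreDemo class and restoreFiles helper are hypothetical, not from the article); it defensively copies only the valid region of the ByteBuffer, since array() can expose a backing array larger than the payload:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.io.FileUtils;

public class RestoreDemo {
    // Hypothetical helper: unpack every record in the Avro container into outDir.
    public static void restoreFiles(InputStream is, File outDir) throws IOException {
        DataFileStream<GenericRecord> reader =
            new DataFileStream<GenericRecord>(is, new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader) {
            ByteBuffer buf = (ByteBuffer) r.get("contents");
            // Copy only the remaining bytes; duplicate() leaves the buffer's position intact.
            byte[] bytes = new byte[buf.remaining()];
            buf.duplicate().get(bytes);
            // Recreate the file under outDir using just the original base name.
            String baseName = new File(r.get("filename").toString()).getName();
            FileUtils.writeByteArrayToFile(new File(outDir, baseName), bytes);
        }
        reader.close();
    }
}
```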