Uploading Avro files to HDFS using Flume
Scenario description: upload the Avro files under a local folder to HDFS. The source is a spooling directory (spooldir) source and the sink is an HDFS sink. Configure flume.conf as follows:
# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir = /home/yang/data/avro
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://node1:8020/user/yang/test
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy
agent1.sinks.hdfs-sink1.hdfs.filePrefix = %{basename}
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1
Note: several of the settings above need special attention.

SOURCE section:

agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL
The deserializer defaults to LINE; if it is not set to AVRO, an exception is thrown, because the files in our spool directory are Avro data files.

deserializer.schemaType defaults to HASH; if it is not set to LITERAL, the following exception is thrown:

Process failed
org.apache.flume.FlumeException: Could not find schema for event
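For reference, here is a minimal sketch of how a test Avro data file could be generated and dropped into the spool directory. The schema, record values, class name, and file name are made up for illustration, and it assumes the Avro Java library is on the classpath:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteTestAvroFile {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema, purely for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "yang");
        user.put("age", 30);

        // An Avro data file embeds its writer schema in the file header,
        // which is what the spooldir source's AVRO deserializer reads.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("/home/yang/data/avro/users.avro"));
        writer.append(user);
        writer.close();
    }
}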
SINK section:

agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy
hdfs.fileType defaults to SequenceFile. If you leave that default, the data written to HDFS cannot be parsed as Avro, and readers fail with errors such as "not a Avro data file". hdfs.fileSuffix appends a suffix to every file name; note that the dot (.) in ".avro" must not be omitted. The suffix matters because many tools check it first: for example, when Spark reads Avro files it looks at the file extension, and if a file does not end in .avro it assumes the file is not Avro and throws an exception.
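To check the result, one option is a small verification sketch like the following. The NameNode address and directory come from the example config above, the concrete file name is hypothetical, and it assumes the Hadoop client and Avro libraries are on the classpath. It opens a file written by the sink and reads it back with Avro's DataFileStream, which only succeeds if the file really is a valid Avro container:

import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyAvroOnHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new java.net.URI("hdfs://node1:8020"), conf);

        // Replace with an actual file the sink wrote, e.g. <basename>.avro
        Path path = new Path("/user/yang/test/users.avro");

        try (InputStream in = fs.open(path);
             DataFileStream<GenericRecord> reader =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            // Prints the embedded schema, then each record in the container file.
            System.out.println("Schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}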
In particular, note that there is no hdfs. prefix in front of serializer, and this serialization class does not ship with Flume: you have to download the source and build the jar yourself. The source is in a project on GitHub that contains the AvroEventSerializer$Builder class. Clone the project with git, switch to the cdk-flume-avro-event-serializer directory, run mvn package, and copy the generated jar (in the target directory) to Flume's lib directory.
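As an aside, the $Builder in the config value is simply Java's notation for a nested class: the serializer key names a class implementing Flume's EventSerializer.Builder interface, and that builder constructs the actual serializer. A rough, hypothetical outline (class and package names are made up; the real implementation lives in the GitHub project mentioned above) looks like this:

package org.example.flume;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class MyAvroEventSerializer implements EventSerializer {

    private final OutputStream out;

    private MyAvroEventSerializer(OutputStream out) {
        this.out = out;
    }

    @Override public void afterCreate() throws IOException { /* open an Avro DataFileWriter here */ }
    @Override public void afterReopen() throws IOException { }
    @Override public void write(Event event) throws IOException { /* append the event as an Avro record */ }
    @Override public void flush() throws IOException { }
    @Override public void beforeClose() throws IOException { /* close the DataFileWriter */ }
    @Override public boolean supportsReopen() { return false; }

    // The nested Builder is what the flume.conf serializer key actually names,
    // hence the "$Builder" suffix in the configuration value.
    public static class Builder implements EventSerializer.Builder {
        @Override
        public EventSerializer build(Context context, OutputStream out) {
            return new MyAvroEventSerializer(out);
        }
    }
}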
Reference documents:
[1] http://flume.apache.org/FlumeUserGuide.html
[2] http://stackoverflow.com/questions/21617025/flume-directory-to-avro-avro-to-hdfs-not-valid-avro-after-transfer?rq=1