Uploading Avro files to HDFS using Flume


Scenario description: upload the Avro files under a folder to HDFS. The source uses spooldir and the sink uses HDFS. Configure flume.conf as follows:

# Memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir = /home/yang/data/avro
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

# Sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://node1:8020/user/yang/test
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy
agent1.sinks.hdfs-sink1.hdfs.filePrefix = %{basename}
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1
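With the configuration in place, the agent can be started with Flume's standard launcher. A minimal sketch, assuming flume-ng is on the PATH and flume.conf is in the current directory (paths are illustrative):

# Start the agent; --name must match the agent1 prefix used in flume.conf
flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file flume.conf \
  --name agent1 \
  -Dflume.root.logger=INFO,console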

Note: several of the configurations above require special attention.

Source section:

agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

The deserializer defaults to LINE; if it is not set to AVRO, an exception is thrown, because the files here are Avro files.
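If you need a test file to drop into the spool directory, the avro-tools jar can wrap JSON records in an Avro container file. A sketch, assuming avro-tools has been downloaded and that user.avsc and user.json are your own schema and data files (both names are illustrative):

# Convert newline-delimited JSON records into an Avro container file
java -jar avro-tools-1.8.2.jar fromjson --schema-file user.avsc user.json > user.avro
# Move the finished file into the spool directory; the spooldir source
# requires files to be complete and immutable once they appear there
mv user.avro /home/yang/data/avro/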

deserializer.schemaType defaults to HASH; if it is not set to LITERAL, an exception like the following is thrown:

process failed
org.apache.flume.FlumeException: Could not find schema for event

Sink section:

agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy

hdfs.fileType defaults to SequenceFile; if you keep that file type, the data written to HDFS cannot be parsed as Avro, and you get exceptions such as "Not an Avro data file".

hdfs.fileSuffix appends a suffix to the file name; note that the dot (.) in the suffix must not be omitted. Why append a suffix at all? Because in many cases, for example when Spark reads an Avro file, the file suffix is checked first: if the file does not end in .avro, it is assumed not to be an Avro file and an exception is thrown.
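One way to confirm that the sink really produced valid Avro containers is to pull a file back from HDFS and dump it with avro-tools. A sketch; the file name shown is illustrative, since the sink combines %{basename}, a timestamp, and the .avro suffix:

# List what the sink has written
hdfs dfs -ls /user/yang/test
# Fetch one file and dump it; if this prints JSON records, it is valid Avro
hdfs dfs -get /user/yang/test/user.avro.1431223456789.avro .
java -jar avro-tools-1.8.2.jar tojson user.avro.1431223456789.avro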

Note in particular that there is no "hdfs." prefix in front of "serializer", and that this serializer class does not ship with Flume; you need to download the source and package it yourself. The source is in a GitHub project that contains AvroEventSerializer$Builder. Clone the project with git, switch to the cdk-flume-avro-event-serializer directory, run mvn package, and copy the generated jar (in the target directory) into Flume's lib directory, as sketched below.
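The build-and-deploy steps as a shell sketch. The repository URL is a placeholder, since the original link was not preserved, and $FLUME_HOME stands for your Flume installation directory:

# Clone the project hosting AvroEventSerializer$Builder (substitute the real URL)
git clone <repo-url>
cd cdk-flume-avro-event-serializer
# Build the serializer jar
mvn package
# Make the class visible to Flume by copying the jar into its lib directory
cp target/*.jar $FLUME_HOME/lib/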

