Uploading Avro files to HDFS using Flume
Scenario description: upload the Avro files under a local folder to HDFS. The source is a spooling directory (spooldir) source and the sink is an HDFS sink. Configure flume.conf as follows:
# memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# source
agent1.sources.spooldir-source1.channels = ch1
agent1.sources.spooldir-source1.type = spooldir
agent1.sources.spooldir-source1.spoolDir = /home/yang/data/avro
agent1.sources.spooldir-source1.basenameHeader = true
agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL

# sink
agent1.sinks.hdfs-sink1.channel = ch1
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://node1:8020/user/yang/test
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy
agent1.sinks.hdfs-sink1.hdfs.filePrefix = %{basename}
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = spooldir-source1
agent1.sinks = hdfs-sink1
Note: several of the settings above need special attention.

SOURCE section:

agent1.sources.spooldir-source1.deserializer = AVRO
agent1.sources.spooldir-source1.deserializer.schemaType = LITERAL
The deserializer defaults to LINE; if it is not set to AVRO, an exception is thrown, because the files in our spool directory are Avro data files.

deserializer.schemaType defaults to HASH; if it is not set to LITERAL, the following exception is thrown:

Process failed
org.apache.flume.FlumeException: Could not find schema for event
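For reference, here is a minimal sketch of how a test Avro data file could be generated and dropped into the spool directory. The schema, record values, class name, and file name are made up for illustration, and it assumes the Avro Java library is on the classpath:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteTestAvroFile {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema, purely for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "yang");
        user.put("age", 30);

        // An Avro data file embeds its writer schema in the file header,
        // which is what the spooldir source's AVRO deserializer reads.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("/home/yang/data/avro/users.avro"));
        writer.append(user);
        writer.close();
    }
}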
SINK section:

agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .avro
agent1.sinks.hdfs-sink1.serializer = org.apache.flume.serialization.AvroEventSerializer$Builder
agent1.sinks.hdfs-sink1.serializer.compressionCodec = snappy
hdfs.fileType defaults to SequenceFile. If you leave that default, the data written to HDFS cannot be parsed as Avro, and readers fail with errors such as "not a Avro data file". hdfs.fileSuffix appends a suffix to every file name; note that the dot (.) in ".avro" must not be omitted. The suffix matters because many tools check it first: for example, when Spark reads Avro files it looks at the file extension, and if a file does not end in .avro it assumes the file is not Avro and throws an exception.
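To check the result, one option is a small verification sketch like the following. The NameNode address and directory come from the example config above, the concrete file name is hypothetical, and it assumes the Hadoop client and Avro libraries are on the classpath. It opens a file written by the sink and reads it back with Avro's DataFileStream, which only succeeds if the file really is a valid Avro container:

import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyAvroOnHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new java.net.URI("hdfs://node1:8020"), conf);

        // Replace with an actual file the sink wrote, e.g. <basename>.avro
        Path path = new Path("/user/yang/test/users.avro");

        try (InputStream in = fs.open(path);
             DataFileStream<GenericRecord> reader =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            // Prints the embedded schema, then each record in the container file.
            System.out.println("Schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}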
In particular, note that there is no hdfs. prefix in front of serializer, and this serialization class does not ship with Flume: you have to download the source and build the jar yourself. The source is in a project on GitHub that contains the AvroEventSerializer$Builder class. Clone the project with git, switch to the cdk-flume-avro-event-serializer directory, run mvn package, and copy the generated jar (in the target directory) to Flume's lib directory.
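As an aside, the $Builder in the config value is simply Java's notation for a nested class: the serializer key names a class implementing Flume's EventSerializer.Builder interface, and that builder constructs the actual serializer. A rough, hypothetical outline (class and package names are made up; the real implementation lives in the GitHub project mentioned above) looks like this:

package org.example.flume;

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class MyAvroEventSerializer implements EventSerializer {

    private final OutputStream out;

    private MyAvroEventSerializer(OutputStream out) {
        this.out = out;
    }

    @Override public void afterCreate() throws IOException { /* open an Avro DataFileWriter here */ }
    @Override public void afterReopen() throws IOException { }
    @Override public void write(Event event) throws IOException { /* append the event as an Avro record */ }
    @Override public void flush() throws IOException { }
    @Override public void beforeClose() throws IOException { /* close the DataFileWriter */ }
    @Override public boolean supportsReopen() { return false; }

    // The nested Builder is what the flume.conf serializer key actually names,
    // hence the "$Builder" suffix in the configuration value.
    public static class Builder implements EventSerializer.Builder {
        @Override
        public EventSerializer build(Context context, OutputStream out) {
            return new MyAvroEventSerializer(out);
        }
    }
}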
Reference documents:
[1] http://flume.apache.org/FlumeUserGuide.html
[2] http://stackoverflow.com/questions/21617025/flume-directory-to-avro-avro-to-hdfs-not-valid-avro-after-transfer?rq=1