Flume Usage Summary


This article describes my initial experience using Flume to transfer data to MongoDB, covering environment setup, configuration, and points to note.

1 Environment Setup

The setup requires the JDK, Flume-NG, the MongoDB Java driver, and flume-ng-mongodb-sink:
(1) JDK: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
(2) Flume-NG: http://www.apache.org/dyn/closer.cgi/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
(3) MongoDB Java driver jar: https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongo-java-driver/2.13.0/mongo-java-driver-2.13.0.jar
(4) flume-ng-mongodb-sink source: https://github.com/leonlee/flume-ng-mongodb-sink
flume-ng-mongodb-sink has to be compiled into a jar yourself: download the code from GitHub, unzip it, and run mvn package to build it. Maven must be installed to compile the jar, and the build machine needs network access.

2 A Brief Introduction to the Principles

This is a story about a pond. There is a pond with water flowing in at one end and out at the other. The inlet can be fitted with various kinds of pipes, and so can the outlet, and there can be multiple inlets and multiple outlets. The water is called an event, an inlet is called a Source, an outlet is called a Sink, and the pond itself is called a Channel; Source + Channel + Sink together are called an Agent. If necessary, multiple Agents can also be connected together.
For more details, refer to the official documentation: http://flume.apache.org/FlumeDeveloperGuide.html

3 Flume Configuration

(1) Env configuration

Put the mongo-java-driver and flume-ng-mongodb-sink jar packages into Flume's lib directory and add the path to the FLUME_CLASSPATH variable in flume-env.sh;
JAVA_OPTS variable: add -Dflume.monitoring.port=xxxx to expose monitoring information at [hostname:xxxx]/metrics; -Xms sets the initial JVM heap size and -Xmx sets the maximum JVM heap size;
FLUME_HOME variable: sets the Flume root directory;
JAVA_HOME variable: sets the Java root directory.

(2) Log configuration

When debugging, set the log level to DEBUG and write it to a file: flume.root.logger=DEBUG,LOGFILE

(3) Transmission configuration
Use the exec source, the file channel, and flume-ng-mongodb-sink.
Source Configuration Example:

my_agent.sources.my_source_1.channels = my_channel_1
my_agent.sources.my_source_1.type = exec
my_agent.sources.my_source_1.command = python xxx.py
my_agent.sources.my_source_1.shell = /bin/bash -c
my_agent.sources.my_source_1.restartThrottle = 10000
my_agent.sources.my_source_1.restart = true
my_agent.sources.my_source_1.logStdErr = true
my_agent.sources.my_source_1.batchSize = 1000
my_agent.sources.my_source_1.interceptors = i1 i2 i3
my_agent.sources.my_source_1.interceptors.i1.type = static
my_agent.sources.my_source_1.interceptors.i1.key = db
my_agent.sources.my_source_1.interceptors.i1.value = cswuyg_test
my_agent.sources.my_source_1.interceptors.i2.type = static
my_agent.sources.my_source_1.interceptors.i2.key = collection
my_agent.sources.my_source_1.interceptors.i2.value = cswuyg_test
my_agent.sources.my_source_1.interceptors.i3.type = static
my_agent.sources.my_source_1.interceptors.i3.key = op
my_agent.sources.my_source_1.interceptors.i3.value = upsert

Field Description:
With the exec source, the command to execute is python xxx.py. I process the raw log inside xxx.py and print JSON-formatted data that follows the flume-ng-mongodb-sink convention; for update-type operations the JSON must carry an _id field. Each printed line is treated as the body of an event, and I then use interceptors to add custom event headers.
The static interceptor is used to add information to the event header. Here I add db=cswuyg_test, collection=cswuyg_test, and op=upsert; these three keys are part of the convention with flume-ng-mongodb-sink and specify the MongoDB database name, the collection name, and that the operation type is upsert.
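The original xxx.py is not shown in this article; purely as an illustration of the convention described above, here is a minimal Python sketch (the field names pv and first column key are made up, and the real script might tail a log file instead of reading stdin) that turns raw log lines into JSON documents, one per line on stdout, where each printed line becomes the body of a Flume event:

#!/usr/bin/env python
# Hypothetical stand-in for xxx.py: convert raw log lines into the JSON
# documents that flume-ng-mongodb-sink expects as event bodies.
import json
import sys
import time


def to_document(line):
    """Parse one raw, tab-separated log line into a document (fields are made up)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None
    return {
        "_id": parts[0],               # required for upsert-type operations
        "pv": int(parts[1]),           # example numeric field
        "_s": int(time.time() * 1000)  # example timestamp field (milliseconds)
    }


if __name__ == "__main__":
    for raw in sys.stdin:
        doc = to_document(raw)
        if doc is None:
            continue
        # One JSON document per line; the exec source turns each line
        # printed to stdout into the body of a Flume event.
        print(json.dumps(doc))
        sys.stdout.flush()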

Channel Configuration Example:

my_agent.channels.my_channel_1.type = file
my_agent.channels.my_channel_1.checkpointDir = /home/work/flume/file-channel/my_channel_1/checkPoint
my_agent.channels.my_channel_1.useDualCheckpoints = true
my_agent.channels.my_channel_1.backupCheckpointDir = /home/work/flume/file-channel/my_channel_1/checkPoint2
my_agent.channels.my_channel_1.dataDirs = /home/work/flume/file-channel/my_channel_1/data
my_agent.channels.my_channel_1.transactionCapacity = 10000
my_agent.channels.my_channel_1.checkpointInterval = 30000
my_agent.channels.my_channel_1.maxFileSize = 4292870142
my_agent.channels.my_channel_1.minimumRequiredSpace = 524288000
my_agent.channels.my_channel_1.capacity = 100000

Field Description:

The parameter to note is capacity, which specifies how many events the channel can hold. You need to set an appropriate value based on your log volume; if you use the file channel and disk space is sufficient, it can be set as large as practical.
dataDirs specifies where the channel stores its data. If possible, choose a disk that is not under heavy I/O load, and separate multiple disk directories with commas.

Sink Configuration Example:

my_agent.sinks.my_mongo_1.type = org.riderzen.flume.sink.MongoSink
my_agent.sinks.my_mongo_1.channel = my_channel_1
my_agent.sinks.my_mongo_1.model = dynamic
my_agent.sinks.my_mongo_1.batch = 10
my_agent.sinks.my_mongo_1.timestampField = _s

Field Description:

The model is set to dynamic, which means the MongoDB database and collection names are taken from the event header. The timestampField setting converts the value of the specified key in the JSON string into MongoDB's Date type. flume-ng-mongodb-sink does not support nested key specifications (such as _s.y), but that can be added by modifying the sink code yourself.

Agent Configuration Example:

my_agent.channels = my_channel_1
my_agent.sources = my_source_1
my_agent.sinks = my_mongo_1

(4) Start

You can write a control.sh script to control the startup, shutdown, and restart of Flume.
Launch Demo:
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume.conf -n agent1 > ./start.log 2>&1 &


From this point on, log data flows out of the log file, is read by xxx.py, enters the file channel, and is then picked up by flume-ng-mongodb-sink and written to the destination MongoDB cluster.
With the basic pipeline working, the next step is to adjust xxx.py and enhance flume-ng-mongodb-sink.

4 Other Notes

1. Monitoring: the officially recommended monitoring tool is Ganglia (http://sourceforge.net/projects/ganglia/), which provides a graphical interface.

2. Version changes: starting with 1.x, Flume no longer uses ZooKeeper, and it provides E2E (end-to-end) support for data reliability; the refactoring removed the earlier DFO (store on failure) and BE (best effort) modes. E2E means that an event is deleted from the channel only after it is guaranteed to have been passed to the next agent or to the final destination, but nothing is said about preventing data loss before the data enters the channel; with something like the exec source, the user has to provide that guarantee themselves.

3. Closing the plug-in process: when using the exec source, restarting Flume does not close the old plug-in process (the command spawned by the exec source); it has to be shut down manually.

4. The exec source does not guarantee that data is not lost, because it only pours water into the pond regardless of the pond's condition; see the Warning section of the exec source in https://flume.apache.org/FlumeUserGuide.html#exec-source. The spooling directory source, which monitors a directory, is an alternative worth considering, but note that you cannot rename a watched file, cannot overwrite a file with the same name, and must not drop half-written files into the directory. After a file has been transferred it is renamed to xx.COMPLETED, so a scheduled cleanup script is needed to remove these files. Restarting can also produce duplicate events, because files that were only partially transferred have not yet been marked as completed.

5. Transmission bottleneck: when using Flume + MongoDB to reliably transfer large volumes of data, the bottleneck tends to appear on the MongoDB side, especially for update-type operations.

6. Modifications needed in the current flume-ng-mongodb-sink plug-in: (1) make update support $setOnInsert; (2) fix the bug where an exception is raised when the $set or $inc of an update is empty; (3) fix the bug where, during a bulk insert, the remaining logs in the same batch are discarded because one of the logs triggers a duplicate-key exception.
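The sink itself is written in Java, so the following is only an illustration, in PyMongo, of the upsert semantics that changes (1) and (2) aim for; the connection address, collection, and field names are assumptions, not the sink's actual code:

# Illustration only (PyMongo, not the sink's Java code): the kind of upsert
# the modified flume-ng-mongodb-sink would issue for one event.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed address
coll = client["cswuyg_test"]["cswuyg_test"]         # db/collection taken from the event header

def upsert(doc):
    update = {
        "$set": {k: v for k, v in doc.items() if k not in ("_id", "pv")},
        "$inc": {"pv": doc["pv"]} if "pv" in doc else {},
        "$setOnInsert": {"first_seen": doc["_s"]},   # applied only when the upsert inserts
    }
    # Drop empty operators: MongoDB rejects an update with an empty $set or $inc.
    update = {op: fields for op, fields in update.items() if fields}
    coll.update_one({"_id": doc["_id"]}, update, upsert=True)

upsert({"_id": "example-key", "pv": 3, "_s": 1430000000000})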

7. Flume and Fluentd are very similar, but Flume comes from the Hadoop ecosystem and is more popular there, so I chose Flume.

8. Batch deployment: first package the JDK and Flume into tar archives, then use the Python paramiko library to send the tar packages to each machine, unpack them, and start the service.
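As a rough sketch of that deployment step (the host list, paths, credentials, and control.sh location are all assumptions), a paramiko-based script could look like this:

# Hypothetical batch-deployment sketch using paramiko: push a tar package
# to each machine, unpack it, and start Flume.
import paramiko

HOSTS = ["192.168.0.11", "192.168.0.12"]      # assumed machine list
TARBALL = "/tmp/flume_bundle.tar.gz"          # assumed local package (JDK + Flume)
REMOTE_TAR = "/home/work/flume_bundle.tar.gz"

def deploy(host):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="work", password="xxxx")   # assumed credentials
    try:
        sftp = client.open_sftp()
        sftp.put(TARBALL, REMOTE_TAR)                         # upload the package
        sftp.close()
        # Unpack and start; the control script is assumed to exist in the package.
        cmd = "cd /home/work && tar xzf flume_bundle.tar.gz && ./flume/control.sh start"
        stdin, stdout, stderr = client.exec_command(cmd)
        print(host, stdout.read(), stderr.read())
    finally:
        client.close()

if __name__ == "__main__":
    for h in HOSTS:
        deploy(h)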

Original article: http://www.cnblogs.com/cswuyg/p/4498804.html
References:
1. http://flume.apache.org/FlumeDeveloperGuide.html
2. "Apache Flume: Distributed Log Collection for Hadoop"
