Flume-NG: Some Precautions


This article covers only Flume itself; tuning of the JVM, HDFS, HBase and so on is not discussed.

I. About Sources:

1. The spooling directory source is suitable for static files, i.e. files whose contents do not change after they are dropped into the directory;

2. For the Avro source, you can increase the number of worker threads appropriately to improve throughput;
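A minimal sketch of this (agent and component names `a1`, `r1`, `c1` are hypothetical; `threads` caps the Avro server's worker threads):

```properties
# Hypothetical agent "a1" with an Avro source "r1"
a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4545
# raise the maximum worker threads to improve throughput
a1.sources.r1.threads = 8
a1.sources.r1.channels = c1
```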

3. Note a pitfall with the Thrift source: when a batch append fails, the exception content is not printed; all you see is "Thrift source %s could not append events to the channel." This is because the source code does not log the caught exception itself, only the component name. It is a bug in the source, and the fact that it has survived so many versions suggests very few people use the Thrift source;

4. If one source feeds multiple channels, the default (replicating) behavior copies each batch to every one of the N channels, so a single full channel will drag down the overall speed;

5. The exec source is documented as asynchronous and may lose data. If you use it, prefer `tail -F` (note the uppercase F, which keeps following the file across rotations);
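As a sketch (agent name and log path are illustrative), an exec source tailing a log with the uppercase-F variant:

```properties
# Hypothetical agent "a1": exec source tailing a rotating log file.
# Remember the data-loss caveat above: exec is asynchronous.
a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
```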

II. About Channels:

1. For collection-tier nodes I recommend the new composite SpillableMemoryChannel; for aggregation-tier nodes I recommend the memory channel. In terms of actual data volume, any Flume agent handling more than roughly 120 MB per minute should generally use a memory channel (my own measurement put the file channel's processing rate at about 2 MB/s; different machines and environments will vary, so treat this only as a reference). The reason is that once such an agent's channel overflows, most of its time is spent in the file channel (SpillableMemoryChannel is itself a subclass of the file channel, and the composite channel guarantees event ordering, so after the in-memory data is read the overflowed data must be drained; meanwhile memory may fill up and overflow again...), and performance drops sharply. If that happens on an aggregation node, the consequences are easy to imagine;
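A hedged sketch of a collection-tier SpillableMemoryChannel (names and paths are hypothetical; the overflow store uses the file channel's checkpoint/data directories):

```properties
# Hypothetical collection-tier agent "a1": events spill to disk
# once the in-memory portion (memoryCapacity events) is full
a1.channels = c1
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /flume/checkpoint
a1.channels.c1.dataDirs = /flume/data
```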

2. The memory channel's share of physical memory is controlled by two parameters: byteCapacityBufferPercentage (default 20) and byteCapacity (default 80% of the JVM's maximum available memory). The calculation in the source is: byteCapacity = (int) ((context.getLong("byteCapacity", defaultByteCapacity).longValue() * (1 - byteCapacityBufferPercentage * 0.01)) / byteCapacitySlotSize). Clearly these two parameters are the knobs to adjust. As for byteCapacitySlotSize, it defaults to 100 and divides the byte capacity into fixed-size slots, which makes accounting easy but may waste space, at least in my view;
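The formula above can be sanity-checked with a quick sketch (plain Python mirroring the Java expression; the 100 MB figure is just illustrative):

```python
def usable_slots(byte_capacity, buffer_percentage=20, slot_size=100):
    """Mirror MemoryChannel's byteCapacity computation:
    reserve buffer_percentage% as headroom, then divide the rest
    into slot_size-byte slots (slot_size defaults to 100)."""
    return int(byte_capacity * (1 - buffer_percentage * 0.01) / slot_size)

# With byteCapacity = 100 MB and the 20% default buffer, the channel
# keeps 80 MB of event-body budget, managed as 100-byte slots.
print(usable_slots(100 * 1024 * 1024))  # -> 838860
```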

3. There is a useful parameter, "keep-alive". It controls how long a put blocks when the channel is full (affecting the source) and how long a take blocks when it is empty (affecting the sink) before an exception is thrown; the default is 3 s. It usually needs no tuning, but it helps in some scenarios. For example, if your traffic arrives in a burst at the start of every minute and tapers off afterwards, raising this parameter lets the pipeline ride out the moments when the channel is briefly full;
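A sketch of raising it on a memory channel (names and values are illustrative):

```properties
# keep-alive: seconds a put/take blocks on a full/empty channel
# before throwing an exception (default 3); raised for bursty traffic
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.keep-alive = 10
```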

III. About Sinks:

1. The Avro sink's batch-size (default 100) can be set larger; increasing it reduces the number of RPC calls and improves performance;
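A sketch (hostname and sizes are illustrative):

```properties
# Larger batches mean fewer Avro RPC round-trips per event
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.batch-size = 1000
```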

2. The built-in HDFS sink's timestamp parsing for directory or file-name prefixes is very costly, because it is regex-based. You can modify the source to replace the time-parsing function and greatly improve performance; I will write a separate article explaining this problem later;

3. The RollingFileSink does not allow custom file names and cannot roll files on a schedule, only by time interval; you can write a custom sink to roll files at fixed times;

4. In the HDFS sink, the timestamp portion of the file name cannot be omitted, unlike the prefix, suffix, in-use markers and other parts. The "hdfs.idleTimeout" parameter is very useful and worth configuring: it specifies how long a file being written on HDFS may sit idle before it is closed. For example, suppose you parse timestamps into per-period directories and file names and set rollInterval=0, rollCount=0, rollSize=1000000; if the data in a period never reaches rollSize and subsequent writes go to a new file, the old file stays open forever. Situations like this are easy to miss. "hdfs.callTimeout" specifies the maximum time allowed for each HDFS operation (read, write, open, close, etc.); each operation runs on a thread from the pool whose size is set by "hdfs.threadsPoolSize";
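The scenario above can be sketched as (namenode address, values and names are illustrative):

```properties
# Size-only rolling: without idleTimeout, a file that never reaches
# rollSize in its time period would stay open indefinitely
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/%Y%m%d
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 1000000
# close an in-progress file after 60 idle seconds
a1.sinks.k1.hdfs.idleTimeout = 60
# cap each HDFS open/write/flush/close call at 10 s
a1.sinks.k1.hdfs.callTimeout = 10000
a1.sinks.k1.hdfs.threadsPoolSize = 10
```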

5. For the HBase sink (the non-asynchronous one, as opposed to AsyncHBaseSink), the row key cannot be customized, and one serializer can write only one column; having a serializer match multiple columns via regex may cause performance problems. I suggest writing your own HBase sink to fit your needs;

6. Avro sinks can be configured for failover and load balancing, using the same components as in sink groups. Compression can also be configured on the sink; the matching decompression must then be configured on the receiving Avro source;
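A sketch of a compressed hop, with both ends configured to match (agent names and address are hypothetical):

```properties
# Compression must be enabled on BOTH ends of the hop.
# Sending agent a1:
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.compression-type = deflate

# Receiving agent a2:
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.compression-type = deflate
a2.sources.r1.channels = c1
```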

IV. About Sink Groups:

1. Whether you use load_balance or failover, the multiple sinks in a group must share one channel;

2. With load_balance, if all the sinks write directly to the same destination, such as HDFS, performance will not improve noticeably, because the sink group's processing method is single-threaded: it calls each sink in turn to take data from the channel and checks that the take succeeded, so the operations are sequential. Sending to next-tier Flume agents is different: the takes are still sequential, but the next tier's writes happen in parallel, so it is necessarily faster;

3. In fact load_balance can, to some extent, also play the role of failover; for large production deployments I recommend load_balance;
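The load_balance arrangement described above can be sketched as (all names and addresses are hypothetical):

```properties
# Two Avro sinks share channel c1 and fan out, round-robin,
# to two next-tier agents whose writes then proceed in parallel
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545
```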

V. About Monitoring:

1. I have done relatively little monitoring myself, but the options currently known to me are: Cloudera Manager (if you installed the CDH distribution), Ganglia (supported natively), HTTP (which simply collects the JMX statistics, wraps them as a JSON string, and serves them via Jetty for viewing in a browser), and finally rolling your own collector (scrape the HTTP endpoint, or implement the corresponding interface with your own logic; see my earlier blog post for details);

2. A brief word on Cloudera Manager-style monitoring, which I have been using recently: it is indeed very powerful. You can see real-time channel in/out rates, real-time channel fill levels, sink rates, source rates and so on. The graphs are rich and intuitive, and they reveal a lot about each Flume agent's overall operation and potential issues;

VI. About Flume Startup:

1. Flume components start in the order channels --> sinks --> sources, and shut down in the order sources --> sinks --> channels;

2. The automatic configuration-reload feature first shuts down all components and then restarts them all;

3. In AbstractConfigurationProvider, the Map<Class<? extends Channel>, Map<String, Channel>> channelCache object always holds all the Channel objects in the agent. This is because, during a dynamic reload, a channel may still contain unconsumed data yet need to be reconfigured, so every channel object's data and configuration information are cached;

4. Disable the automatic configuration-reload feature by adding the "no-reload-conf" parameter to the start command.
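As a sketch, a typical start command with reload disabled (paths and the agent name are illustrative):

```shell
flume-ng agent --conf /etc/flume/conf \
  --conf-file /etc/flume/conf/agent.properties \
  --name a1 --no-reload-conf \
  -Dflume.root.logger=INFO,console
```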

VII. About Interceptors:

Please see my earlier blog post about this component;

VIII. About Custom Components (Sink, Source, Channel):

1. I do not recommend writing a custom channel; the bar is quite high. The other two are framework-style development: implement the specified methods for configuration, start, stop and your business logic. I may write an article introducing this when I get the chance;

2. For custom components, trust GitHub: there are many, many ready-to-use custom components there;

IX. About Flume-NG Cluster Network Topology:

1. Deploy a Flume agent on each collection node, then run one or more aggregation agents (load-balanced). The collectors are responsible only for gathering data and forwarding it to the aggregators, which can write to HDFS, HBase, Spark, local files, Kafka and so on. With this layout, most changes touch only the aggregation tier, there are fewer agents, and maintenance work is lighter;

2. Alternatively, deploy no Flume agent on the collection nodes and have them send to MongoDB, Redis, etc. You then need a custom source, or the SDK, to pull the data into a Flume agent, which can thus act as both a "collection node" and an aggregation node. But this adds a control layer in front, and with it another layer of risk;

3. My experience is limited and I know of no other schemes; of the two above, the first is better. For a reference architecture, see Meituan's design in the reference links below;

All of the above is relatively simple and easy to digest.

Other reference links: Flume Research Experience; Flume-based Log Collection System (I): Architecture and Design; Flume-based Log Collection System (II): Improvement and Optimization
