Notes on Flume NG (not updated on a regular basis)


 

These notes cover only Flume itself; JVM, HDFS, and HBase tuning are not discussed for now.

 

1. About sources:

1. The spooling directory source is suitable for static files, that is, files that do not change after they are placed in the spool directory;

2. The Avro source can be given more worker threads to improve its throughput;

3. The Thrift source has a problem in practice: if an exception occurs during a batch append, the exception itself is not printed; you only see the message "%s could not append events to the channel." with the component name filled in. When the exception occurs, the source code logs only the component name instead of the underlying cause. This is a bug; it also suggests the Thrift source is rarely used, otherwise the problem would not have survived across so many versions;

4. If one source feeds multiple channels, each channel receives the same data by default: N channels get N copies of every event. Consequently, if any one channel fills up, the overall throughput suffers;
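As a minimal sketch, the default fan-out above corresponds to the replicating channel selector; agent and component names here are illustrative:

```properties
# hypothetical agent "a1": one source replicated into two channels
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2
# "replicating" is the default selector: every event is copied to both channels
a1.sources.r1.selector.type = replicating
# optionally mark c2 as best-effort so a full c2 does not stall the source
a1.sources.r1.selector.optional = c2
```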

5. The official documentation already states that the exec source is asynchronous and may lose data. Prefer tail -F; note that the F is uppercase;
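A minimal exec-source sketch using tail -F, with illustrative agent, channel, and file names:

```properties
# hypothetical agent "a1": exec source tailing a rotating log file
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
# tail -F (capital F) keeps following the file across rotation/re-creation;
# the exec source remains asynchronous and can still lose data on failure
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
```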

 

2. About channels:

1. We recommend the newer composite SpillableMemoryChannel for collection nodes and the memory channel for aggregation nodes, depending on the actual data volume; in general, the memory channel is the better choice for agents whose per-minute data volume is large (the file channel processes roughly 2 MB/s, which varies with machines and environments, so this is for reference only). Once a SpillableMemoryChannel overflows, most of the time is spent in its file-channel part (SpillableMemoryChannel is itself a subclass of the file channel, and the composite channel preserves a degree of event ordering: after the in-memory events are drained, the overflowed events must be taken before memory refills, and memory can overflow again once full...), so performance drops sharply. If that happens on an aggregation node, the consequences are easy to imagine;

2. Two parameters control how much physical memory the memory channel may occupy: byteCapacityBufferPercentage (default 20) and byteCapacity (default 80% of the JVM's maximum available memory). The formula is: byteCapacity = (int) (context.getLong("byteCapacity", defaultByteCapacity).longValue() * (1 - byteCapacityBufferPercentage * 0.01) / byteCapacitySlotSize), so you can clearly tune these two parameters to control memory use. As for byteCapacitySlotSize, the default is 100: physical memory is converted into a number of 100-byte slots, which is easy to manage but may waste space, at least I think so...;
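The slot arithmetic above can be sketched as follows, using integer math that mirrors the Java formula; the parameter values in the example are illustrative, not recommendations:

```python
def effective_slots(byte_capacity: int,
                    buffer_percentage: int = 20,
                    slot_size: int = 100) -> int:
    """Mirror the MemoryChannel sizing formula:
    byteCapacity * (1 - byteCapacityBufferPercentage/100) / byteCapacitySlotSize
    computed with integer arithmetic to avoid float rounding."""
    usable_bytes = byte_capacity * (100 - buffer_percentage) // 100
    return usable_bytes // slot_size

# e.g. byteCapacity of 100 MB with the default 20% buffer and 100-byte slots
print(effective_slots(100_000_000))  # -> 800000
```

So with the defaults, only 80% of the configured byteCapacity is actually available for event bodies, carved into 100-byte slots.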

3. Another useful parameter is "keep-alive": it controls how long a source's put blocks when the channel is full (and, symmetrically, how long a sink's take waits when the channel is empty). The default is 3 s; if that time is exceeded, an exception is thrown. It generally needs no tuning, but it helps in some scenarios. For example, if your workload sends a burst of data at the start of each minute that then tapers off, you can raise this parameter so the channel does not reject puts during the burst;
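A memory-channel sketch pulling the parameters from items 2 and 3 together; the sizes chosen here are illustrative assumptions, not recommendations:

```properties
# hypothetical agent "a1": memory channel with explicit memory bounds
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000          # max events held in the channel
a1.channels.c1.transactionCapacity = 1000 # max events per put/take transaction
a1.channels.c1.byteCapacity = 104857600   # ~100 MB allowed for event bodies
a1.channels.c1.byteCapacityBufferPercentage = 20
# seconds a put blocks on a full channel (or a take on an empty one)
a1.channels.c1.keep-alive = 10
```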

 

3. About sinks:

1. The Avro sink's batch-size can be set higher; the default is 100. Increasing it reduces the number of RPC requests and improves throughput;

2. In the built-in HDFS sink, resolving timestamp escape sequences in the directory path or file prefix is expensive, because the matching is done with regular expressions; I will write a separate article on this problem later;

3. The RollingFileSink's file name cannot be customized, and files cannot be rolled on a schedule, only by time interval; you can write your own sink if you need scheduled file rolling;

4. The timestamp in the HDFS sink's file name cannot be omitted; you can add a prefix and suffix, plus an in-use prefix and suffix for files currently being written. "hdfs.idleTimeout" is meaningful: it closes an HDFS file that has received no writes for the given time. I recommend configuring it everywhere. For example, suppose you use timestamp escapes to split output into different directories and file names and set rollInterval = 0, rollCount = 0, rollSize = 1000000; if the data in one period never reaches rollSize while later data goes into a new file, the old file will stay open forever. Situations like this are easy to miss. "hdfs.callTimeout" is the maximum time allowed for each HDFS operation (read, write, open, close, etc.); "hdfs.threadsPoolSize" sets the number of threads in the pool that performs these operations;
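An HDFS sink sketch combining the roll settings and timeouts discussed above; paths, names, and the timeout values are illustrative assumptions:

```properties
# hypothetical agent "a1": HDFS sink with timestamp escapes and safety timeouts
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.inUsePrefix = .
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 1000000
# close a file after 60 s without writes, so a quiet period is not left open
a1.sinks.k1.hdfs.idleTimeout = 60
# cap each HDFS open/write/flush/close call at 10 s
a1.sinks.k1.hdfs.callTimeout = 10000
a1.sinks.k1.hdfs.threadsPoolSize = 10
```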

5. For the HBase sink (the synchronous one, as opposed to AsyncHBaseSink), the rowkey cannot be customized; the simple serializer can write only one column, and the regex serializer can match multiple columns but may have performance problems. We recommend writing your own HBase sink as needed;

6. Failover and load balancing can be configured for Avro sinks; the components used are the same as those in a sink group. Compression can also be configured on the Avro sink; the receiving Avro source must then be configured to decompress;
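A sketch of a compressed Avro hop between two agents, showing that the sink's compression setting must be mirrored on the source; hostnames, ports, and batch size are illustrative assumptions:

```properties
# hypothetical sending agent "a1": compressed Avro sink
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.batch-size = 1000
a1.sinks.k1.compression-type = deflate

# hypothetical receiving agent "a2": the Avro source must decompress to match
a2.sources = r1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
a2.sources.r1.compression-type = deflate
```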

 

4. About sink groups:

1. Whether you use load_balance or failover, the multiple sinks in a group need to share one channel;

2. If the sinks of a load-balancing group all write directly to the same destination, e.g. HDFS, performance does not improve much: the sink group runs as a single thread whose process method calls each sink in turn to take data from the channel and verify it was handled correctly, so the operations are sequential. If the sinks instead send to next-tier Flume agents, the takes are still sequential, but the downstream agents write in parallel, which is necessarily faster;
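A load-balancing sink-group sketch for the tiered case above, with two Avro sinks sharing one channel; hostnames and ports are illustrative assumptions:

```properties
# hypothetical agent "a1": two Avro sinks load-balanced over one shared channel
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# back off a failed sink for a while, giving failover-like behavior
a1.sinkgroups.g1.processor.backoff = true

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545
```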

3. In fact, load_balance also provides failover to a certain extent; load_balance is recommended for large production environments;

 

5. About monitoring:

1. Monitoring is still relatively weak on my side, but the currently known options are: Cloudera Manager (provided you installed the CDH version), Ganglia (supported natively), HTTP (the JMX counters are serialized into a JSON string and served to the browser via an embedded Jetty server), and do-it-yourself collection (poll the HTTP JSON yourself, or implement the corresponding monitoring interface with your own logic; for details, refer to my earlier blog post);

2. Cloudera Manager's monitoring is very powerful: you can view the real-time channel in/out data rates, real-time channel fill level, sink drain rate, source ingest rate, and so on. The graphs are rich and intuitive, and they reveal a lot about the overall running state and hidden behavior of a Flume agent;

 

6. About Flume startup:

1. Flume component startup order: channels --> sinks --> sources; shutdown order: sources --> sinks --> channels;

2. Configuration files are reloaded automatically when they change: all components are stopped first, and then all components are restarted with the new configuration;

3. About Map<Class<? extends Channel>, Map<String, Channel>> channelCache: it stores all channel objects in the agent. During a dynamic reload a channel may still hold undelivered data, yet the channel needs to be reconfigured, so channel objects, with their data and configuration, are cached from first use onward;

4. You can disable automatic configuration reloading by adding the "no-reload-conf" parameter (set to true) to the startup command;

 

7. About interceptors:

For this component, please refer to my blog post (link);

 

8. About custom components (source, sink, and channel):

1. Customizing channels is not recommended; the bar is relatively high. Custom sources and sinks are both framework-based development: you fill in your own configuration, startup, shutdown, and business logic in the designated methods. I may write an article introducing this when I get the chance;

2. For custom components, trust GitHub: there are many ready-made custom components that can be used directly;

 

9. About Flume NG cluster network topology:

1. Deploy a Flume agent on each collection node and forward to one or more aggregation agents (load-balanced). The collection tier is only responsible for collecting data and sending it on; the aggregation tier can write to HDFS, HBase, Spark, local files, Kafka, etc. This way most changes touch only the aggregation tier, which has fewer agents and less maintenance work;

2. Alternatively, deploy no agent on the collection nodes; data may be sent to, say, MongoDB or Redis first. In that case you need a custom source, or an SDK that retrieves the data and sends it to a Flume agent, so the agent acts both as a "collection node" and as an aggregation node. However, adding a layer in front of the agents adds another layer of risk;

3. Given limited experience and other unknown factors, the first two options are the ones I know best; you can also look at Meituan's Flume architecture (link);

 

These notes are simple and easy to digest.

Incomplete, to be continued... additions are welcome.
