Lesson 86: Spark Streaming with a Flume Data Source — Practical Case Notes


First: What is Flume?
Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted in industry. Its initial releases are now collectively known as Flume OG (Original Generation) and belonged to Cloudera. As Flume's feature set expanded, the shortcomings of Flume OG became apparent: a bloated codebase, unreasonable core component design, and non-standard core configuration. In the final OG release, 0.94.0, instability in log transmission was especially serious. To solve these problems, on October 22, 2011, Cloudera completed FLUME-728, a milestone change that refactored the core components, core configuration, and code architecture. The refactored version is collectively known as Flume NG (Next Generation). Another reason for the change was Flume's inclusion in Apache, with Cloudera Flume renamed Apache Flume.

Features of Flume:
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data. It supports customizing various data senders in the log system for data collection, and it can also perform simple processing of the data and write it to a variety of data recipients (such as text files, HDFS, HBase, etc.).
Flume's data flow is driven end to end by events. An event is Flume's basic unit of data: it carries log data (in the form of a byte array) along with header information. Events are generated from data outside the agent; when a source captures an event, it formats it and then pushes it into one or more channels. You can think of a channel as a buffer that holds the event until a sink has finished processing it. The sink is responsible for persisting the log or pushing the event on to another source.
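The source-to-channel-to-sink flow described above can be sketched in plain Python. This is a hypothetical illustration of the concepts, not the real Flume API (Flume is written in Java, and its actual classes differ):

```python
from collections import deque

class Event:
    """Flume's basic unit of data: a byte-array body plus header metadata."""
    def __init__(self, body, headers=None):
        self.body = body
        self.headers = headers or {}

class Channel:
    """Buffers events between source and sink, much like a queue."""
    def __init__(self):
        self._queue = deque()
    def put(self, event):
        self._queue.append(event)
    def take(self):
        return self._queue.popleft() if self._queue else None

class Sink:
    """Drains the channel and persists events (here: to an in-memory list)."""
    def __init__(self, channel):
        self.channel = channel
        self.persisted = []
    def process(self):
        event = self.channel.take()
        if event is not None:
            self.persisted.append(event)
        return event

channel = Channel()
sink = Sink(channel)

# The "source" formats raw log lines into events and pushes them into the channel.
for line in ["GET /index 200", "POST /login 302"]:
    channel.put(Event(line.encode("utf-8"), headers={"source": "weblog"}))

# The sink drains the channel; the channel holds each event until the sink takes it.
while sink.process():
    pass

print(len(sink.persisted))  # 2
```

The key point the sketch captures is the decoupling: the source never talks to the sink directly, so either side can stall or fail without immediately affecting the other.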
Reliability of Flume:
When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the receiving agent first writes the event to disk and deletes it only after the data has been transferred successfully; if sending fails, the data can be resent), store-on-failure (also the policy adopted by Scribe: when the data receiver crashes, the data is written locally, and sending continues after the receiver recovers), and best-effort (data is sent to the receiver without any acknowledgement).
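The store-on-failure level can be illustrated with a small sketch (hypothetical class and file names, not Flume internals): while the receiver is down, events are appended to a local spool file; once the receiver recovers, the spool is replayed and cleared.

```python
import os
import tempfile

class StoreOnFailureSender:
    """Toy model of store-on-failure: spool locally while the receiver is down."""
    def __init__(self, spool_path):
        self.spool_path = spool_path

    def send(self, event, receiver_up, delivered):
        if receiver_up:
            delivered.append(event)          # normal delivery
        else:
            with open(self.spool_path, "a") as f:
                f.write(event + "\n")        # receiver crashed: write locally

    def replay(self, delivered):
        """After the receiver recovers, resend everything in the spool."""
        if not os.path.exists(self.spool_path):
            return
        with open(self.spool_path) as f:
            for line in f:
                delivered.append(line.rstrip("\n"))
        os.remove(self.spool_path)

delivered = []
spool = os.path.join(tempfile.mkdtemp(), "spool.log")
sender = StoreOnFailureSender(spool)

sender.send("e1", receiver_up=True, delivered=delivered)
sender.send("e2", receiver_up=False, delivered=delivered)  # receiver is down
sender.send("e3", receiver_up=False, delivered=delivered)
sender.replay(delivered)                                   # receiver recovered

print(delivered)  # ['e1', 'e2', 'e3']
```

End-to-end adds a disk write plus acknowledgement on top of this; best-effort drops the spool entirely, trading durability for throughput.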
Recoverability of Flume:
Recovery also relies on the channel. FileChannel is recommended: events are persisted in the local file system (at the cost of performance).
Some core concepts of Flume:
An agent runs Flume in a JVM. Each machine runs one agent, but a single agent can contain multiple sources and sinks.

    1. Client: produces data; runs in a separate thread.
    2. Source: collects data from the client and passes it to the channel.
    3. Sink: collects data from the channel; runs in a separate thread.
    4. Channel: connects sources and sinks, somewhat like a queue.
    5. Event: can be a log record, an Avro object, and so on.

The agent is Flume's smallest independent unit of operation. An agent is a JVM. A single agent consists of three components: source, sink, and channel, as shown in the figure:

It is important to note that Flume provides a large number of built-in source, channel, and sink types. Different types of sources, channels, and sinks can be freely combined; the combination is driven by user-defined configuration files, which makes it very flexible. For example, a channel can hold events in memory or persist them to the local hard disk, and a sink can write logs to HDFS or HBase, or even forward them to another source. Flume also lets users build multi-level flows, meaning multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes; this is where Flume really shines. As shown in the figure:
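The configuration-file-driven combination looks like this in practice. Below is a minimal Flume NG agent definition (the agent name `a1` and component names `r1`/`c1`/`k1` are arbitrary): a netcat source feeding a memory channel drained by a logger sink.

```properties
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory
# (swap "memory" for "file" to persist events to the local disk instead)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: log events to the console
a1.sinks.k1.type = logger

# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Changing the flow, e.g. swapping the logger sink for an HDFS or Kafka sink, is purely a matter of editing this file; no code changes are required.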

Second: Flume + Kafka + Spark Streaming application scenarios:

1. The Flume cluster collects business information from external systems and delivers the collected information to a Kafka cluster, which then feeds the Spark Streaming framework for computation; after stream processing completes, the final results are sent back to Kafka for storage, as shown in the figure:

2. The Flume cluster collects business information from external systems and delivers the collected information to a Kafka cluster, which then feeds the Spark Streaming framework for computation; after stream processing completes, the final results are sent back to Kafka for storage. At the same time, the final results are displayed graphically through the Ganglia monitoring tool. The architecture is:

3. What we want to build: an interactive 360-degree visualization for Spark Streaming, with an interactive 3D visualization UI. The Flume cluster collects business information from external systems and delivers it to a Kafka cluster, which feeds the Spark Streaming framework for computation. After stream processing, the final results are sent to Kafka for storage and also written to a database (MySQL) and in-memory middleware (Redis, MemSQL); the results are displayed graphically through the Ganglia monitoring tool. The schema is:

Third: There are two ways to get Kafka data into Spark Streaming:

One is the receiver-based approach. This method uses receivers to ingest the data; the receiver implementation uses Kafka's high-level consumer API. For all receivers, the received data is saved on Spark's distributed executors and then processed by jobs launched by Spark Streaming. However, under the default configuration, this approach can lose data on failure. To guarantee zero data loss, you can enable Spark Streaming's write-ahead log (WAL) feature, which saves the received data to a WAL (the WAL can be stored on HDFS), so that on failure the data can be recovered from the WAL rather than lost.
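The WAL idea can be sketched in plain Python (this is a conceptual illustration, not Spark's actual implementation; in Spark Streaming the feature is enabled via the `spark.streaming.receiver.writeAheadLog.enable` configuration): each received record is appended to a durable log before it is handed off, so after a crash the records can be recovered from the log instead of being lost.

```python
import json
import os
import tempfile

def receive(record, wal_path):
    """Durably log the record before acknowledging receipt."""
    with open(wal_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def recover(wal_path):
    """Replay the WAL after a failure to rebuild the received data."""
    if not os.path.exists(wal_path):
        return []
    with open(wal_path) as f:
        return [json.loads(line) for line in f]

wal = os.path.join(tempfile.mkdtemp(), "receiver.wal")
receive({"offset": 0, "msg": "a"}, wal)
receive({"offset": 1, "msg": "b"}, wal)

# ...suppose the executor crashes here; on restart we recover from the WAL:
recovered = recover(wal)
print([r["msg"] for r in recovered])  # ['a', 'b']
```

The cost of this guarantee is that every record is written twice (once to the WAL, once when processed), which is part of why the direct API discussed next is usually preferred.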

The other is the direct API, which uses no receivers. Are data generation and data processing then split across two machines? In fact they operate over the same data, and when the driver and executors run on the same machine, that machine needs to be powerful enough.

The Flume cluster puts the collected data into the Kafka cluster, and Spark Streaming pulls data from Kafka online in real time. Via the direct API, it queries the latest offsets in each Kafka partition and reads each batch as an explicit offset range; even if a read fails, the failed data can be re-read according to its offsets, ensuring the stability of the running application and the reliability of the data.
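The offset-range mechanism can be sketched as follows. This is a toy model, not Spark's `createDirectStream`: the in-memory `broker` dictionary stands in for a real Kafka cluster, and the driver itself tracks which (partition, offset) ranges each batch covers.

```python
# partition id -> list of messages; list index plays the role of the Kafka offset
broker = {0: ["m0", "m1", "m2"], 1: ["n0", "n1"]}

def latest_offsets(broker):
    """Ask the 'broker' for the newest offset in each partition."""
    return {p: len(msgs) for p, msgs in broker.items()}

def read_batch(broker, from_offsets, until_offsets):
    """Read the exact offset range [from, until) for each partition."""
    return {p: broker[p][from_offsets[p]:until_offsets[p]] for p in broker}

committed = {0: 0, 1: 0}           # offsets already processed successfully
until = latest_offsets(broker)     # query the latest offsets per partition
batch = read_batch(broker, committed, until)

# If processing fails, the identical (committed, until) range can simply be
# re-read; only after success do we advance the committed offsets.
committed = until

print(batch)  # {0: ['m0', 'm1', 'm2'], 1: ['n0', 'n1']}
```

Because a batch is defined by its offset range rather than by what a receiver happened to buffer, replaying a failed batch is deterministic, which is the basis of the reliability claim above.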

A few tips:

1. When the Flume cluster writes data into the Kafka cluster, the data may end up stored unevenly: some Kafka nodes hold a great deal of data while others hold little. A follow-up step is to apply a custom distribution algorithm to solve this data-storage imbalance.
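One simple custom-distribution strategy, shown here as a hypothetical sketch (the article does not specify which algorithm is used), is to route each incoming record to the currently least-loaded partition instead of hashing on a key:

```python
def pick_partition(loads):
    """Return the id of the partition carrying the smallest current load."""
    return min(loads, key=loads.get)

# bytes (or record counts) already sitting on each Kafka partition
loads = {0: 120, 1: 10, 2: 55}

assignments = []
for size in [30, 30, 30]:          # three incoming records of equal size
    p = pick_partition(loads)      # always send to the emptiest partition
    assignments.append(p)
    loads[p] += size

print(assignments)  # [1, 1, 2]
```

Key-hash partitioning preserves per-key ordering but can pile data onto hot partitions; least-loaded routing evens out storage at the cost of losing that ordering guarantee, so the right choice depends on whether downstream processing is key-sensitive.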

2. I strongly recommend using the direct API in production environments. In our distribution, the direct API will be further optimized to reduce its latency.


In an actual production environment, distributed log collection takes Kafka as its core.

With Spark Streaming you can handle a wide variety of data source types, such as databases, HDFS, server logs, and network streams. It is more powerful than you might imagine, yet it is often underused; the real reason is a lack of deep understanding of Spark and Spark Streaming themselves.

