Flume + Kafka + Storm + MySQL

I have always wanted to get familiar with Storm real-time computing. Recently I came across a document, shared by Luo Bao (luobao) in a group, on building a Flume + Kafka + Storm real-time log stream system, and I followed it through myself. A few points were left out of luobao's earlier articles and a few mistakes needed fixing, so I make those corrections here; the bulk of the content, though, comes from luobao's write-up. Thanks to the luobao brothers, and also to @J2EE, who helped me a lot while I was writing this article.

While working on this I discussed it with some people in the group. Some asked why not just use Storm directly for real-time processing, without going to all this trouble. But as we all know from software development, modularization matters, and there are two reasons for this design:

On the one hand, it can be modularized, with the functions clearly divided along "data collection - data access - stream computing - data output/storage":

1). Data collection

Collects data from each node in real time; Cloudera Flume is chosen for this.

2). Data Access

Because data collection and data processing do not necessarily run at the same speed, message middleware is added as a buffer; Apache Kafka is used here.

3). Stream computing

Analyzes the collected data in real time, using Apache Storm.

4). Data Output

Persists the analysis results.

On the other hand, with this modular design, if Storm goes down, data collection and data access keep running and no data is lost; once Storm is back up, stream computing resumes.


Next let's look at the overall architecture diagram.


Detailed introduction of each component and installation Configuration:

Operating system: Ubuntu

Flume

Flume is a distributed, reliable, and highly available system for log collection, aggregation, and transport, provided by Cloudera. It supports customizing all kinds of data senders in the log system to collect data; at the same time, Flume can do simple processing on the data and write it to various (customizable) data receivers.

Typical architecture of flume:

Flume data source and output method:

Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog log system, in both TCP and UDP modes), and exec (command execution). In our system, the exec source is currently used for log collection.

Flume's data receivers (sinks) include console, text (file), dfs (HDFS file), RPC (Thrift-RPC), and syslogTCP (the TCP syslog log system). In our system, Kafka receives the messages.
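To give a rough idea of how a Flume agent wires a source, channel, and sink together, here is a minimal sketch of an agent that tails a log file with an exec source and simply logs events to the console; the agent name, file path, and capacity are placeholder values, not the configuration used later in this article:

  # minimal agent: exec source -> memory channel -> logger sink (for quick local testing)
  agent.sources = s1
  agent.channels = c1
  agent.sinks = k1

  agent.sources.s1.type = exec
  agent.sources.s1.command = tail -F /var/log/test.log
  agent.sources.s1.channels = c1

  agent.channels.c1.type = memory
  agent.channels.c1.capacity = 1000

  agent.sinks.k1.type = logger
  agent.sinks.k1.channel = c1

An agent defined like this can then be started with the flume-ng command shown below.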

Flume download and documentation:

http://flume.apache.org/

Flume installation:


  $ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local

Flume startup command:


  $ bin/flume-ng agent --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console


Kafka

Kafka is a high-throughput distributed message publishing and subscription system. It has the following features:

  • Message persistence through an O(1) disk data structure, which maintains stable performance over long periods even with terabytes of stored messages.
  • High throughput: even on very ordinary hardware, Kafka can support 100,000 messages per second.
  • Support for partitioning messages across Kafka servers and distributing consumption across consumer clusters.
  • Support for loading data into Hadoop in parallel.

The purpose of Kafka is to provide a publish/subscribe solution that can handle all the activity stream data of a consumer-scale website. Such activity (web browsing, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of throughput requirements, this data is usually handled by log processing and log aggregation. For log data destined for offline analysis systems like Hadoop, but which also needs real-time processing, this is a workable solution. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time consumption across a cluster of machines.

The kafka distributed subscription architecture is as follows:

The architecture diagram in the luobao brothers' article is as follows:

In fact, there is no big difference between the two. The architecture diagram on the official website only shows Kafka as a Kafka Cluster, and the architecture diagram of the luobao brothers is relatively detailed;

Kafka version: 0.8.0

Kafka download and documentation: http://kafka.apache.org/

Kafka installation:


  > tar xzf kafka-<VERSION>.tgz
  > cd kafka-<VERSION>
  > ./sbt update
  > ./sbt package
  > ./sbt assembly-package-dependency

Start and test commands:

(1) start server


  > bin/zookeeper-server-start.sh config/zookeeper.properties
  > bin/kafka-server-start.sh config/server.properties

These commands come from the official tutorial. Kafka ships with a built-in ZooKeeper, but in my actual deployment I use a separate ZooKeeper cluster, so I did not run the first command; it is listed here just for reference.

To use an independent ZooKeeper cluster, edit the server.properties file and change zookeeper.connect to the IP addresses and port of that cluster.


  zookeeper.connect = <zookeeper-host>:2181

(2) Create a topic


  > bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
  > bin/kafka-list-topic.sh --zookeeper localhost:2181

(3) Send some messages


  > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

(4) Start a consumer


  > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

kafka-console-producer.sh and kafka-console-consumer.sh are just command-line tools that ship with Kafka. They are started here to test whether messages can be produced and consumed normally and to verify that the whole flow works.

In actual development, you must develop your own producers and consumers;
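As a rough sketch of what such a producer might look like against the Kafka 0.8.0 Java API (the class name, broker address, and message text are illustrative, not code from this project):

  import java.util.Properties;
  import kafka.javaapi.producer.Producer;
  import kafka.producer.KeyedMessage;
  import kafka.producer.ProducerConfig;

  public class SimpleLogProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          // Broker list and serializer for the 0.8 producer; adjust the address to your broker.
          props.put("metadata.broker.list", "localhost:9092");
          props.put("serializer.class", "kafka.serializer.StringEncoder");

          Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
          // Send one message to the "test" topic used throughout this article.
          producer.send(new KeyedMessage<String, String>("test", "hello from a custom producer"));
          producer.close();
      }
  }

A consumer is written along similar lines against the high-level consumer API.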

The installation of kafka can also refer to my previous article: http://blog.csdn.net/weijonathan/article/details/18075967

Storm

Twitter has officially open-sourced Storm, a distributed, fault-tolerant real-time computing system hosted on GitHub under the Eclipse Public License 1.0. Storm is a real-time processing system originally developed by BackType, which is now owned by Twitter. The latest version on GitHub is Storm 0.5.2, written largely in Clojure.


Storm has the following features:

  1. Simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
  2. Support for multiple programming languages. You can use various languages on top of Storm; Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement the simple Storm communication protocol.
  3. Fault tolerance. Storm manages worker process and node failures.
  4. Horizontal scalability. Computation runs in parallel across multiple threads, processes, and servers.
  5. Reliable message processing. Storm guarantees that each message is fully processed at least once; when a task fails, it retries the message from the message source.
  6. Fast. The system is designed so that messages are processed quickly, with ØMQ as the underlying message queue. (Version 0.9.0.1 supports both ØMQ and Netty as the transport.)
  7. Local mode. Storm has a "local mode" that fully simulates a Storm cluster in-process, which lets you develop and unit-test quickly.
Due to space constraints, please refer to my earlier article for the specific installation steps: http://blog.csdn.net/weijonathan/article/details/17762477

Next, let's get to the fun part: integrating the frameworks.

Integration of flume and kafka

1. Download the flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin

2. Take the flume-conf.properties file from the plugin

Modify that file, in the #source section:

  producer.sources.s.type = exec
  producer.sources.s.command = tail -f -n +1 /mnt/hgfs/vmshare/test.log
  producer.sources.s.channels = c

Change the value of all topics to test.

Put the modified configuration file in the flume/conf directory.

Also take the jar packages from this plugin project and put them under Flume's lib directory in your environment:


After completing the above steps, let's test whether the Flume + Kafka pipeline works;

Start Flume first, then Kafka, following the startup steps described earlier; then use Kafka's kafka-console-consumer.sh script to check whether Flume has delivered data to Kafka;


The output above is data captured from my test.log file and pushed to Kafka through Flume, which shows that the Flume and Kafka part of the pipeline works;

Do you still remember the flowchart at the beginning? One branch goes through Flume to Kafka, and the other goes to HDFS; we have not yet covered how to write to HDFS at the same time as Kafka;

Flume supports synchronously replicating the same event flow to multiple destinations. The replication flow chart below is taken from the Flume official website; the official User Guide is at: http://flume.apache.org/FlumeUserGuide.html


For how to configure this replication, see the following configuration:


  # A configuration with two channels and two sinks; here the two sinks are Kafka and HDFS
  a1.sources = r1
  a1.sinks = k1 k2
  a1.channels = c1 c2

You can fill in the specific settings according to your own needs.
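As a rough sketch only, such a replicated configuration might be filled in along these lines; the HDFS path and hosts are placeholders, and the Kafka sink's type and topic/broker properties should be taken from the flumeng-kafka-plugin's own bundled flume-conf.properties rather than from here:

  # replicate each event from the source into both channels
  a1.sources.r1.type = exec
  a1.sources.r1.command = tail -f -n +1 /mnt/hgfs/vmshare/test.log
  a1.sources.r1.channels = c1 c2
  a1.sources.r1.selector.type = replicating

  a1.channels.c1.type = memory
  a1.channels.c2.type = memory

  # k1: the Kafka sink provided by the flumeng-kafka-plugin; copy its type and
  # topic/broker settings from the plugin's example configuration
  a1.sinks.k1.channel = c1

  # k2: a standard HDFS sink; the path is a placeholder
  a1.sinks.k2.type = hdfs
  a1.sinks.k2.hdfs.path = hdfs://<namenode>:8020/flume/events
  a1.sinks.k2.channel = c2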

Integration of kafka and storm

1. Download the storm-kafka-0.8-plus plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus

2. Compile it with Maven (mvn package) to get the storm-kafka-0.8-plus-0.3.0-SNAPSHOT.jar package

3. Put that jar, together with kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar (these three jars can be found in the Kafka project), into Storm's lib directory

Note: if your project needs other jars, remember to put them into Storm's lib as well; for example, since MySQL is used here, mysql-connector-java-5.1.22-bin.jar must be added to Storm's lib.

Then we will restart storm;

After completing the above steps, there is one more thing to do: write our own Storm program using the storm-kafka-0.8-plus plugin;

Here I will also share a Storm program I have. Baidu netdisk address: http://pan.baidu.com/s/1bnEdgh5;

First, let's take a look at the code for creating the Topology of the program.
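The original post showed the topology code as a screenshot. As a rough sketch of what wiring a KafkaSpout to a bolt with storm-kafka-0.8-plus typically looks like (the ZooKeeper address, topic, ZK root, and consumer id below are illustrative values, not the exact code from the shared project):

  import backtype.storm.Config;
  import backtype.storm.LocalCluster;
  import backtype.storm.StormSubmitter;
  import backtype.storm.spout.SchemeAsMultiScheme;
  import backtype.storm.topology.TopologyBuilder;
  import storm.kafka.KafkaSpout;
  import storm.kafka.SpoutConfig;
  import storm.kafka.StringScheme;
  import storm.kafka.ZkHosts;

  public class MyTopology {
      public static void main(String[] args) throws Exception {
          // The spout reads Kafka broker metadata from the ZooKeeper ensemble Kafka registers in.
          ZkHosts zkHosts = new ZkHosts("localhost:2181");
          // Topic "test" as used above; the ZK root path and consumer id are illustrative.
          SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "test", "/kafka-storm", "log-consumer");
          spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
          builder.setBolt("word-counter", new WordCounter(), 1).shuffleGrouping("kafka-spout");

          Config conf = new Config();
          if (args.length > 0) {
              // A topology name was given: submit to the real cluster.
              StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
          } else {
              // No argument: run in local mode to check the whole flow end to end.
              LocalCluster cluster = new LocalCluster();
              cluster.submitTopology("kafka-storm-mysql", conf, builder.createTopology());
          }
      }
  }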


Data operations are mainly in the WordCounter class. Here we only use simple JDBC for insertion.
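Again as a hedged sketch only: a bolt along these lines takes each tuple from the spout and inserts it into MySQL over plain JDBC (the connection URL, credentials, and table name are placeholders, and the real WordCounter in the shared project may differ):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.util.Map;

  import backtype.storm.task.TopologyContext;
  import backtype.storm.topology.BasicOutputCollector;
  import backtype.storm.topology.OutputFieldsDeclarer;
  import backtype.storm.topology.base.BaseBasicBolt;
  import backtype.storm.tuple.Tuple;

  public class WordCounter extends BaseBasicBolt {
      private transient Connection conn;

      @Override
      public void prepare(Map stormConf, TopologyContext context) {
          try {
              Class.forName("com.mysql.jdbc.Driver");
              // Placeholder connection details; point these at your own MySQL instance.
              conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root");
          } catch (Exception e) {
              throw new RuntimeException(e);
          }
      }

      @Override
      public void execute(Tuple input, BasicOutputCollector collector) {
          // Each tuple carries one line emitted by the KafkaSpout (StringScheme field "str").
          String line = input.getString(0);
          try {
              PreparedStatement ps = conn.prepareStatement("INSERT INTO log_lines (line) VALUES (?)");
              ps.setString(1, line);
              ps.executeUpdate();
              ps.close();
          } catch (Exception e) {
              throw new RuntimeException(e);
          }
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
          // Terminal bolt: nothing is emitted downstream.
      }
  }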


To run it you only need to pass one argument, the topology name! Local mode is used here, so if you do not pass an argument you can directly check whether the whole flow works;


  storm-0.9.0.1/bin/storm jar storm-start-demo-0.0.1-SNAPSHOT.jar com.storm.topology.MyTopology

Let's take a look at the log. The data is printed out here and inserted into the database.


Check the database: the data has been inserted successfully!


Here our entire integration is complete!

But there is another problem; I don't know whether anyone has spotted it. It is also something @J2EE reminded me of;

Because we use Storm for distributed stream computing, the most important things to watch in distributed computing are data consistency and avoiding dirty data. The test project I provide is therefore only suitable for testing; formal development should not be done this way;

@J2EE recommends using a distributed global lock based on ZooKeeper to ensure data consistency and avoid writing dirty data!

Netflix Curator can be used as the ZooKeeper client framework for this. Since I haven't looked into it yet, I can only note it here.
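For reference, a minimal sketch of acquiring such a global lock with the Curator lock recipe (shown here with the Apache Curator packaging; the connection string and lock path are placeholders, and this is not code from the project above):

  import org.apache.curator.framework.CuratorFramework;
  import org.apache.curator.framework.CuratorFrameworkFactory;
  import org.apache.curator.framework.recipes.locks.InterProcessMutex;
  import org.apache.curator.retry.ExponentialBackoffRetry;

  public class GlobalLockExample {
      public static void main(String[] args) throws Exception {
          // Connect to the same ZooKeeper ensemble that Kafka and Storm use.
          CuratorFramework client = CuratorFrameworkFactory.newClient(
                  "localhost:2181", new ExponentialBackoffRetry(1000, 3));
          client.start();

          // All workers contend for the same ZNode path, so only one writes at a time.
          InterProcessMutex lock = new InterProcessMutex(client, "/locks/wordcount");
          lock.acquire();
          try {
              // critical section: e.g. the JDBC insert from the bolt above
          } finally {
              lock.release();
          }
          client.close();
      }
  }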
