[Repost] Flume-ng + Kafka + Storm + HDFS real-time system setup

Last Update:2014-11-19 Source: Internet

Author: User

Tags syslog zookeeper zookeeper client

http://blog.csdn.net/weijonathan/article/details/18301321

I have long wanted to get hands-on with Storm real-time computing. Recently, in a chat group, I saw a document by Luobao, a colleague in Shanghai, on building a Flume + Kafka + Storm real-time log stream system, and I worked through it myself. Luobao's articles leave out a few points that need attention and contain a few errors, which I correct here. Most of the content is quoted from Luobao's article, so my thanks go to Luobao. @Morning Color Starry Sky EE also gave me a lot of help while I was writing this, so thanks to him as well.

Before writing this, I discussed the design with some people in the group. Some said that Storm alone can do real-time processing, so there is no need for this much machinery. But anyone who develops software knows the value of modular design. This design was chosen for two reasons:

First, the system is modular and responsibilities are divided clearly along the pipeline "data acquisition, data access, stream computing, data output/storage":

1). Data acquisition

Collects data from each node in real time; implemented with Cloudera's Flume.

2). Data access

Because the speed of data acquisition and the speed of data processing are not necessarily in step, a message middleware is added as a buffer; Apache Kafka is used here.

3). Stream computing

Analyzes the collected data in real time; Apache Storm is used here.

4). Data output

Persists the results of the analysis; tentatively using MySQL.

Second, with the modules separated, if Storm goes down, data acquisition and data access keep running and no data is lost; once Storm is back up, stream computing can continue.
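The buffering argument can be illustrated with a small sketch. This is only a toy model, not Kafka or Storm code: the class name BufferedPipeline and the processor_up flag are hypothetical, an in-memory deque stands in for the message middleware, and a boolean stands in for whether the stream processor is running.

```python
from collections import deque


class BufferedPipeline:
    """Toy model: a collector appends to a buffer; a processor drains it when it is up."""

    def __init__(self):
        self.buffer = deque()      # stands in for the message middleware
        self.processed = []        # stands in for the stream-computing output
        self.processor_up = True   # stands in for whether the processor is running

    def collect(self, event):
        # Collection never depends on the processor being alive.
        self.buffer.append(event)

    def drain(self):
        # The processor only consumes while it is up; otherwise events wait in the buffer.
        while self.processor_up and self.buffer:
            self.processed.append(self.buffer.popleft())


pipeline = BufferedPipeline()
pipeline.collect("line 1")
pipeline.drain()

pipeline.processor_up = False      # the processor goes down
pipeline.collect("line 2")
pipeline.collect("line 3")
pipeline.drain()                   # nothing is consumed, and nothing is dropped
backlog_while_down = list(pipeline.buffer)

pipeline.processor_up = True       # the processor comes back
pipeline.drain()                   # the backlog is processed in arrival order
```

While the processor is down the two new lines simply wait in the buffer, and they are processed in order once it returns.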

So let's take a look at the overall architecture diagram.

Detailed description of each component and installation configuration:

Operating system: Ubuntu

Flume

Flume is Cloudera's distributed, reliable, and highly available log collection system. It supports customizable data senders within a logging system for collecting data, and it can also perform simple processing on the data and write it to various (customizable) data recipients.

Typical architecture for Flume:

Flume data source and output mode:

Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, in two modes: TCP and UDP), and exec (command execution). Our system currently uses exec for log collection.

Flume's data recipients can be console, text (file), dfs (HDFS file), RPC (Thrift-RPC), syslogTCP (TCP syslog logging system), and so on. In our system the data is received by Kafka.

Flume Download and Documentation:

http://flume.apache.org/

Flume Installation:

$ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local

Flume Start command:

$ bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console

Kafka

Kafka is a high-throughput distributed publish-subscribe messaging system that has the following features:

Provides message persistence through O(1) disk data structures, which keep performance stable over the long term even with terabytes of stored messages.
High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
Supports partitioning messages across Kafka servers and consumer clusters.
Supports parallel data loading into Hadoop.

Kafka aims to provide a publish-subscribe solution that can handle all the activity stream data of a consumer-scale website. Such activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. That is a workable solution for log data feeding offline analysis systems such as Hadoop, but it falls short when real-time processing is required. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time consumption across a cluster of machines.

Kafka's distributed publish-subscribe architecture, taken from the official Kafka website:

The architecture diagram in Luobao's article looks like this:

The two are in fact not much different: the diagram on the official website simply represents Kafka concisely as a single Kafka Cluster, while Luobao's diagram is more detailed.

Kafka version: 0.8.0

Kafka Download and Documentation: http://kafka.apache.org/

Kafka Installation:

> tar xzf kafka-<version>.tgz
> cd kafka-<version>
> ./sbt update
> ./sbt package
> ./sbt assembly-package-dependency

Start and test commands:

(1) Start server

> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties

This follows the tutorial on the official website. Kafka ships with a built-in ZooKeeper, but in my actual deployment I use a separate ZooKeeper cluster, so I did not run the first command; it is shown here only for reference.

To use a standalone ZooKeeper cluster, edit the server.properties file and set zookeeper.connect to the IP and port of the standalone cluster:

zookeeper.connect=nutch1:2181

(2) Create a topic

> bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
> bin/kafka-list-topic.sh --zookeeper localhost:2181

(3) Send some messages

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

(4) Start a consumer

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

kafka-console-producer.sh and kafka-console-consumer.sh are just the command-line tools that ship with Kafka. They are used here to test that producing and consuming work normally, which verifies that the pipeline is correct.

In real development you would write your own producers and consumers.

Kafka installation can also refer to the article I wrote earlier: http://blog.csdn.net/weijonathan/article/details/18075967

Storm

Twitter has officially open-sourced Storm, a distributed, fault-tolerant, real-time computation system hosted on GitHub under the Eclipse Public License 1.0. Storm was developed by BackType, which is now part of Twitter. The latest version on GitHub is Storm 0.5.2, written mostly in Clojure.

The main features of Storm are as follows:

A simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
Support for many programming languages. You can use a variety of languages on top of Storm; Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
Fault tolerance. Storm manages failures of worker processes and nodes.
Horizontal scalability. Computation runs in parallel across multiple threads, processes, and servers.
Reliable message handling. Storm guarantees that each message is processed at least once. When a task fails, Storm retries the message from the message source.
Fast. The system is designed so that messages are processed quickly, using ØMQ as its underlying message queue. (Version 0.9.0.1 supports both ØMQ and Netty.)
Local mode. Storm has a "local mode" that fully simulates a Storm cluster in-process, which lets you develop and unit test quickly.
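The "at least once" guarantee in the list above can be sketched in a few lines. This is an illustration only and does not use Storm's API: process_at_least_once, flaky_handler, and max_attempts are hypothetical names, and a simple retry loop stands in for replaying a message from the source.

```python
def process_at_least_once(messages, handler, max_attempts=3):
    """Replay each message until the handler succeeds or attempts run out.

    Returns (message, attempts_used) pairs for the messages that succeeded.
    """
    acked = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
            except Exception:
                continue               # failure: replay the same message
            acked.append((message, attempt))
            break
    return acked


seen = []            # every delivery the handler observed, including failed ones
failed_once = set()  # messages that have already failed one time


def flaky_handler(message):
    seen.append(message)
    if message == "b" and message not in failed_once:
        failed_once.add(message)
        raise RuntimeError("simulated task failure")


result = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Here the handler sees "b" twice. At-least-once delivery means no message is lost, but a handler with side effects may apply them more than once.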

For reasons of length, the detailed installation steps are not repeated here; see my earlier article: http://blog.csdn.net/weijonathan/article/details/17762477

Now for the main event: integrating the frameworks with each other.

Flume and Kafka Integration

1. Download flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin

2. Extract the flume-conf.properties file from the plugin.

Modify the #source section of the file:

producer.sources.s.type = exec
producer.sources.s.command = tail -f -n+1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c

Change the value of every topic setting to test.

Put the modified configuration file into the flume/conf directory.

From the project, copy the following jar packages into Flume's lib directory:

Note: the flumeng-kafka-plugin.jar package was later moved to the package directory in the GitHub project. Readers who cannot find it can get it from there.

After completing the steps above, let's test whether the Flume + Kafka pipeline works end to end.

Start Flume, then start Kafka, following the startup steps given earlier. Then use Kafka's kafka-console-consumer.sh script to check whether Flume is delivering data to Kafka.

The above is the data from my test.log file, captured by Flume and delivered to Kafka, which shows that our Flume-to-Kafka pipeline works.

Remember the flowchart at the beginning? One path goes from Flume to Kafka, and another goes to HDFS. So far we have not covered how to write to Kafka and HDFS at the same time.

Flume supports replicating data to several destinations at once. The replication flowchart is taken from the official Flume website; the User Guide is at: http://flume.apache.org/FlumeUserGuide.html

To set up replication, look at the following configuration:

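# Configuration file with 2 channels and 2 sinks. Here we can set up two sinks: one for Kafka, the other for HDFS.
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2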

Set the remaining configuration details according to your own needs; no specific example is given here.
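What the two-channel, two-sink layout does with each event can be sketched as follows. This is a toy model rather than Flume code: the replicate function is hypothetical, and two in-memory queues stand in for the channels c1 and c2 that feed the Kafka and HDFS sinks.

```python
from collections import deque


def replicate(event, channels):
    """Append the same event to every channel, as a replicating fan-out would."""
    for channel in channels.values():
        channel.append(event)


# c1 would feed the Kafka sink and c2 the HDFS sink in the layout described above.
channels = {"c1": deque(), "c2": deque()}
for event in ["log line 1", "log line 2"]:
    replicate(event, channels)

kafka_batch = list(channels["c1"])
hdfs_batch = list(channels["c2"])
```

Each sink then drains its own channel independently, so a slow HDFS write does not hold back delivery to Kafka.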

Integration of Kafka and Storm

1. Download the storm-kafka-0.8-plus plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus

2. Compile and package it with Maven to get storm-kafka-0.8-plus-0.3.0-SNAPSHOT.jar. Note for readers who have reposted this article: the package name given here previously was wrong and has now been corrected. My apologies!

3. Add this jar, together with kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar, to Storm's lib directory (these three jars can be found in the Kafka project).

Note: if the project you are developing needs additional jars, remember to put them into Storm's lib directory as well; for example, if you use MySQL, add mysql-connector-java-5.1.22-bin.jar to Storm's lib.

Then restart Storm.

After completing the steps above, one thing remains: writing your own Storm program that uses the storm-kafka-0.8-plus plugin.

Here I attach a Storm program that I put together; Baidu network disk share address: http://pan.baidu.com/s/1mgp0LLY

First, look at the program's topology-creation code:

Data handling is mainly in the WordCounter class, which only uses simple JDBC to insert rows.
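The attached program is not reproduced here, so the following is only a rough sketch of the same idea (count words, then insert the totals into a database), not the author's WordCounter class. It assumes an in-memory SQLite database in place of MySQL over JDBC, and the function names, the word_count table, and its columns are all hypothetical.

```python
import sqlite3
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def store_counts(conn, counts):
    """Insert one row per word into a hypothetical word_count table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_count (word TEXT PRIMARY KEY, total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO word_count (word, total) VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")   # stand-in for the MySQL database
counts = count_words(["storm kafka flume", "kafka storm", "storm"])
store_counts(conn, counts)
rows = conn.execute("SELECT word, total FROM word_count ORDER BY word").fetchall()
```

A plain insert like this is enough for a single local test run, but it is not safe if the same message is delivered twice or if several workers write at once.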

The program takes a single argument, the topology name. We use local mode here, so we pass no arguments and simply check whether the pipeline works end to end:

storm-0.9.0.1/bin/storm jar storm-start-demo-0.0.1-SNAPSHOT.jar com.storm.topology.MyTopology

Look at the log output: it shows data being inserted into the database.

Then check the database: the inserts succeeded!

Our entire integration is now complete!

But there is a problem here; I wonder whether you have noticed it. @Morning Color Starry Sky EE pointed it out to me, and really I should have thought of it myself.

Since we use Storm for distributed stream computing, the most important concerns in a distributed system are data consistency and avoiding dirty data. The test project I provide is therefore suitable only for testing; it cannot be used as-is for production development.

@Morning Color Starry Sky EE's advice is to build a ZooKeeper-based distributed global lock to ensure data consistency and keep dirty data out.

For the ZooKeeper client framework, we can use Netflix Curator. I have not looked into this part yet, so I will have to stop here.
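The effect such a lock is meant to have can be shown without ZooKeeper at all. The sketch below is an assumption-laden stand-in: threads in one process play the role of distributed workers, threading.Lock plays the role of the global lock, and the SharedCounter class is hypothetical. A real deployment would replace the lock with a ZooKeeper-based one.

```python
import threading


class SharedCounter:
    """A shared total updated by several workers under one lock."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()   # stand-in for a ZooKeeper-based global lock

    def add(self, n):
        # Only one worker at a time may run this read-modify-write sequence.
        with self.lock:
            current = self.value
            self.value = current + n


def worker(counter, times):
    for _ in range(times):
        counter.add(1)


counter = SharedCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every update happens while the lock is held, no worker can overwrite another worker's update with a stale value, and the final total is exactly 4 × 1000.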

Once again, thanks to Luobao and @Morning Color Starry Sky EE!
