[Reprint] Building Big Data real-time systems using Flume+kafka+storm+mysql

Source: Internet
Author: User
Tags: syslog, uuid, zookeeper

Original: http://mp.weixin.qq.com/s?__biz=MjM5NzAyNTE0Ng==&mid=205526269&idx=1&sn=6300502dad3e41a36f9bde8e0ba2284d&key=C468684b929d2be22eb8e183b6f92c75565b8179a9a179662ceb350cf82755209a424771bbc05810db9b7203a62c7a26&ascene=0&uin=mjk1odmyntyymg%3d%3d&devicetype=imac+macbookpro9%2c2+osx+osx+10.10.3+build(14D136)&version=11000003&pass_ticket=hkr%2bxkpfbrbviwepmb7sozvfydm5cihu8hwlvne78ykusyhcq65xpav9e1w48ts1

Although I have always disapproved of building a system entirely out of open source software, for startups it offers high efficiency at low cost, so there is real potential here. The risk lies in maintaining the system.

This article describes how to use Flume+Kafka+Storm+MySQL to build a distributed big data streaming architecture, covering the basic architecture, installation, deployment, and so on.

Architecture diagram / data flow graph

(The diagram was drawn in Visio and is too large to display clearly here; friends who need it can leave their email address.)

Introduction to the real-time log analysis system architecture

The system is divided into four main parts:

1). Data acquisition

Responsible for collecting data from each node in real time; Cloudera's Flume is chosen to implement this

2). Data access

Because the speed of data acquisition and the speed of data processing are not necessarily synchronized, a message middleware is added as a buffer, using Apache's Kafka

3). Flow-based computing

Real-time analysis of the collected data, using Apache's Storm

4). Data output

Persisting the results of the analysis, tentatively using MySQL

Detailed description of each component and installation configuration:

Operating system: CentOS 6.4

Flume

Flume is Cloudera's distributed, reliable, and highly available log collection system. It supports customizing the various data senders in the log system to collect data, and it provides the ability to do simple processing on the data and write it out to various (customizable) data recipients.

Typical architecture for Flume:

Flume data sources and output modes:

Flume provides the ability to collect data from data sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog log system, with TCP and UDP support), and exec (command execution). Our system currently uses exec for log capture.

Flume's data recipients can be console, text (file), DFS (HDFS file), RPC (Thrift-RPC), syslogtcp (TCP syslog log system), and so on. In our system, Kafka receives the data.

Flume version: 1.4.0

Flume Download and Documentation:

http://flume.apache.org/

Flume Installation:

$ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local

Flume Start command:

$ bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console

Note: You need to change the configuration file under the conf directory and add the required jar packages to the lib directory.

Kafka

Kafka is a message middleware whose characteristics are:

1. Focus on high throughput rather than other features

2. Aimed at real-time scenarios

3. The state of processed messages is maintained on the consumer side, not on the Kafka server side

4. Distributed: producers, brokers, and consumers are all spread across multiple machines

Architecture diagram for Kafka:

Kafka version: 0.8.0

Kafka Download and Documentation: http://kafka.apache.org/

Kafka Installation:

> tar xzf kafka-<version>.tgz

> cd kafka-<version>

> ./sbt update

> ./sbt package

> ./sbt assembly-package-dependency

Start and test commands:

(1) Start server

> bin/zookeeper-server-start.sh config/zookeeper.properties

> bin/kafka-server-start.sh config/server.properties

(2) Create a topic
> bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test

> bin/kafka-list-topic.sh --zookeeper localhost:2181

(3) Send some messages

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

(4) Start a consumer

> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
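The console scripts above are the quickest end-to-end smoke test. For comparison, a minimal programmatic producer against the same broker and topic, using the Kafka 0.8 Java API, might look like the following sketch (the class name and literal values are illustrative assumptions):

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point at the broker started from config/server.properties
        props.put("metadata.broker.list", "localhost:9092");
        // Encode message payloads as plain strings
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        // Publish one message to the "test" topic created above
        producer.send(new KeyedMessage<String, String>("test", "hello kafka"));
        producer.close();
    }
}

Running it while the console consumer from step (4) is attached should print the message, confirming the broker setup.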

Storm

Storm is a distributed, high-fault-tolerant real-time computing system.

Storm framework composition:

Storm work task topology:
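To picture the moving parts concretely: a spout emits a stream of tuples, bolts process them, and a TopologyBuilder wires both into a topology. A minimal Java sketch follows; every name and the trivial logic are illustrative assumptions, not code from the original project:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class DemoTopology {

    // Spout: the data source of a topology, emitting a stream of tuples
    public static class LineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Bolt: one processing step; this one just prints what it receives
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getString(0));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new LineSpout(), 1);
        builder.setBolt("print", new PrintBolt(), 1).shuffleGrouping("spout");
        // Local mode for testing; on a cluster use StormSubmitter instead
        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}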

Storm version: 0.9.0

Storm Download: http://storm-project.net/

Storm Installation:

First step, install Python 2.7.2

# wget http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz

# tar zxvf Python-2.7.2.tgz

# cd Python-2.7.2

# ./configure

# make

# make install

# vi /etc/ld.so.conf

Step two, install Zookeeper (Kafka comes with Zookeeper; if you choose Kafka's, this step can be omitted)

# wget http://ftp.meisei-u.ac.jp/mirror/apache/dist//zookeeper/zookeeper-3.3.3/zookeeper-3.3.3.tar.gz

# tar zxf zookeeper-3.3.3.tar.gz

# ln -s /usr/local/zookeeper-3.3.3/ /usr/local/zookeeper

# vi ~/.bashrc (set ZOOKEEPER_HOME and add $ZOOKEEPER_HOME/bin to PATH)

Step three, install Java

$ tar zxvf jdk-7u45-linux-x64.tar.gz -C /usr/local

If you use a version of Storm below 0.9, you need to install ZeroMQ and JZMQ.

Fourth step, install ZeroMQ and JZMQ

JZMQ depends on ZeroMQ, so ZeroMQ should be installed first, then JZMQ.

1) Install ZeroMQ (not required for Storm 0.9 and above):

    • # wget http://download.zeromq.org/historic/zeromq-2.1.7.tar.gz

    • # tar zxf zeromq-2.1.7.tar.gz

    • # cd zeromq-2.1.7

    • # ./configure

    • # make

    • # make install

    • # sudo ldconfig (update LD_LIBRARY_PATH)

If the C++ environment is missing: # yum install gcc-c++

The following issue may be encountered: error: cannot link with -luuid, install uuid-dev. This is because no UUID-related package is installed.

The workaround is:

# yum install uuid*

# yum install e2fsprogs*

# yum install libuuid*

2) Install JZMQ (not required for Storm 0.9 and above):

    • # yum install git

    • # git clone git://github.com/nathanmarz/jzmq.git

    • # cd jzmq

    • # ./autogen.sh

    • # ./configure

    • # make

    • # make install

With that, JZMQ is installed. Some issues referenced on other sites were not encountered here; readers who do hit them can consult those references. If the ./autogen.sh step reports: autogen.sh: error: could not find libtool. libtool is required to run autogen.sh, it is because libtool is missing; #yum install libtool* solves it.

If you are installing Storm 0.9 or above, you do not need to install ZeroMQ and JZMQ, but you need to modify storm.yaml to specify Netty as the message transport:

storm.local.dir: "/tmp/storm/data"

storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

Fifth step, install Storm

$ unzip storm-0.9.0-wip16.zip

Note: The standalone version does not need the configuration file modified; when modifying the configuration file for distributed deployment, note that every colon must be followed by a space.

To test if Storm is installed successfully:

1. Download the storm-starter code: git clone https://github.com/nathanmarz/storm-starter.git

2. Compile with mvn -f m2-pom.xml package

If you have not installed Maven, install it with the following steps:
1. Download Maven from its website: http://maven.apache.org/

tar zxvf apache-maven-3.1.1-bin.tar.gz -C /usr/local

Configure the Maven environment variables:

export MAVEN_HOME=/usr/local/maven

export PATH=$PATH:$MAVEN_HOME/bin

Verify that Maven is installed successfully: mvn -v

Modify the storm-starter POM file m2-pom.xml, changing the version of the twitter4j-core and twitter4j-stream dependencies as follows:

<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>[2.2,)</version>
</dependency>

<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>[2.2,)</version>
</dependency>

After compiling, the target folder is generated.

Start Zookeeper:

zkServer.sh start

Start Nimbus, Supervisor, and the UI:

storm nimbus

storm supervisor

storm ui

Use jps to view the startup status.

Go to the target directory and execute:

storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology wordcounttop

Then view http://localhost:8080

Note: The standalone version does not require modifying storm.yaml.

Kafka and Storm integration

1. Download the kafka-storm0.8 plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus

2. After downloading, the project needs debugging and its dependent jar packages located; then repackage it as a jar for our Storm project.

3. Add that jar package plus kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar (these three jars can be found among the storm-kafka-0.8-plus project dependencies).

Note: If the project you are developing requires other jars, remember to put them into Storm's lib as well; for example, when using MySQL, add mysql-connector-java-5.1.22-bin.jar to Storm's lib.
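Putting these pieces together, a sketch of a topology that reads the test topic through the plugin's KafkaSpout and persists each line to MySQL over JDBC might look like the following. The Zookeeper address, database name, table, and credentials are illustrative assumptions, not values from the original project:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaToMysqlTopology {

    // Bolt that inserts each incoming line into a (hypothetical) MySQL table
    public static class MysqlBolt extends BaseBasicBolt {
        private transient Connection conn;

        public void prepare(Map conf, TopologyContext context) {
            try {
                // Requires mysql-connector-java-5.1.22-bin.jar in Storm's lib
                conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost:3306/stormdb", "user", "password");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            try {
                PreparedStatement ps =
                        conn.prepareStatement("INSERT INTO logs(line) VALUES (?)");
                ps.setString(1, tuple.getString(0));
                ps.executeUpdate();
                ps.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream
        }
    }

    public static void main(String[] args) {
        // Consume the "test" topic; offsets are tracked under /kafkastorm in Zookeeper
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("localhost:2181"), "test", "/kafkastorm", "mysqlwriter");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);
        builder.setBolt("mysql-bolt", new MysqlBolt(), 1).shuffleGrouping("kafka-spout");

        // Local mode for testing; on a cluster use StormSubmitter instead
        new LocalCluster().submitTopology("kafkaToMysql", new Config(),
                builder.createTopology());
    }
}

(Creating one PreparedStatement per tuple keeps the sketch short; a real bolt would reuse the statement or batch inserts.)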

Flume and Kafka Integration

1. Download flume-kafka-plus: https://github.com/beyondj2ee/flumeng-kafka-plugin

2. Extract the flume-conf.properties file from the plugin

Modify the file's #source section:

producer.sources.s.type = exec
producer.sources.s.command = tail -f -n +1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c

Change the value of every topic property to test

Put the changed configuration file into the flume/conf directory

From the plugin project, copy the jar packages it requires into Flume's lib directory.

The above is the configuration and installation of the standalone version of Flume+Kafka+Storm.

Flume+Storm plug-in:

https://github.com/xiaochawan/edw-Storm-Flume-Connectors

Startup steps

After installing Storm, Flume, and Kafka, deploy and start the project (before deploying, it is a good idea to follow the installation documentation and test each of the Storm, Kafka, and Flume components individually).

The first step
Package the completed Storm project into a jar and place it on the server, for example at /usr/local/project/storm.xx.jar

Note: For how to write the Storm project, see "Kafka and Storm integration" in the installation documentation above.

Step Two

Start Zookeeper (you can start either the Zookeeper that comes with Kafka or a separately installed one; the following takes Kafka's as the example)

cd /usr/local/kafka

Bin/zookeeper-server-start.sh config/zookeeper.properties
Step Three
Start Kafka
cd /usr/local/kafka
> bin/kafka-server-start.sh config/server.properties
Create a topic
> bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
Note: Because Kafka message offsets are recorded and managed by Zookeeper, Zookeeper's IP must be specified. replica indicates how many copies of the topic's messages are kept, partition indicates how many partitions the topic is divided into, and test is the topic name.
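A side note on this: in Kafka 0.8 it is the high-level consumer that records consumed offsets in Zookeeper, which is why consumers are pointed at Zookeeper rather than at a broker list. A minimal stand-alone consumer sketch for the test topic (the class name, group id, and addresses are illustrative assumptions):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class TestConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The consumer connects to Zookeeper, which also stores its offsets
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "test-group");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Request one stream for the "test" topic
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("test", 1);
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

        // Block on incoming messages and print each payload
        ConsumerIterator<byte[], byte[]> it = streams.get("test").get(0).iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
        }
    }
}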
Fourth Step
Start Storm
> storm nimbus
> storm supervisor
> storm ui
cd /usr/local/project/
> storm jar storm.xx.jar storm.testtopology test
Note: storm.xx.jar is the Storm project jar built in the first step. storm.testtopology is the fully qualified class name of the main class in the Storm project, and test is the name of this topology.
Fifth Step
Start Flume
cd /usr/local/flume

bin/flume-ng agent --conf conf --conf-file conf/flume.conf.properties --name producer -Dflume.root.logger=INFO,console

Note: flume.conf.properties is our custom Flume configuration file; the Flume installation does not include this file, so we need to write it ourselves, in the manner described in the Flume installation section.

Now all the programs that need to be started are running and the Storm project is live; you can open the Storm UI to see whether it is working:
http://localhost:8080
Note: Replace localhost with the IP of the machine running Storm Nimbus; the UI port can be modified in the Storm configuration file storm/conf/storm.yaml.

This digest, titled "Flume+kafka+storm+mysql Architecture Design", is taken from http://blog.csdn.net/mylittlered/article/details/20810265.
