Original: http://mp.weixin.qq.com/s?__biz=MjM5NzAyNTE0Ng==&mid=205526269&idx=1&sn=6300502dad3e41a36f9bde8e0ba2284d
Although I have always disapproved of building an entire system out of open source software, for startups it offers high efficiency and low cost, so there is real room to apply it. The risk lies in maintaining the system.
This article describes how to use Flume+Kafka+Storm+MySQL to build a distributed big data streaming architecture, covering the basic architecture, installation, deployment, and so on.
Architecture diagram
Data flow graph
(The diagrams were drawn in Visio; they are too large and look small when embedded. Friends who need them can leave their email address.)
Introduction to the real-time log analysis system architecture
The system is divided into four main parts:
1) Data acquisition
Responsible for collecting data in real time from each node; Cloudera's Flume is chosen to implement this
2) Data access
Because the speed of data acquisition and the speed of data processing are not necessarily synchronized, a message middleware is added as a buffer; Apache's Kafka is used
3) Flow-based computing
Real-time analysis of the collected data, using Apache's Storm
4) Data output
Persisting the results of the analysis, tentatively using MySQL
Detailed description of each component and installation configuration:
Operating system: CentOS 6.4
Flume
Flume is Cloudera's distributed, reliable, and highly available log collection system. It supports customizing all kinds of data senders in the log system to collect data, and it also provides the ability to do simple processing on the data and write it to various (customizable) data receivers.
Typical architecture for Flume:
Flume data sources and output modes:
Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog log system, with TCP and UDP support), and exec (command execution). Our system currently uses exec for log capture.
Flume's data receivers can be console, text (file), DFS (HDFS file), RPC (Thrift-RPC), syslogtcp (the TCP syslog log system), and so on. In our system the data is received by Kafka.
Flume version: 1.4.0
Flume Download and Documentation:
http://flume.apache.org/
Flume Installation:
$ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local
Flume Start command:
$ bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console
Note: You need to modify the configuration file under the conf directory and add the required jar packages to the lib directory.
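As a reference, here is a minimal sketch of conf/flume-conf.properties for a first smoke test. The agent name producer matches the --name flag above; the log file path and the logger sink are illustrative assumptions (the Kafka sink actually used by this architecture is configured in the integration section later in this article):

# minimal sketch, assuming agent name "producer" (matches --name above)
producer.sources = s
producer.channels = c
producer.sinks = r

# exec source: tail an example log file (path is illustrative)
producer.sources.s.type = exec
producer.sources.s.command = tail -F /var/log/test.log
producer.sources.s.channels = c

# memory channel buffers events between source and sink
producer.channels.c.type = memory
producer.channels.c.capacity = 1000

# logger sink: print events to the console for a smoke test
producer.sinks.r.type = logger
producer.sinks.r.channel = c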
Kafka
Kafka is a message middleware that is characterized by:
1. Focus on high throughput, not other features
2. For real-time scenarios
3. The state of messages being processed is maintained on the consumer side, not by the Kafka server
4. Distributed: producers, brokers, and consumers are all spread across multiple machines
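To illustrate point 3 (offsets tracked on the consumer side via ZooKeeper), here is a minimal sketch of a Kafka 0.8 high-level consumer in Java. The topic test, the group id, and the addresses are illustrative and match the examples used later in this article:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // in Kafka 0.8 the consumer uses ZooKeeper for broker discovery and offset storage
        props.put("zookeeper.connect", "localhost:2181");
        // consumers sharing a group id divide the partitions of a topic among themselves
        props.put("group.id", "test-group");
        props.put("zookeeper.session.timeout.ms", "4000");
        props.put("auto.commit.interval.ms", "1000");

        ConsumerConnector consumer =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // ask for one stream of the topic "test"
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("test", 1);
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(topicCountMap);

        // block on the stream and print each message
        ConsumerIterator<byte[], byte[]> it = streams.get("test").get(0).iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
        }
    }
}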
Architecture diagram for Kafka:
Kafka version: 0.8.0
Kafka Download and Documentation: http://kafka.apache.org/
Kafka Installation:
> tar xzf kafka-<version>.tgz
> cd kafka-<version>
> ./sbt update
> ./sbt package
> ./sbt assembly-package-dependency
Start and test commands:
(1) Start server
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
(2) Create a topic
> bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
> bin/kafka-list-topic.sh --zookeeper localhost:2181
(3) Send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
(4) Start a consumer
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
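Messages can also be sent programmatically instead of through the console script. Below is a minimal Java producer sketch for Kafka 0.8; the broker address and topic match the commands above, and the class name is illustrative:

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // unlike the consumer, the producer talks to brokers directly, not to ZooKeeper
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        // send one message to the "test" topic created above
        producer.send(new KeyedMessage<String, String>("test", "hello kafka"));
        producer.close();
    }
}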
Storm
Storm is a distributed, high-fault-tolerant real-time computing system.
Storm framework composition:
Storm work task topology:
Storm version: 0.9.0
Storm Download: http://storm-project.net/
Storm Installation:
Step one, install Python 2.7.2
# wget http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
# tar zxvf Python-2.7.2.tgz
# cd Python-2.7.2
# ./configure
# make
# make install
# vi /etc/ld.so.conf
Step two, install ZooKeeper (Kafka comes with ZooKeeper; if you use the one bundled with Kafka, this step can be omitted)
# wget http://ftp.meisei-u.ac.jp/mirror/apache/dist//zookeeper/zookeeper-3.3.3/zookeeper-3.3.3.tar.gz
# tar zxf zookeeper-3.3.3.tar.gz
# ln -s /usr/local/zookeeper-3.3.3/ /usr/local/zookeeper
# vi ~/.bashrc (set ZOOKEEPER_HOME and add $ZOOKEEPER_HOME/bin to PATH)
Step three, install Java
$ tar zxvf jdk-7u45-linux-x64.tar.gz -C /usr/local
If you use a Storm version below 0.9, you also need to install ZeroMQ and JZMQ.
Step four, install ZeroMQ and JZMQ
JZMQ appears to depend on ZeroMQ, so ZeroMQ should be installed first, followed by JZMQ.
1) Install ZeroMQ (not required for Storm 0.9 and above):
# wget http://download.zeromq.org/historic/zeromq-2.1.7.tar.gz
# tar zxf zeromq-2.1.7.tar.gz
# cd zeromq-2.1.7
# ./configure
# make
# make install
# sudo ldconfig (refreshes the shared library cache)
If the C++ environment is missing, install it with: yum install gcc-c++
If you encounter the error "cannot link with -luuid", it is because no UUID-related package is installed. The workaround is:
# yum install uuid*
# yum install e2fsprogs*
# yum install libuuid*
2) Install JZMQ (not required for Storm 0.9 and above)
Then install JZMQ. The problems mentioned in some online guides were not encountered here; readers who do hit them can consult those references. If ./autogen.sh fails with the error "autogen.sh: error: could not find libtool. libtool is required to run autogen.sh", libtool is missing; it can be installed with # yum install libtool*.
If you are installing Storm 0.9 or above, you do not need to install ZeroMQ and JZMQ, but you need to modify storm.yaml to specify Netty as the message transport:
storm.local.dir: "/tmp/storm/data"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100
Step five, install Storm
$ unzip storm-0.9.0-wip16.zip
Note: The standalone version does not need the configuration file modified; when modifying the configuration file for a distributed deployment, note that every colon must be followed by a space.
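For example, a minimal distributed storm.yaml might look like the following sketch; the IP address is a placeholder, and note the space after every colon:

# hypothetical distributed settings; replace the IP with your own
storm.zookeeper.servers:
  - "192.168.1.100"
nimbus.host: "192.168.1.100"
storm.local.dir: "/tmp/storm/data"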
To test if Storm is installed successfully:
1. Download the storm-starter code: git clone https://github.com/nathanmarz/storm-starter.git
2. Compile it with mvn -f m2-pom.xml package
If you have not installed Maven, install it as follows:
1. Download Maven from its website http://maven.apache.org/
$ tar zxvf apache-maven-3.1.1-bin.tar.gz -C /usr/local
Configure the Maven environment variables:
export MAVEN_HOME=/usr/local/maven
export PATH=$PATH:$MAVEN_HOME/bin
Verify that Maven installed successfully: mvn -v
Modify the storm-starter pom file m2-pom.xml, changing the dependency versions of the twitter4j-core and twitter4j-stream packages as follows:
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>[2.2,)</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>[2.2,)</version>
</dependency>
After compiling, a target folder is generated.
Start ZooKeeper:
zkServer.sh start
Start Nimbus, Supervisor, and the UI:
storm nimbus
storm supervisor
storm ui
Use jps to view the startup status.
Go to the target directory and execute:
storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology wordcounttop
Then view http://localhost:8080
Note: The standalone version does not require modifying storm.yaml.
Kafka and Storm integration
1. Download the kafka-storm 0.8 plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus
2. After downloading, the project needs to be debugged and its dependent jar packages gathered; then repackage it as a jar for our Storm project to use.
3. Add that jar package plus kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar (these three jars can be found among the storm-kafka-0.8-plus project's dependencies)
Note: If the project you are developing requires other jars, remember to put them into Storm's lib as well; for example, when using MySQL, add mysql-connector-java-5.1.22-bin.jar to Storm's lib.
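To make the "Storm project" concrete, here is a minimal hedged sketch of a topology that consumes the test topic through the storm-kafka-0.8-plus KafkaSpout and persists each line to MySQL. The class names, the logs table, and the JDBC URL and credentials are illustrative assumptions, not from the original post:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class LogTopology {

    // hypothetical bolt that inserts each log line into MySQL
    public static class MysqlBolt extends BaseBasicBolt {
        private transient Connection conn;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            try {
                // assumes mysql-connector-java is in Storm's lib and
                // a table logs(line VARCHAR(1024)) already exists
                Class.forName("com.mysql.jdbc.Driver");
                conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost:3306/test", "root", "root");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            try {
                PreparedStatement ps =
                        conn.prepareStatement("INSERT INTO logs(line) VALUES (?)");
                ps.setString(1, input.getString(0));
                ps.executeUpdate();
                ps.close();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, emits nothing
        }
    }

    public static void main(String[] args) throws Exception {
        // read the "test" topic via the ZooKeeper started earlier;
        // zkRoot and spout id are illustrative
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("localhost:2181"), "test", "/kafkastorm", "log-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        builder.setBolt("mysql-bolt", new MysqlBolt(), 1)
               .shuffleGrouping("kafka-spout");

        // the topology name comes from the command line, matching
        // the "storm jar storm.xx.jar <main class> test" step below
        StormSubmitter.submitTopology(args[0], new Config(),
                builder.createTopology());
    }
}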
Flume and Kafka integration
1. Download flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin
2. Extract the flume-conf.properties file from the plugin
Modify the file's #source section:
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n+1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c
Change the value of every topic property to test.
Put the modified configuration file into the flume/conf directory.
Copy the jar packages provided by the plugin project into flume's lib directory. A complete configuration might look like the sketch below.
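For reference, a full flume-conf.properties for this setup might look like the following. The sink property names are assumed from the flumeng-kafka-plugin sample configuration; verify them against the plugin's README:

producer.sources = s
producer.channels = c
producer.sinks = r

# source section, as modified above
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n+1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c

# memory channel between the exec source and the Kafka sink
producer.channels.c.type = memory
producer.channels.c.capacity = 1000

# KafkaSink class shipped with flumeng-kafka-plugin (property names assumed from its sample)
producer.sinks.r.type = org.apache.flume.plugins.KafkaSink
producer.sinks.r.metadata.broker.list = 127.0.0.1:9092
producer.sinks.r.serializer.class = kafka.serializer.StringEncoder
producer.sinks.r.request.required.acks = 1
producer.sinks.r.producer.type = sync
producer.sinks.r.custom.encoding = UTF-8
producer.sinks.r.custom.topic.name = test
producer.sinks.r.channel = c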
The above is the stand-alone configuration and installation of Flume+Kafka+Storm.
Flume+Storm plug-in:
https://github.com/xiaochawan/edw-Storm-Flume-Connectors
Startup steps
After installing Storm, Flume, and Kafka, deploy and start the project (before deploying and starting, it is best to test each of the Storm, Kafka, and Flume components individually, following the installation documentation).
Step one
Package the finished Storm project into a jar and place it on the server, for example at /usr/local/project/storm.xx.jar
Note: For how to write the Storm project, see "Kafka and Storm integration" in the installation documentation.
Step two
Start ZooKeeper (you can start either the ZooKeeper bundled with Kafka or a separately installed one; the following takes Kafka's as the example)
cd /usr/local/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
Step three
Start Kafka
cd /usr/local/kafka
> bin/kafka-server-start.sh config/server.properties
Create a topic
> bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
Note: Because the offsets of Kafka messages are recorded and managed by ZooKeeper, the ZooKeeper IP must be specified. replica indicates how many copies of the topic's messages are kept, partition indicates how many partitions each topic is split into, and test is the topic name.
Step four
Start Storm
> storm nimbus
> storm supervisor
> storm ui
cd /usr/local/project/
> storm jar storm.xx.jar storm.testtopology test
Note: storm.xx.jar is the jar package of the Storm project we wrote, completed in step one. storm.testtopology is the full class path of the main method in the Storm project, and test is the name of this topology.
Step five
Start Flume
cd /usr/local/flume
bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console
Note: flume-conf.properties is our custom Flume configuration file; it does not exist after Flume is installed, so we need to write it ourselves, as described in the Flume installation section of this article.
Now all the programs that need to be started are running, and the Storm project is live; you can open the Storm UI to see whether it is working:
http://localhost:8080
Note: localhost is the IP of the machine running Storm Nimbus; the port can be modified in the Storm configuration file storm/conf/storm.yaml
This digest, titled "Flume+Kafka+Storm+MySQL Architecture Design", is taken from http://blog.csdn.net/mylittlered/article/details/20810265.
[Reprint] Building a Big Data Real-time System Using Flume+Kafka+Storm+MySQL