This architecture has been around for a long time, but it is very mature.
The general data flow is: data acquisition, data access, stream computing, then output/storage.
1) Data acquisition: responsible for collecting data in real time from each node; Cloudera's Flume is used for this.
2) Data access: because the speed of data acquisition and the speed of data processing are not necessarily in sync, a message middleware is added as a buffer; Apache Kafka is used.
3) Stream computing: real-time analysis of the collected data, using Apache Storm.
4) Data output: persists the results of the analysis, tentatively using MySQL.
Another benefit of this modular design is that if Storm goes down, data acquisition and data access keep running and no data is lost; once Storm is back up, it can resume stream computing.
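The decoupling described above can be illustrated with a small sketch. This is not Flume, Kafka, or Storm code: the `Buffer` class and the `collect` and `process` functions are hypothetical stand-ins, used only to show why a buffer between acquisition and computation lets the computation stage go down and later catch up without losing data.

```python
class Buffer:
    """Hypothetical stand-in for the message middleware (Kafka in the article)."""

    def __init__(self):
        self._log = []      # append-only message log
        self._offset = 0    # position of the next unread message

    def publish(self, message):
        self._log.append(message)

    def poll(self, max_messages):
        batch = self._log[self._offset:self._offset + max_messages]
        self._offset += len(batch)
        return batch


def collect(buffer, lines):
    """Acquisition stage: push every collected line into the buffer."""
    for line in lines:
        buffer.publish(line)


def process(buffer, counts, max_messages=100):
    """Computation stage: drain pending messages and count words."""
    for line in buffer.poll(max_messages):
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


buffer = Buffer()
counts = {}
collect(buffer, ["a b", "b c"])   # the computation stage is "down" here
collect(buffer, ["c c"])          # acquisition keeps running regardless
process(buffer, counts)           # computation comes back and catches up
print(counts)                     # {'a': 1, 'b': 2, 'c': 3}
```

The consumer tracks its own read position, so nothing published while it was stopped is skipped when it resumes.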
Let's take a look at the overall architecture diagram.
Detailed description of each component and its installation and configuration:
Operating system: Ubuntu
Flume
Flume is a distributed, reliable, and highly available log collection system from Cloudera. It supports customizable data senders for collecting data from a logging system, and it can also do simple processing on the data and write it to various (customizable) data recipients.
Typical Flume architecture:
Flume data sources and output modes:
Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, with both TCP and UDP modes supported), and exec (command execution). Our system currently uses exec for log collection.
Flume data recipients can be console, text (file), dfs (HDFS file), RPC (Thrift-RPC), syslogTCP (TCP syslog logging system), and so on. In our system the data is received by Kafka.
Flume download and documentation: http://flume.apache.org/
Flume installation:
- $ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local
Flume start command:
- $ bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console
Kafka
Kafka is a high-throughput distributed publish-subscribe messaging system that has the following features:
- Message persistence through O(1) disk data structures, which keep performance stable over the long term even with terabytes of stored messages.
- High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
- Support for partitioning messages across Kafka servers and consumer clusters.
- Supports Hadoop parallel data loading.
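The partitioning feature in the list above can be illustrated with a minimal sketch. This is not Kafka's actual partitioner: the `choose_partition` and `partition_messages` functions are hypothetical, and CRC32 is used here only because it gives a deterministic hash. The idea shown is that messages with the same key always land in the same partition, which preserves their relative order.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_messages(messages, num_partitions):
    """Distribute (key, value) pairs over a fixed number of partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append(value)
    return partitions


messages = [("user1", "view"), ("user2", "search"), ("user1", "click")]
partitions = partition_messages(messages, 3)

# Both "user1" events sit in the same partition, in their original order.
user1_partition = partitions[choose_partition("user1", 3)]
print(user1_partition)
```

Different consumers can then each read a different partition in parallel, while every individual key is still processed in order.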
The purpose of Kafka is to provide a publish-subscribe solution that can handle all the activity stream data of a consumer-scale website. This kind of activity (page views, searches, and other user actions) is a key ingredient in many social features on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. That is a workable solution for log data that feeds offline analysis systems such as Hadoop, but it is limiting when real-time processing is required. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time consumption across a cluster of machines.
[Figure: Kafka distributed publish-subscribe architecture, taken from the official Kafka website]
[Figure: the architecture diagram from the Luobao brothers' article]
In fact, the two are not much different: the official diagram simply condenses Kafka into a single Kafka Cluster, while the second diagram is more detailed.
Kafka version: 0.8.0
Kafka download and documentation: http://kafka.apache.org/
Kafka installation:
- > tar xzf kafka-<version>.tgz
- > cd kafka-<version>
- > ./sbt update
- > ./sbt package
- > ./sbt assembly-package-dependency
Start and test commands:
(1) Start the server
- > bin/zookeeper-server-start.sh config/zookeeper.properties
- > bin/kafka-server-start.sh config/server.properties
This follows the tutorial on the official website. Kafka ships with a built-in ZooKeeper, but in my actual deployment I use a separate ZooKeeper cluster, so I did not run the first command; it is shown here only for reference.
To use a standalone ZooKeeper cluster, edit the server.properties file and change zookeeper.connect to the IP and port of the standalone cluster:
- zookeeper.connect=nutch1:2181
(2) Create a topic
- > bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
- > bin/kafka-list-topic.sh --zookeeper localhost:2181
(3) Send some messages
- > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
(4) Start a consumer
- > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
kafka-console-producer.sh and kafka-console-consumer.sh are just the command-line tools shipped with the system. They are used here to test that production and consumption work normally and to verify the flow; in real development you write your own producers and consumers.
For Kafka installation you can also refer to an article I wrote earlier: http://blog.csdn.net/weijonathan/article/details/18075967
Storm
Twitter has officially open-sourced Storm, a distributed, fault-tolerant, real-time computation system that is hosted on GitHub under the Eclipse Public License 1.0. Storm is a real-time processing system developed by BackType, which is now part of Twitter. The latest version on GitHub is Storm 0.5.2, and it is written mostly in Clojure.
The main features of Storm are as follows:
- A simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
- Support for a variety of programming languages. Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
- Fault tolerance. Storm manages the failure of worker processes and nodes.
- Horizontal expansion. Calculations are performed in parallel between multiple threads, processes, and servers.
- Reliable message handling. Storm guarantees that each message can be processed at least once. When a task fails, it is responsible for retrying the message from the message source.
- Fast. The system is designed so that messages are processed quickly, using ØMQ as its underlying message queue. (Version 0.9.0.1 supports both ØMQ and Netty.)
- Local mode. Storm has a "local mode" that can fully simulate a Storm cluster in-process. This lets you develop and unit test quickly.
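The "at least once" guarantee in the list above can be sketched as follows. This is a toy simulation, not Storm's API: `ReplayingSource`, `run`, and `flaky_handler` are hypothetical names. The source keeps every message until it is acknowledged and replays it when processing fails.

```python
from collections import deque


class ReplayingSource:
    """Hypothetical message source that keeps each message until it is acked."""

    def __init__(self, messages):
        self._pending = deque(enumerate(messages))  # (message id, payload)
        self._in_flight = {}

    def next_message(self):
        if not self._pending:
            return None
        msg_id, payload = self._pending.popleft()
        self._in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        del self._in_flight[msg_id]

    def fail(self, msg_id):
        # Put the failed message back so that it is delivered again.
        self._pending.append((msg_id, self._in_flight.pop(msg_id)))


def run(source, handler):
    """Process messages until none are left; return the number of attempts."""
    attempts = 0
    while True:
        item = source.next_message()
        if item is None:
            return attempts
        msg_id, payload = item
        attempts += 1
        try:
            handler(payload)
            source.ack(msg_id)
        except RuntimeError:
            source.fail(msg_id)


processed = []
failed_once = set()


def flaky_handler(payload):
    # Fail the first attempt for "b" to simulate a task failure.
    if payload == "b" and payload not in failed_once:
        failed_once.add(payload)
        raise RuntimeError("simulated task failure")
    processed.append(payload)


attempts = run(ReplayingSource(["a", "b", "c"]), flaky_handler)
print(processed, attempts)  # ['a', 'c', 'b'] 4
```

Every message is eventually processed, but a retried message arrives later than its neighbours, and a handler that fails halfway may see the same message twice, so downstream writes need to tolerate repeats.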
Due to space constraints, for the detailed installation steps see: Storm-0.9.0.1 Installation and Deployment Guide
Now the main event begins: integrating the frameworks.
Flume and Kafka integration
1. Download flumeng-kafka-plugin: https://github.com/beyondj2ee/flumeng-kafka-plugin
2. Extract the flume-conf.properties file from the plugin and modify it:
#source section
producer.sources.s.type = exec
producer.sources.s.command = tail -f -n+1 /mnt/hgfs/vmshare/test.log
producer.sources.s.channels = c
Change the value of every topic setting to test.
Put the modified configuration file into the flume/conf directory.
Then copy the required jar packages from the project into Flume's lib directory.
Note: the flumeng-kafka-plugin.jar package was later moved into the package directory of the GitHub project. If you cannot find it, get it from that directory.
After completing the steps above, let's test whether the Flume + Kafka flow works end to end. Start Flume, then start Kafka, following the startup steps given earlier, and then use Kafka's kafka-console-consumer.sh script to check whether Flume is delivering data to Kafka.
[Figure: console consumer output]
The output above is the data from my test.log file, captured by Flume and delivered to Kafka, which shows that the Flume-to-Kafka flow works.
You may remember from the flowchart at the beginning that Flume sends data to Kafka in one step and to HDFS in another, and so far we have not covered how to write to Kafka and HDFS at the same time.
Flume supports replicating data to multiple destinations. The replication flowchart is taken from the Flume official website; the User Guide is at: http://flume.apache.org/FlumeUserGuide.html
[Figure: Flume replication flowchart]
To set up replication, look at the following configuration:
- # Configuration file with 2 channels and 2 sinks. Here we can set up two sinks: one for Kafka and the other for HDFS.
- a1.sources = r1
- a1.sinks = k1 k2
- a1.channels = c1 c2
The specific configuration should be set according to your own needs, so no detailed example is given here.
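The fan-out behaviour that the two-channel, two-sink layout relies on can be sketched conceptually. This is not Flume code or Flume configuration: `ReplicatingSource` and `drain` are hypothetical names, and the two lists merely stand in for the Kafka and HDFS sinks. The point shown is that the source writes each event to every channel, and each sink drains only its own channel.

```python
from collections import deque


class ReplicatingSource:
    """Hypothetical source that copies every event into all of its channels."""

    def __init__(self, channels):
        self.channels = channels

    def emit(self, event):
        for channel in self.channels:
            channel.append(event)


def drain(channel, sink):
    """Move everything queued in one channel into its sink."""
    while channel:
        sink.append(channel.popleft())


c1, c2 = deque(), deque()          # the two channels (c1, c2)
kafka_sink, hdfs_sink = [], []     # stand-ins for the two sinks (k1, k2)

source = ReplicatingSource([c1, c2])
for line in ["line 1", "line 2"]:
    source.emit(line)

drain(c1, kafka_sink)
drain(c2, hdfs_sink)
print(kafka_sink == hdfs_sink)     # True
```

Because each sink has its own channel, a slow or stopped sink only backs up its own queue and does not block the other destination.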
Integration of Kafka and Storm
1. Download the kafka-storm0.8 plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus
2. Build it with Maven (mvn package) to get the storm-kafka-0.8-plus-0.3.0-SNAPSHOT.jar package. (Note to anyone who has reposted this article: the package name given here was wrong before and is now corrected. Sorry!)
3. Add that jar, together with kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar (these three jars can be found in the Kafka project), to Storm's lib directory.
Note: if your project needs additional jars, remember to put them into Storm's lib directory as well. For example, to use MySQL, add mysql-connector-java-5.1.22-bin.jar to Storm's lib.
Then restart Storm.
After completing the steps above, one thing remains: writing your own Storm program that uses the kafka-storm0.8 plugin.
Here is a Storm program I put together, shared on Baidu Network Disk. Link: http://pan.baidu.com/s/1jGBp99W Password: 9arq
First, look at the program's topology creation code.
[Figure: topology creation code]
The data operations are mainly in the WordCounter class, which only uses simple JDBC to do the inserts.
[Figure: WordCounter class]
The program takes a single argument, the topology name. We use local mode here, so we pass no argument and simply check whether the flow works:
- storm-0.9.0.1/bin/storm jar storm-start-demo-0.0.1-SNAPSHOT.jar com.storm.topology.MyTopology
Let's look at the log: it prints the data being inserted into the database.
[Figure: log output]
Then we check the database: the insert succeeded!
[Figure: database contents]
Our entire integration is complete here! But there is a problem, and I wonder whether you have spotted it. Since we use Storm for distributed stream computing, the most important concerns in a distributed setting are data consistency and avoiding dirty data. The test project I provide is therefore only suitable for testing and cannot be used as-is for production development.
A suggestion from a user with the screen name "Morning color Sky ee" is to set up a ZooKeeper-based distributed global lock to ensure data consistency and keep dirty data out. For the ZooKeeper client framework we can use Netflix Curator. I have not looked into this part yet, so this is as far as I can write.
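The idea behind that suggestion can be sketched on a single machine. This is only an analogy and does not use ZooKeeper or Curator: a local `threading.Lock` stands in for the distributed global lock, and the hypothetical `CountStore` class stands in for the database. It shows the read-modify-write that needs to be serialized so that concurrent workers do not overwrite each other's updates.

```python
import threading


class CountStore:
    """Hypothetical word-count store guarded by a lock."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()   # stand-in for a distributed lock

    def increment(self, word):
        # The read and the write happen under the same lock, so no other
        # worker can interleave between them.
        with self._lock:
            current = self._counts.get(word, 0)
            self._counts[word] = current + 1

    def get(self, word):
        with self._lock:
            return self._counts.get(word, 0)


def worker(store, words):
    for word in words:
        store.increment(word)


store = CountStore()
threads = [
    threading.Thread(target=worker, args=(store, ["storm"] * 1000))
    for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(store.get("storm"))  # 4000
```

In a real cluster the workers run in separate processes on separate machines, which is why the lock would have to live in a shared coordination service rather than in process memory.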
Reposted from: Big Data architecture: Flume-ng+Kafka+Storm+HDFS real-time system combination