Personal opinion: when it comes to big data, everyone knows Hadoop, but Hadoop is not the whole picture. How do we build a large-scale data project? For offline processing, Hadoop is still the better fit. When the real-time requirements are strong and the data volume is large, we can use Storm. Which technologies should Storm be paired with to make a suitable project? The following can serve as a reference.
You can read this article with the following questions:
1. What are the characteristics of a good project architecture?
2. How does the project structure ensure the accuracy of the data?
3. What is Kafka?
4. How are Flume and Kafka integrated?
5. Which script can be used to check whether Flume is transferring data to Kafka?
Anyone in software development is familiar with the idea of modularity. There are two reasons for designing the system this way:
On the one hand, modularization divides the responsibilities clearly along the flow "data acquisition -> data access -> stream computing -> data output/storage":
1). Data acquisition: responsible for collecting data from each node in real time; Cloudera's Flume is chosen for this.
2). Data access: the speed of data acquisition and the speed of data processing are not necessarily in step, so a message middleware is added as a buffer; Apache Kafka is used.
3). Stream computing: real-time analysis of the collected data; Apache Storm is used.
4). Data output: persisting the results of the analysis; MySQL is used for now.
On the other hand, once the system is modularized, if Storm goes down, data acquisition and data access keep running and no data is lost. When Storm comes back up, it can continue the stream computation.
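The buffering argument above can be illustrated with a minimal sketch. It is not the article's code: `Buffer` and `Processor` are hypothetical in-memory stand-ins for Kafka and Storm, and the point is only that messages wait in the buffer while the processing stage is down.

```python
from collections import deque


class Buffer:
    """In-memory stand-in for the message middleware (Kafka in the article)."""

    def __init__(self):
        self._messages = deque()

    def put(self, message):
        self._messages.append(message)

    def drain(self):
        """Hand over everything buffered so far, oldest first."""
        drained = list(self._messages)
        self._messages.clear()
        return drained


class Processor:
    """Stand-in for the stream-computing stage (Storm in the article)."""

    def __init__(self):
        self.running = True
        self.processed = []

    def poll(self, buffer):
        # A stopped processor reads nothing; messages simply wait in the buffer.
        if self.running:
            self.processed.extend(buffer.drain())


def run_demo():
    buffer, processor = Buffer(), Processor()

    buffer.put("line 1")
    processor.poll(buffer)

    processor.running = False   # the processing stage goes down
    buffer.put("line 2")        # acquisition keeps running
    buffer.put("line 3")
    processor.poll(buffer)      # nothing is consumed while it is down

    processor.running = True    # the processing stage comes back
    processor.poll(buffer)
    return processor.processed


if __name__ == "__main__":
    print(run_demo())
```

In the real system the buffer is durable and distributed, which an in-memory queue is not; the sketch only shows why the stages can fail independently.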
So let's take a look at the overall architecture diagram.
Detailed description of each component, with installation and configuration:
Operating system: Ubuntu
Flume
Flume is a distributed, reliable, and highly available log collection system from Cloudera. It supports customizing various kinds of data senders in a logging system in order to collect data, and it can also do simple processing on the data and write it to various (customizable) data receivers.
Typical architecture of Flume:
Flume data sources and output modes:
Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog log system, supporting both TCP and UDP modes), and exec (command execution). Our system currently uses the exec method for log capture.
Flume's data receivers can be console, text (file), dfs (HDFS file), RPC (Thrift-RPC), syslogTCP (TCP syslog log system), and so on. In our system the data is received by Kafka.
Flume download and documentation: http://flume.apache.org/
Flume installation:
- $ tar zxvf apache-flume-1.4.0-bin.tar.gz -C /usr/local
Flume start command:
- $ bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name producer -Dflume.root.logger=INFO,console
Kafka
Kafka is a high-throughput distributed publish-subscribe messaging system with the following features:
- Provides message persistence through O(1) disk data structures, which keep performance stable over the long term even with terabytes of stored messages.
- High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
- Supports partitioning messages across Kafka servers and consumer clusters.
- Supports parallel data loading into Hadoop.
The purpose of Kafka is to provide a publish-subscribe solution that can handle all the activity stream data of a consumer-scale website. This kind of activity (page views, searches, and other user actions) is a key ingredient in many social features on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. That is a workable solution for feeding log data into offline analysis systems such as Hadoop, but it is limiting when real-time processing is required. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time consumption across a cluster of machines.
Kafka's distributed publish-subscribe architecture is shown in the diagram, which is taken from an article by colleagues at Taobao. The architecture diagram on the official Kafka website is not much different: the official one simply represents Kafka concisely as a single Kafka Cluster, whereas the diagram above is more detailed.
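To make the persistence and partitioning ideas above more concrete, here is a toy sketch of an append-only, partitioned log in which each consumer keeps its own read position. It is an illustration under assumptions, not Kafka's storage format or client API; `PartitionedLog`, `Consumer`, and the CRC-based key routing are all hypothetical.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (not Kafka's real storage format)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Messages with the same key always land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index, len(self.partitions[index]) - 1   # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        # Reading never removes data; each consumer keeps its own offset.
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Toy consumer that remembers how far it has read in every partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = [0] * len(log.partitions)

    def poll(self):
        received = []
        for partition, offset in enumerate(self.offsets):
            batch = self.log.read(partition, offset)
            self.offsets[partition] += len(batch)
            received.extend(batch)
        return received
```

Because reads do not delete anything, a real-time consumer and an offline batch loader can each create their own `Consumer` over the same log and read the same messages independently, which is the sense in which online and offline processing share one pipeline.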
Kafka version: 0.8.0
Kafka download and documentation: http://kafka.apache.org/
Kafka installation:
- > tar xzf kafka-<version>.tgz
- > cd kafka-<version>
- > ./sbt update
- > ./sbt package
- > ./sbt assembly-package-dependency
Start and test commands:
(1) Start the server
- > bin/zookeeper-server-start.sh config/zookeeper.properties
- > bin/kafka-server-start.sh config/server.properties
The commands above follow the tutorial on the official website. Kafka ships with a built-in ZooKeeper, but in my actual deployment I use a separate ZooKeeper cluster, so I did not run the first command; it is listed here only for reference.
To use a standalone ZooKeeper cluster, edit the server.properties file and set zookeeper.connect to the IP addresses and ports of the standalone cluster.
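If you prefer to script that change rather than edit the file by hand, a small helper along these lines could rewrite the property. This is a sketch only: `set_property` is a hypothetical helper, and the host names `zk1`, `zk2`, `zk3` in the usage note are placeholders for your own ZooKeeper nodes.

```python
def set_property(text, key, value):
    """Return properties text with `key` set to `value`, appending it if absent."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without a key=value pair untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

For example, reading config/server.properties into a string, calling `set_property(text, "zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181")`, and writing the result back would point the broker at a three-node cluster.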
(2) Create a topic
- > bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 1 --partition 1 --topic test
- > bin/kafka-list-topic.sh --zookeeper localhost:2181
(3) Send some messages
- > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
(4) Start a consumer
- > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
kafka-console-producer.sh and kafka-console-consumer.sh are just command-line tools that ship with Kafka. They are used here to test that messages can be produced and consumed normally and to verify that the flow is correct. In real development you write your own producers and consumers.
For Kafka installation you can also refer to an article I wrote earlier: http://blog.csdn.net/weijonathan/article/details/18075967
Storm
Twitter has officially open-sourced Storm, a distributed, fault-tolerant, real-time computation system that is hosted on GitHub and released under the Eclipse Public License 1.0. Storm was developed by BackType, which is now part of Twitter. The latest version on GitHub is Storm 0.5.2, and it is written mostly in Clojure.
The main features of Storm are as follows:
- A simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
- Support for many programming languages. Clojure, Java, Ruby, and Python are supported by default. To add support for another language, you only need to implement a simple Storm communication protocol.
- Fault tolerance. Storm manages the failure of worker processes and nodes.
- Horizontal scaling. Computation runs in parallel across multiple threads, processes, and servers.
- Reliable message handling. Storm guarantees that each message is fully processed at least once. When a task fails, Storm is responsible for replaying the message from the message source.
- Fast. The system is designed so that messages are processed quickly, using ØMQ as the underlying message queue. (Version 0.9.0.1 supports both ØMQ and Netty.)
- Local mode. Storm has a "local mode" that fully simulates a Storm cluster in-process, which lets you develop and unit test quickly.
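The "at least once" guarantee in the list above can be pictured with a toy source that re-emits a message until it is acknowledged. This is a simplified sketch, not Storm's spout API or its acking implementation; `ReplayingSource` and `process_all` are hypothetical names.

```python
from collections import deque


class ReplayingSource:
    """Toy message source that re-emits a message until it is acknowledged."""

    def __init__(self, messages):
        self._queue = deque(enumerate(messages))   # (message id, payload)
        self._pending = {}

    def next_message(self):
        if not self._queue:
            return None
        msg_id, payload = self._queue.popleft()
        self._pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Fully processed: forget the message.
        self._pending.pop(msg_id)

    def fail(self, msg_id):
        # Processing failed: queue the message again for another attempt.
        self._queue.append((msg_id, self._pending.pop(msg_id)))


def process_all(source, handler):
    """Drive the source until every message has been acknowledged."""
    attempts = 0
    while True:
        item = source.next_message()
        if item is None:
            return attempts
        msg_id, payload = item
        attempts += 1
        try:
            handler(payload)
            source.ack(msg_id)
        except Exception:
            source.fail(msg_id)
```

Note the consequence of replaying: a handler that fails partway through may run more than once for the same message, so any side effects it performs (such as a database insert) need to tolerate repetition.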
For reasons of space, see the Storm-0.9.0.1 Installation and Deployment Guide for the specific installation steps.
Now the main event begins: the integration between the frameworks.
Flume and Kafka integration
1. Download flume-kafka-plus: https://github.com/beyondj2ee/flumeng-kafka-plugin
2. Extract the flume-conf.properties file from the plugin and modify it:
- #source section
- producer.sources.s.type = exec
- producer.sources.s.command = tail -f -n+1 /mnt/hgfs/vmshare/test.log
- producer.sources.s.channels = c
Change the value of every topic to test. Put the modified configuration file into the flume/conf directory, and copy the following jar packages from the project into Flume's lib directory:
Note: the flumeng-kafka-plugin.jar package was later moved to the package directory of the GitHub project. Readers who cannot find it can get it from the package directory.
After completing the steps above, let's test whether the Flume + Kafka flow works end to end. Start Flume, then start Kafka, following the startup steps given earlier. Next, use Kafka's kafka-console-consumer.sh script to check whether Flume is transferring data to Kafka.
The output above is the data from my test.log file, collected by Flume and delivered to Kafka, so our Flume-to-Kafka flow works. Do you remember the flowchart at the start? One branch goes through Flume to Kafka, and another goes to HDFS. So far we have not said how to write to Kafka and HDFS at the same time. Flume supports replicating a data flow to multiple destinations; the replication flow chart is taken from the Flume official website. User Guide: http://flume.apache.org/FlumeUserGuide.html
How do we set up replication? Look at the following configuration:
- # Configuration file with 2 channels and 2 sinks. Here we can set up two sinks, one for Kafka and the other for HDFS.
- a1.sources = r1
- a1.sinks = k1 k2
- a1.channels = c1 c2
Set the rest of the configuration according to your own needs; no specific example is given here.
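The idea behind that layout, one source copying every event into two channels, each drained by its own sink, can be sketched as follows. This is a toy model of the data flow only, not Flume code or Flume configuration; the class names are hypothetical, and the variable names simply mirror the source, channel, and sink names in the configuration above.

```python
from collections import deque


class ReplicatingSource:
    """Toy source that copies every event into all attached channels."""

    def __init__(self, channels):
        self.channels = channels

    def emit(self, event):
        for channel in self.channels:
            channel.append(event)


class Sink:
    """Toy sink that drains one channel into its own destination list."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def drain(self):
        while self.channel:
            self.delivered.append(self.channel.popleft())


def build_agent():
    # One source, two channels, two sinks, as in the configuration above.
    c1, c2 = deque(), deque()
    r1 = ReplicatingSource([c1, c2])
    k1, k2 = Sink(c1), Sink(c2)   # k1 stands for Kafka, k2 for HDFS
    return r1, k1, k2
```

Because each sink has its own channel, a slow or stopped HDFS sink does not hold back delivery to Kafka; its events simply accumulate in its channel until it drains them.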
Integration of Kafka and Storm
1. Download the kafka-storm0.8 plugin: https://github.com/wurstmeister/storm-kafka-0.8-plus
2. Build it with mvn package to get the storm-kafka-0.8-plus-0.3.0-SNAPSHOT.jar package. (Note to anyone who has reposted this article: the package name was written incorrectly before and has now been corrected. Apologies!)
3. Add this jar, together with kafka_2.9.2-0.8.0-beta1.jar, metrics-core-2.2.0.jar, and scala-library-2.9.2.jar (these three jars can be found in the Kafka project), to Storm's lib directory.
Note: if the project you are developing needs other jars, remember to put them into Storm's lib as well. For example, to use MySQL, add mysql-connector-java-5.1.22-bin.jar to Storm's lib. Then restart Storm.
After completing the steps above, there is one more thing to do: use the kafka-storm0.8 plugin to write a Storm program of your own. Here is a Storm program I made, shared on Baidu network disk. Link: http://pan.baidu.com/s/1jgbp99w Password: 9arq
First, look at the program's topology creation code:
The data operations are mainly in the WordCounter class, which uses only simple JDBC to insert the data:
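The author's WordCounter class is Java and is not reproduced in the text. As a rough illustration of the same "count, then insert" step, here is a sketch that swaps in Python's built-in sqlite3 module for JDBC and MySQL; the table name `word_count` and both function names are assumptions made for this example.

```python
import sqlite3
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def insert_counts(conn, counts):
    """Insert one row per word using a parameterized statement."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_count (word TEXT, total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO word_count (word, total) VALUES (?, ?)",
        sorted(counts.items()),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")   # in-memory database, nothing is written to disk
    insert_counts(conn, count_words(["storm kafka", "kafka flume"]))
    print(conn.execute("SELECT word, total FROM word_count ORDER BY word").fetchall())
```

Plain inserts like these are easy to follow, but combined with message replay they can produce duplicate rows, which is part of why this approach suits a test project better than production.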
The program takes only one argument, the topology name. We are using local mode here, so we pass no arguments and simply check whether the whole flow works end to end:
- storm-0.9.0.1/bin/storm jar storm-start-demo-0.0.1-SNAPSHOT.jar com.storm.topology.MyTopology
Let's look at the log: it prints the data and inserts it into the database.
Then we check the database: the insert succeeded!
Our entire integration is complete at this point. But there is a problem here, which you may or may not have noticed. Since we use Storm for distributed stream computing, the most important things to watch in a distributed setting are data consistency and avoiding dirty data. The test project I provide is therefore suitable only for testing; formal development cannot be handled this way. The suggestion from "Morning color Sky EE" (an online screen name) is to set up a ZooKeeper-based distributed global lock to ensure data consistency and keep dirty data out. For the ZooKeeper client framework we can use Netflix Curator. I have not looked into this part yet, so I can only leave it here.
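To show why a lock helps with the consistency problem described above, here is a single-process sketch. It uses `threading.Lock` purely as a stand-in: a real Storm topology runs across processes and machines, where an in-process lock does nothing and a distributed lock (such as the ZooKeeper-based one suggested above) would be needed. `CountStore` and `run_workers` are hypothetical names.

```python
import threading


class CountStore:
    """Toy shared store; stands in for the database table the workers write to."""

    def __init__(self):
        self.counts = {}
        self._lock = threading.Lock()   # in-process stand-in for a global lock

    def increment(self, word):
        # The read-modify-write below must not interleave between workers,
        # otherwise one worker can overwrite another's update.
        with self._lock:
            current = self.counts.get(word, 0)
            self.counts[word] = current + 1


def run_workers(store, word, workers=4, updates_per_worker=1000):
    """Run several threads that all increment the same word, then return its count."""

    def work():
        for _ in range(updates_per_worker):
            store.increment(word)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.counts[word]
```

With the lock held around the whole read-modify-write, four workers doing 1000 updates each always end at 4000; the distributed version follows the same pattern, with lock acquisition going through ZooKeeper instead of local memory.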
Big data architecture: Flume-NG + Kafka + Storm + HDFS real-time system combination