This article is a summary of my own study notes, kept for later review. If you spot any mistakes, please don't hesitate to point them out.
Parts of this article draw on the following blog post: http://blog.csdn.net/ymh198816/article/details/51998085
Flume + Kafka + Storm + Redis: Basic Architecture of a Real-Time Analysis System
1) The overall flow of the real-time analysis system is as follows.
2) The order server of the e-commerce system first generates the order log.
3) Flume then monitors the order log,
4) captures each log entry in real time and writes it into the Kafka messaging system.
5) The Storm system then consumes the messages from Kafka.
6) The consumption record (the offset) is managed by the ZooKeeper cluster, so that even after a Kafka outage the last consumption record can be found and consumption can resume from the Kafka broker at the point where it stopped. However, both "consume first, then record the offset" and "record the offset first, then consume" are non-atomic operations. If a crash happens between the two steps, for example right after a message has been consumed but before its offset has been written to ZooKeeper, a small amount of data will be lost or consumed twice. One suggested mitigation is to deploy the Kafka broker and ZooKeeper on the same machine.
7) Next, a user-defined Storm topology analyzes the log information and writes the results to the Redis cache database (which can also persist data). Finally, a web app reads the analyzed order information from Redis and presents it to the user.
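The loss-versus-duplication trade-off described in step 6 can be illustrated with a small simulation. This is only a sketch in plain Python, not Kafka or Storm code: the `store` dictionary stands in for the offset kept in ZooKeeper, and the names `consume`, `run_with_one_crash`, and `Crash` are hypothetical helpers introduced for illustration.

```python
class Crash(Exception):
    """Simulated consumer failure between the two non-atomic steps."""


def consume(log, store, processed, commit_first, crash_at=None):
    """Consume from the stored offset; optionally crash between the two steps."""
    while store["offset"] < len(log):
        i = store["offset"]
        if commit_first:
            store["offset"] = i + 1      # step 1: record the offset
            if i == crash_at:
                raise Crash()            # dies before processing message i
            processed.append(log[i])     # step 2: process the message
        else:
            processed.append(log[i])     # step 1: process the message
            if i == crash_at:
                raise Crash()            # dies before recording the offset
            store["offset"] = i + 1      # step 2: record the offset


def run_with_one_crash(log, commit_first, crash_at):
    """Run a consumer that crashes once, then restarts from the stored offset."""
    store = {"offset": 0}                # stand-in for the offset in ZooKeeper
    processed = []
    try:
        consume(log, store, processed, commit_first, crash_at)
    except Crash:
        pass
    consume(log, store, processed, commit_first)  # restart, no further crash
    return processed


log = ["order-1", "order-2", "order-3"]
lost = run_with_one_crash(log, commit_first=True, crash_at=1)
duplicated = run_with_one_crash(log, commit_first=False, crash_at=1)
print(lost)        # ['order-1', 'order-3']
print(duplicated)  # ['order-1', 'order-2', 'order-2', 'order-3']
```

Recording the offset first loses the message that was in flight at the time of the crash, while processing first causes that message to be processed twice after the restart.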
The reason for adding a Kafka messaging layer between Flume and Storm is that, under high concurrency, the volume of order logs grows explosively. If Storm's consumption rate falls behind the rate at which logs are produced (Storm's real-time computing speed is among the fastest, but there are exceptions, and Twitter's open-source real-time computing framework Heron is said to be faster than Storm), then, together with Flume's own limitations, a large amount of data will inevitably lag or be lost. Kafka is therefore added as a data buffer. Kafka is a messaging system based on log files, which means messages can be persisted to disk, and because it takes full advantage of the I/O characteristics of Linux it provides considerable throughput. Redis is used as the database in this architecture because of its high read and write speed in real-time environments.
Flume Compared with Kafka
(1) Both Kafka and Flume are log systems. Kafka is distributed message middleware with its own storage, and it provides both push and pull interfaces for accessing data. Flume is divided into three parts: agent (data collector), collector (simple processing and writing), and storage (the storage layer), each of which can be customized. For example, the agent can use RPC (Thrift-RPC), text (file), and so on as its source, and the storage can be specified as HDFS.
(2) Kafka is the better fit as a log cache, but Flume's data collection is well done: many data sources can be configured, which reduces the amount of development. This is why the Flume + Kafka pattern is popular. If you want to use Flume's ability to write to HDFS, the Kafka + Flume arrangement can also be used.
Flume
- Flume is a log system that was open-sourced in July 2009. It has a wide range of built-in components that users can use with little or no additional development. It is a distributed log collection system that collects data from individual servers and sends it to a designated location, such as HDFS.
- Flume Features
1) Reliability
When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent that receives the data first writes the event to disk and deletes it once the transfer succeeds; if sending fails, the data can be resent), store on failure (the strategy also adopted by Scribe: when the data receiver crashes, the data is written locally and sending resumes after recovery), and best effort (after the data is sent to the receiver, no confirmation is performed).
2) Scalability
Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed by the master, which makes the system easy to monitor and maintain. Multiple masters are allowed (with ZooKeeper used for management and load balancing), which avoids a single point of failure.
3) Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. On the master, users can view the execution status of individual data sources or data flows, and individual data sources can be configured and loaded dynamically.
4) Functional extensibility
Users can add their own agents, collectors, or storage as needed.
3. Flume Architecture
Flume uses a layered architecture consisting of three layers: agent, collector, and storage. The agent and the collector each consist of two parts, a source and a sink: the source is where the data comes from, and the sink is where the data goes.
The core of Flume is the agent process, which is a Java process running on a server node.
Agent: sends data from a data source to the collector.
Collector: aggregates data from multiple agents and loads it into storage. Its source and sink are similar to those of an agent.
Storage: the storage system, which can be an ordinary file or HDFS, Hive, HBase, and so on.
Source (data source): used to collect various kinds of data.
Channel: temporarily stores data; it can be backed by memory, JDBC, a file, and so on.
Sink: sends data to destinations such as HDFS, HBase, and so on.
The basic unit of data that Flume transmits is the event. Transactions are guaranteed at the event level, and the event encapsulates the transmitted data.
The channel deletes the temporary data only after the sink has successfully sent the data in the channel, which guarantees the reliability and safety of data transmission.
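The rule that a channel drops an event only after the sink has delivered it can be sketched as follows. This is a plain Python illustration, not Flume's actual Java API: `MemoryChannel`, `drain`, and `flaky_sink` are hypothetical names, and the event is modeled as a simple dictionary with headers and a body.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel that buffers events between source and sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def peek(self):
        return self._events[0] if self._events else None

    def commit(self):
        """Remove the head event; called only after a successful send."""
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, send):
    """Deliver events through `send`, removing each one only on success."""
    delivered = 0
    while len(channel):
        event = channel.peek()
        try:
            send(event)
        except OSError:
            break                # keep the event in the channel for a retry
        channel.commit()
        delivered += 1
    return delivered


received = []
failures = {"remaining": 1}      # the destination fails once on order-2


def flaky_sink(event):
    if event["body"] == "order-2" and failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise OSError("destination unavailable")
    received.append(event["body"])


channel = MemoryChannel()
for body in ("order-1", "order-2", "order-3"):
    channel.put({"headers": {}, "body": body})

first = drain(channel, flaky_sink)       # delivers order-1, stops at order-2
pending_after_first = len(channel)       # order-2 and order-3 are still buffered
second = drain(channel, flaky_sink)      # the retry delivers the rest
```

Because the event is removed only after `send` returns, a failed delivery leaves it in the channel, and nothing is lost when the sink retries.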
4. Generalized usage of Flume
Flume supports multi-level Flume agents, that is, a sink can write data to the source of the next agent.
Flume also supports fan-in (a source can accept multiple inputs) and fan-out (a sink can output data to multiple destinations).
A more complex example is the following: there are 6 agents and 3 collectors, and all collectors import data into HDFS. Agents A and B send data to Collector A, agents C and D send data to Collector B, and agents E and F send data to Collector C. At the same time, end-to-end reliability is added for each agent, and if Collector A fails, Agent A and Agent B will send their data to Collector B and Collector C respectively.
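The failover behavior in this example, where Agent A and Agent B switch to Collector B and Collector C when Collector A fails, can be sketched as an ordered preference list per agent. This is a plain Python illustration rather than a Flume configuration; `pick_collector` and the `routes` table are hypothetical names.

```python
def pick_collector(preferences, alive):
    """Return the first collector in the preference list that is still alive."""
    for collector in preferences:
        if collector in alive:
            return collector
    return None  # no collector is reachable


# Primary collector first, then the fallback used when the primary is down.
routes = {
    "agent-A": ["collector-A", "collector-B"],
    "agent-B": ["collector-A", "collector-C"],
}

all_up = {"collector-A", "collector-B", "collector-C"}
a_down = {"collector-B", "collector-C"}

normal = {agent: pick_collector(prefs, all_up) for agent, prefs in routes.items()}
failover = {agent: pick_collector(prefs, a_down) for agent, prefs in routes.items()}
print(normal)    # {'agent-A': 'collector-A', 'agent-B': 'collector-A'}
print(failover)  # {'agent-A': 'collector-B', 'agent-B': 'collector-C'}
```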
Kafka
- Kafka is an open source project released in December 2010. It is written in Scala and uses a push/pull architecture, which makes it well suited to transferring data between heterogeneous clusters.
- Kafka Features
Persistent messages: no information is lost, and terabytes of messages can be stored stably
High throughput: Kafka is designed to run on commodity hardware and can deliver millions of messages per second
Distributed architecture: messages can be partitioned
Real-time: messages produced by producer threads are immediately visible to consumers, and the cost of accessing data on disk is O(1)
3. Kafka Architecture
Kafka is essentially a publish-subscribe messaging system. Kafka stores messages by topic; a program that publishes messages to a topic is a producer, and a program that subscribes to messages is a consumer. Kafka runs as a cluster that can consist of one or more servers, each of which is called a broker. Once there is a new message for a topic, the broker makes it available to all consumers that subscribe to that topic. In Kafka, messages are organized by topic, and each topic is divided into multiple partitions, which makes it easier to manage data and balance load. Kafka also uses ZooKeeper for load balancing.
1) Producer
Sends data to the broker.
Kafka provides two producer interfaces:
a) Low-level interface: sends data to a specific partition of a given topic on a particular broker.
b) High-level interface: supports synchronous and asynchronous sending, ZooKeeper-based automatic broker discovery, and load balancing (based on a partitioner). A producer can obtain the list of available brokers from ZooKeeper, or register a listener in ZooKeeper that is triggered when a broker is added or removed, when a new topic is registered, or when a broker registers for an existing topic. When the producer learns of such an event, it can take action as needed.
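The idea of partitioner-based load balancing can be sketched in a few lines. This is an assumption-laden illustration, not the partitioner that any Kafka client actually ships: it hashes the message key with CRC32 so the same key always maps to the same partition, and it spreads keyless messages round-robin. The function name `choose_partition` is hypothetical.

```python
import zlib
from itertools import count


def choose_partition(key, num_partitions, round_robin):
    """Map a message to a partition index in [0, num_partitions)."""
    if key is None:
        # No key: spread messages evenly across partitions.
        return next(round_robin) % num_partitions
    # Keyed message: a stable hash keeps one key on one partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


rr = count()
first = choose_partition("user-42", 3, rr)
again = choose_partition("user-42", 3, rr)
keyless = [choose_partition(None, 3, rr) for _ in range(4)]
print(first == again)  # True
print(keyless)         # [0, 1, 2, 0]
```

Keeping a key on one partition matters because ordering is only guaranteed within a partition, so all messages for one order or user stay in sequence.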
2) Broker
The broker adopts a variety of strategies to improve data processing efficiency, including sendfile and zero-copy techniques.
3) Consumer
Loads the log information into the central storage system.
Kafka provides two consumer interfaces:
a) Low-level interface: maintains a connection to a single broker. The connection is stateless: each time the consumer pulls data from the broker, it tells the broker the offset of the data it wants.
b) High-level interface: hides the details of the brokers, allowing the consumer to pull data from the brokers without having to care about the network topology. More importantly, in most log systems the record of which data a consumer has already fetched is kept by the broker, whereas in Kafka this information is maintained by the consumer itself.
4. Kafka message sending process
1) The producer publishes each message to a partition of the specified topic according to the specified partitioning method.
2) After the cluster receives a message sent by the producer, it persists the message to disk and retains it for a specified length of time, regardless of whether the message has been consumed.
3) The consumer pulls data from the Kafka cluster and controls the offset from which messages are fetched.
Detailed process:
Kafka is a distributed, high-throughput messaging system that supports both the point-to-point and the publish-subscribe consumption patterns.
Kafka is mainly composed of producers, consumers, and brokers. It introduces the concept of a "topic" to manage different kinds of messages, and messages of different categories are recorded in their corresponding topics. Messages entering a topic are persisted by Kafka in log files on disk. Kafka divides the log of each topic into partitions (log shards). Each message is appended sequentially to a partition and is labeled with an "offset" that represents its position in that partition, and messages are immutable in both content and order. One difference between Kafka and other message queue systems is therefore that messages within a partition are consumed in order, but global ordering is not guaranteed unless the entire topic has only one partition.
A message is kept in the log file regardless of whether it has been consumed, and it is deleted to free up space only once it has been retained for the retention period specified in the configuration file. For each Kafka consumer, the only Kafka-related metadata it has to save is the offset, which records how far the consumer has read in the partition. Kafka usually uses ZooKeeper to save the offset of each consumer, so a ZooKeeper cluster is required before Kafka is started. By default Kafka records the offset first and then reads the data, a strategy that can lose a small amount of data. However, users can set a consumer's offset flexibly, and because messages remain in the log file, messages can also be consumed again.
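The behavior described above (an append-only partition, consumer-held offsets, time-based retention, and re-consumption by rewinding the offset) can be sketched with a toy model. This is plain Python, not Kafka's storage format or client API; `PartitionLog` and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```python
class PartitionLog:
    """Toy append-only log for one partition (log shard) with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.base_offset = 0      # offset of the oldest record still retained
        self.records = []         # list of (timestamp, message)

    def append(self, message, now):
        """Append a message and return the offset assigned to it."""
        self.records.append((now, message))
        return self.base_offset + len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return (offset, message) pairs starting at the requested offset."""
        start = max(offset, self.base_offset) - self.base_offset
        chunk = self.records[start:start + max_records]
        return [(self.base_offset + start + i, msg)
                for i, (_, msg) in enumerate(chunk)]

    def expire(self, now):
        """Delete records older than the retention period, consumed or not."""
        while self.records and now - self.records[0][0] > self.retention:
            self.records.pop(0)
            self.base_offset += 1


log = PartitionLog(retention_seconds=60)
offsets = [log.append("a", now=0), log.append("b", now=10), log.append("c", now=50)]

consumer_offset = 0                      # the only state the consumer keeps
batch = log.read(consumer_offset)
consumer_offset = batch[-1][0] + 1       # advance past the last record read

replay = log.read(1)                     # rewinding the offset re-reads old data
log.expire(now=65)                       # "a" is now older than 60 seconds
after_expiry = log.read(0)
print(batch)         # [(0, 'a'), (1, 'b'), (2, 'c')]
print(replay)        # [(1, 'b'), (2, 'c')]
print(after_expiry)  # [(1, 'b'), (2, 'c')]
```

Reading never removes anything from the log, which is why a consumer can replay messages simply by moving its offset back, as long as the retention period has not expired them.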
Partitions and their replicas are distributed across the servers of the cluster. For each partition, one server in the cluster acts as the leader, while the servers holding the other replicas of that partition are followers. The leader handles all requests for the partition, and the followers keep their replicas synchronized with the leader. When the leader server goes down, one of the follower servers is elected as the new leader.
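Leader failover for a single partition can be sketched as follows. This is a deliberately simplified illustration: it promotes the first live replica in list order, whereas a real Kafka cluster elects the new leader from the in-sync replicas through its controller. The function name `elect_leader` and the broker names are hypothetical.

```python
def elect_leader(replicas, alive, current_leader):
    """Keep the current leader if it is alive, otherwise promote a live follower."""
    if current_leader in alive:
        return current_leader
    for broker in replicas:
        if broker in alive:
            return broker
    return None  # no live replica: the partition is unavailable


replicas = ["broker-1", "broker-2", "broker-3"]

unchanged = elect_leader(replicas, {"broker-1", "broker-2", "broker-3"}, "broker-1")
promoted = elect_leader(replicas, {"broker-2", "broker-3"}, "broker-1")
unavailable = elect_leader(replicas, set(), "broker-1")
print(unchanged, promoted, unavailable)  # broker-1 broker-2 None
```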
How data is delivered
1) Socket: the simplest form of interaction, the typical client/server (C/S) model. The transport protocol can be TCP or UDP.
Advantages: easy to program; Java has many frameworks that hide the details; permissions are easy to control, and security can be improved through HTTPS.
Disadvantages: the server and the client must be online at the same time; when a large amount of data is transmitted, network bandwidth is heavily consumed, which can cause connection timeouts.
2) FTP / file-sharing server mode: suited to exchanging large volumes of data.
Advantages: handles large data volumes without timeouts and without tying up network bandwidth; the scheme is simple and avoids concepts related to network transmission and network protocols.
Disadvantages: not suitable for real-time business; a shared server is required, and files may be leaked; the format of the file data must be agreed on in advance.
3) Database sharing mode: systems A and B exchange data by connecting to the same table on the same database server.
Advantages: using the same database makes the interaction simpler; the interaction is flexible, supporting updates and rollbacks; database transactions make the interaction more reliable.
Disadvantages: as more and more systems connect to the database, each system is allotted fewer connections.
In general, two companies will not open their databases to each other, because doing so affects security.
4) Message mode: the Java Message Service (JMS) is a typical implementation of message-based data transfer.
Advantages: JMS defines a specification, so there are many message middleware products to choose from; messaging is flexible and supports synchronous, asynchronous, and reliable message processing.
Disadvantages: JMS has a learning cost for developers; with large data volumes, message backlogs, delays, loss, and even middleware crashes can occur.
1. Message queues
Any problem in software engineering can be solved by adding an intermediate layer.
A message queue is the container in which messages are held while they are in transit. Its main purpose is to provide routing and to guarantee delivery: if the recipient is unavailable when a message is sent, the message queue retains the message until it can be delivered successfully.
2. Roles of message middleware
System decoupling: a problem in service B does not affect service A
Peak load shifting: smooths out request pressure and reduces the peak load on the system
Data exchange: data can be exchanged without exposing the intranets of enterprises A and B
Asynchronous notification: reduces the number of unnecessary polling requests between front-end and back-end services
Timed tasks: for example, generating a payment check task after a 30-minute delay
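The peak load shifting role listed above can be illustrated with a small simulation in which a queue absorbs a burst of requests while a downstream service works at a fixed rate. This is a minimal sketch in plain Python; the function `simulate` and its parameters are hypothetical and the numbers are made up for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick):
    """Return (processed per tick, backlog per tick) for a queue-fed service."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    queue = deque()
    processed, backlog = [], []
    next_id = 0
    tick = 0
    while tick < len(arrivals_per_tick) or queue:
        arrivals = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arrivals):
            queue.append(next_id)        # the burst lands in the queue
            next_id += 1
        done = min(capacity_per_tick, len(queue))
        for _ in range(done):
            queue.popleft()              # the service drains at its own pace
        processed.append(done)
        backlog.append(len(queue))
        tick += 1
    return processed, backlog


processed, backlog = simulate([10, 0, 0, 0], capacity_per_tick=3)
print(processed)  # [3, 3, 3, 1]
print(backlog)    # [7, 4, 1, 0]
```

The downstream service never sees more than 3 requests per tick even though 10 arrive at once; the cost is that the last requests wait in the queue for several ticks.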