Kafka is a distributed, high-throughput, open-source messaging service with partitioned storage and message replication. It provides the functionality of a messaging system, but with a unique design.
Kafka was originally developed at LinkedIn and is written in Scala. LinkedIn used it as the processing tool for activity stream data and operational data, where activity stream data refers to page views, content viewed, search queries, and so on, and operational data refers to server performance data (CPU and IO usage, request times, service logs, and so on).
Kafka has since been adopted by many different kinds of companies as a processing tool or message queuing service for various internal data. Today, Kafka is an open-source project donated to the Apache Software Foundation.
Let's review the basic elements of the messaging system:
1. Topic: Kafka maintains feeds of messages in categories called topics.
2. Producer: a process that publishes messages to a Kafka topic is called a producer.
3. Consumer: a process that subscribes to topics and consumes the published messages is called a consumer.
4. Broker: Kafka runs as a cluster of one or more servers, and each server in the cluster is called a broker (broker as in intermediary or agent).
So, from a macro point of view, producers publish messages over the network to the Kafka cluster, and the Kafka cluster serves those messages to consumers. The processing flow is shown below:
Producers, consumers, and the Kafka brokers communicate over the TCP protocol. Kafka clients are available in a variety of languages, such as Java, C++, Python, PHP, Ruby, .NET, Scala, and Erlang, for interacting with Kafka.
We call the party that publishes messages the producer, the party that subscribes to messages the consumer, and the intermediary storage service the broker; the broker is our Kafka server. Producers push data to the broker in push mode, and consumers pull data from the broker in pull mode, as shown below:
It is important to note that consumers pull data from the broker on their own initiative; the broker does not proactively send data to consumers.
In practice, producers, consumers, and Kafka brokers are generally deployed as a cluster. Multiple producers, consumers, and brokers cooperate, coordinated through ZooKeeper, to form a high-performance distributed message publish-and-subscribe system. Within a cluster they are structured as shown below:
The entire distributed messaging system will run according to the following process:
1. Start ZooKeeper.
2. Start the Kafka brokers.
3. Write a producer client that produces data, locates a broker through ZooKeeper, and then sends the data to that broker.
4. Write a consumer client that consumes data, finds the corresponding broker through ZooKeeper, and then consumes messages from that broker (see the code sketch after this list).
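As a concrete illustration of steps 3 and 4, here is a minimal sketch using the Kafka Java client. Note that modern clients connect through a bootstrap.servers list rather than looking brokers up in ZooKeeper themselves; the topic name demo-topic, the group name demo-group, and the address localhost:9092 are assumptions made for this example.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class QuickStart {
    public static void main(String[] args) {
        // Step 3: a producer publishes one message to the topic.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");           // assumed broker address
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "hello kafka"));
        }

        // Step 4: a consumer subscribes to the topic and pulls messages.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");                        // consumer group name (assumed)
        c.put("key.deserializer", StringDeserializer.class.getName());
        c.put("value.deserializer", StringDeserializer.class.getName());
        c.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```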
Above, we gave a general description of Kafka to get a macro-level understanding of what it is and how it basically works. Next, let's look at some of the key elements involved in Kafka and their characteristics:
Topic is a high-level abstraction that Kafka provides. A topic is a category or feed name to which messages are sent. For each topic, Kafka maintains a partitioned log, as shown in the following anatomy of a topic:
Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition is assigned a sequential ID called the offset, which uniquely identifies the message within that partition.
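Because each message in a partition is identified by its offset, a consumer can also read from an explicit position. The sketch below is a hedged example using the standard Java consumer; the topic name, partition number, and starting offset are assumptions for illustration. It assigns partition 0 directly and seeks to a chosen offset before polling.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReadFromOffset {
    public static void main(String[] args) {
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        c.put("key.deserializer", StringDeserializer.class.getName());
        c.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            TopicPartition partition0 = new TopicPartition("demo-topic", 0);
            consumer.assign(Collections.singletonList(partition0)); // read one partition directly
            consumer.seek(partition0, 42L);                         // start at offset 42 (arbitrary choice)
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```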
The Kafka cluster retains all published messages for a configurable period, whether or not they have been consumed. For example, if retention is set to two days, a message remains available for consumption for two days after it is published; after two days it is discarded to free up space.
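Retention can be set per topic (or broker-wide). As a hedged sketch of the two-day example above, the code below uses the Kafka AdminClient to create a topic whose retention.ms is set to two days; the topic name, partition count, replication factor, and broker address are assumptions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1, messages kept for 2 days
            NewTopic topic = new NewTopic("page-views", 3, (short) 1)
                    .configs(Map.of("retention.ms", String.valueOf(2L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```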
There are several reasons for the partition design in Kafka. First, partitioning keeps each log from exceeding the file capacity limits of a single machine; a topic can have many partitions, so it can store an arbitrary amount of data. Second, partitioning increases the capacity for concurrent consumption: the partitions of a topic can be distributed across multiple machines in the Kafka cluster, with each Kafka instance handling the requests and operations for the partitions on its machine. Each partition can also be configured with a number of replicas, which are copied to other machines in the cluster to improve availability.
Each partition has one server acting as its leader and zero or more servers acting as followers. The leader handles all read and write requests for the partition, while the followers replicate the leader to stay in sync. If the leader fails, one of the followers automatically becomes the new leader.
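To see this leader/follower layout for a given topic, the AdminClient can describe it. This is a hedged sketch: the topic name and broker address are assumptions, and the output simply prints the leader, replicas, and in-sync replicas of each partition.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("page-views"))
                    .all().get().get("page-views");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader handles reads/writes; replicas/isr are the followers keeping copies
                System.out.printf("partition %d: leader=%s, replicas=%s, isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```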
Producers send messages to the topics of their choice. A producer can also decide which partition of the topic each message goes to, for example by round-robin or by some other partitioning algorithm, as illustrated below.
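For instance, with the Java producer a record can carry a key (records with the same key always land in the same partition), name a partition explicitly, or carry no key at all, in which case the client's default partitioner spreads records across partitions (round-robin or sticky batching, depending on the client version). The topic name, keys, and values below are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitionedProducer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // No key: the default partitioner spreads records over the topic's partitions.
            producer.send(new ProducerRecord<>("page-views", "viewed /home"));
            // With a key: all records for "user-42" go to the same partition.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /cart"));
            // Explicit partition number (here 0) overrides key-based placement.
            producer.send(new ProducerRecord<>("page-views", 0, "user-42", "viewed /checkout"));
        }
    }
}
```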
Traditional messaging systems have two modes:
One is point-to-point messaging, based on queues.
The other is publish-and-subscribe messaging, based on topics.
A queue-based point-to-point message can only be consumed by one consumer, while a topic-based publish-subscribe message is consumed by multiple consumers. Kafka abstracts over both patterns with the consumer group concept: consumers are divided into groups, where each consumer can be in its own separate consumer group, or multiple consumers can belong to the same group.
If you want queue-based point-to-point messaging, all consumers should be in the same group; if you want topic-based publish-subscribe messaging, each consumer should be in a different group.
More commonly, a topic has several consumer groups, each consumer group being one logical subscriber and each group containing multiple consumer instances, which gives better scalability and fault tolerance. In this respect Kafka has a more robust subscription mechanism than traditional messaging systems.
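In the Java client, the group is simply the group.id setting: consumers sharing a group.id split the topic's partitions between them (queue semantics), while consumers in different groups each receive every message (publish-subscribe semantics). The sketch below is a hedged illustration with assumed topic, broker, and group names; each consumer would then call poll() in its own thread, since a KafkaConsumer instance is not thread-safe.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

public class GroupSemantics {
    // Builds a consumer that belongs to the given consumer group and subscribes to the topic.
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        c.put("group.id", groupId);
        c.put("key.deserializer", StringDeserializer.class.getName());
        c.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
        consumer.subscribe(Collections.singletonList("page-views"));
        return consumer;
    }

    public static void main(String[] args) {
        // Queue-style: both consumers share one group, so each message is handled by only one of them.
        KafkaConsumer<String, String> worker1 = consumerInGroup("billing");
        KafkaConsumer<String, String> worker2 = consumerInGroup("billing");

        // Publish-subscribe-style: a consumer in a different group gets its own full copy of the stream.
        KafkaConsumer<String, String> auditor = consumerInGroup("audit");
    }
}
```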
At a high level, Kafka provides the following guarantees:
1. Messages sent by a producer to a topic partition are appended to the log in the order they are sent.
2. A consumer sees messages in the same order in which they are stored in the log.
3. If a topic has N replicas, up to N-1 servers can fail without losing any messages.
Some common application scenarios for Kafka:
1. Messaging: replacing a traditional message system to decouple systems or buffer data waiting to be processed. Kafka offers better throughput plus built-in partitioning, replication, and fault tolerance, making it a good solution for large-scale message processing.
2. Website activity tracking: page views, searches, and other user activities such as registration, top-ups, payments, and purchases can be published to central topics, with one topic per activity type. These streams can then be subscribed to by consumers for real-time processing or monitoring, or loaded into Hadoop for offline processing.
3. Metrics: aggregating operational monitoring data from many distributed services into centralized feeds of statistics.
4. Log aggregation: serving as a replacement for log aggregation solutions such as Scribe or Flume.
5. Stream processing: processing data in stages, where raw data is consumed from Kafka, processed, and then published back to Kafka.
6. Event sourcing: recording application state changes as a time-ordered sequence of events, so that events can be traced back.
7. Commit log: serving as an external commit-log storage medium for distributed systems.
Of course, there may be other, more creative ways to use Kafka.