Kafka Getting Started Guide


Kafka is a distributed streaming platform. What exactly does that mean?

A streaming platform has three key capabilities:
☆ Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
☆ Store streams of records in a fault-tolerant, durable way.
☆ Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:
☆ Building real-time streaming data pipelines that reliably move data between systems or applications.
☆ Building real-time streaming applications that transform or react to streams of data.

To understand how Kafka does these things, let's dive into its capabilities from the bottom up.
First, a few concepts:
☆ Kafka runs as a cluster on one or more servers that can span multiple datacenters.
☆ The Kafka cluster stores streams of records in categories called topics.
☆ Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
☆ The Producer API allows an application to publish a stream of records to one or more Kafka topics.
☆ The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
☆ The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams.
☆ The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.


The communication between Kafka clients and servers is done with a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. A Java client is provided for Kafka, but clients are available in many other languages.

Topics and logs
First, let's dive into the core abstraction Kafka provides for a stream of records: the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log.

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. The records in a partition are each assigned a sequential ID number called the offset, which uniquely identifies each record within the partition.

The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
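As a concrete illustration, here is a minimal sketch of creating such a partitioned topic with a two-day retention period using the Java AdminClient; the broker address, the topic name "orders", and the partition and replication counts are assumptions made for this example:

```java
// Minimal sketch: creating a topic with the Java AdminClient. The broker
// address, topic name, partition count, and two-day retention below are
// illustrative values, not prescriptions.
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions, replication factor 1, records kept for two days.
            NewTopic topic = new NewTopic("orders", 3, (short) 1)
                    .configs(Map.of("retention.ms", String.valueOf(2L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```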

In fact, the only metadata retained on a per-consumer basis is the offset, or position, of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now."
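A minimal sketch of this offset control with the Java consumer follows; the topic name, group id, and broker address are placeholders for this example:

```java
// Minimal sketch of consumer-controlled offsets: rewind one partition to the
// beginning to reprocess old data, or jump to the end to consume from "now".
// Topic name, group id, and broker address are illustrative.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetControlSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replayer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seekToBeginning(Collections.singleton(partition)); // reprocess from the start
            // consumer.seekToEnd(...) would instead start from "now".
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```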
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command-line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism; more on that in a bit.

Distributed
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the leader and zero or more servers that act as followers. The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
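To see this layout from a client, one can ask the AdminClient which broker leads each partition and which brokers hold follower replicas; a hedged sketch, where the topic name and broker address are again assumptions:

```java
// Hypothetical sketch: inspecting which broker leads each partition and which
// brokers hold follower replicas, via the Java AdminClient.
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderInfoSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("orders"))
                    .all().get().get("orders");
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}
```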

Geo-replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to your users or to support data-locality requirements.

Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function, such as one based on the record's key.
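For instance, here is a minimal producer sketch that relies on the default key-based partitioning; the topic "orders" and the keys shown are invented for the example:

```java
// Minimal producer sketch: records with the same key (here, a hypothetical
// customer id) are routed to the same partition by the default partitioner.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, preserving per-key ordering.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
            // A record with no key is spread over partitions to balance load.
            producer.send(new ProducerRecord<>("orders", "heartbeat"));
        }
    }
}
```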

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records are effectively load-balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record is broadcast to all the consumer processes.
  

[Figure: a two-server Kafka cluster hosting four partitions (P0-P3), with two consumer groups; consumer group A has two consumer instances and group B has four.]
More commonly, however, we find that topics have a small number of consumer groups, one for each "logical subscriber." Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a group of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances, so that each instance is the exclusive consumer of a "fair share" of the partitions at any point in time. This process of maintaining group membership is handled dynamically by the Kafka protocol. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
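A minimal sketch of one group member follows; the group name "order-processors" and the topic are assumptions, and starting several copies of this process load-balances the topic's partitions across them:

```java
// Minimal consumer-group sketch: every instance started with the same
// group.id ("order-processors", a name chosen for this example) shares the
// topic's partitions; records are load-balanced across the instances.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // the consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));
            while (true) { // partitions are rebalanced automatically as instances join or leave
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```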
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over all records, this can be achieved with a topic that has only one partition, though that means only one consumer process per consumer group.

Multi-tenancy

You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can be produced to or consumed from. There is also operations support for quotas: administrators can define and enforce quotas on requests to control the broker resources used by clients. For more information, see the security documentation.
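As a sketch of what quota administration can look like programmatically (this assumes the AdminClient quota API introduced in Kafka 2.6; the client id and byte rates are illustrative):

```java
// Hedged sketch (assumes Kafka 2.6+ AdminClient quota APIs): capping the
// produce and fetch byte rates for a client id. Values and the client id
// "analytics-app" are illustrative.
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class QuotaSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "analytics-app"));
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, Arrays.asList(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),   // 1 MiB/s in
                    new ClientQuotaAlteration.Op("consumer_byte_rate", 2_097_152.0))); // 2 MiB/s out
            admin.alterClientQuotas(Collections.singleton(alteration)).all().get();
        }
    }
}
```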

Guarantees
At a high level, Kafka gives the following guarantees:
☆ Messages sent to a particular topic partition by a producer are appended in the order they are sent. That is, if M1 and M2 are sent by the same producer and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
☆ A consumer instance sees records in the order they are stored in the log.
☆ For a topic with replication factor N, we can tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

Kafka as a messaging system

How does Kafka's notion of streams compare with a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer processes, which lets you scale your processing. Unfortunately, queues are not multi-subscriber: once one process reads the data, it is gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing, since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two models. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both of these properties: it can scale processing and it is also multi-subscriber; there is no need to choose one or the other.
Kafka also provides stronger ordering guarantees than a traditional messaging system.
A traditional queue retains records in order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order at different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this with the notion of an "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.
Kafka does better. By having a notion of parallelism, the partition, within topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in a topic to the consumers in a consumer group, so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions, this still balances the load over many consumer instances. Note, however, that there cannot be more consumer instances in a consumer group than there are partitions.

Kafka as a storage system

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault tolerance. Kafka allows producers to wait on acknowledgement, so that a write is not considered complete until it is fully replicated and persisted.
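A minimal sketch of such a fully acknowledged write follows (the topic, key, and broker address are placeholders): with acks=all the broker replies only after the write is replicated, and blocking on the returned future makes the producer wait for that reply.

```java
// Sketch of waiting for full acknowledgment: with acks=all the broker does
// not acknowledge until the write is replicated, and blocking on the returned
// future makes the producer wait for that acknowledgment.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableWriteSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // wait for all in-sync replicas
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("orders", "customer-42", "payment received"))
                    .get(); // block until the replicated write is acknowledged
            System.out.printf("stored at partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```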
The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing clients to control their own read position, you can think of Kafka as a kind of special-purpose distributed filesystem dedicated to high-performance, low-latency, guaranteed log storage and replication.
For more information on Kafka's log storage and replication design, please read this page.

Kafka Stream Processing

It is not enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka, a stream processor is anything that takes continual streams of data from input topics, performs some processing on that input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments and output a stream of reorders and price adjustments computed from that data.
Simple processing can be done directly with the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing, such as computing aggregations off of streams or joining streams together.
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, and so on.
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
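A minimal Streams sketch along these lines follows; the application id and the topic names "sales-input" and "sales-upper" are invented for the example. It consumes one input topic, transforms each value, and writes a continual output stream:

```java
// Minimal Kafka Streams sketch (hypothetical topics "sales-input" and
// "sales-upper"): consumes an input stream, transforms each value, and
// produces an output stream, as described above.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-transformer"); // also the group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("sales-input");
        // Transform each record's value and write the result to the output topic.
        input.mapValues(value -> value.toUpperCase())
             .to("sales-upper");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```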

Combining the functions

This combination of messaging, storage, and stream processing may seem unusual, but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS allows storing static files for batch processing. A system like this effectively allows storing and processing historical data from the past.
A traditional enterprise messaging system allows processing of future messages that arrive after you subscribe. Applications built this way process future data as it arrives.
Kafka combines both of these capabilities, and the combination is critical both for Kafka's use as a platform for streaming applications and for streaming data pipelines.
By combining storage and low-latency subscriptions, streaming applications can treat past and future data in the same way. That is, a single application can process historical, stored data, and rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
Likewise, for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; the ability to store data reliably makes it possible to use it for critical data where delivery must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods for maintenance. The stream processing facilities make it possible to transform the data as it arrives.
For more information about the guarantees, APIs, and capabilities Kafka provides, see the rest of the documentation.


Original link: http://kafka.apache.org/intro
