Kafka: A sharp tool for big data processing

Last Update:2018-07-26 Source: Internet

Author: User

Tags message queue

The Alliance message push platform's log service currently receives more than 2 billion requests per day, and the daily average is expected to pass 6 billion by the end of the year. Handling that volume brings us to a key tool for big data processing: Kafka. What is Kafka? The author of the novella "The Metamorphosis", of course. But the Kafka discussed here is a very popular piece of open source software. If you have followed technology trends over the past two years, you will have seen Kafka come up regularly at major technology conferences.

More precisely, Kafka is a distributed, high-throughput, persistent log service that LinkedIn developed in 2010. In the development team's own words: "We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have." At first, Kafka was used only to collect and transport massive volumes of logs inside LinkedIn. Thanks to its well-designed production and consumption model and its strong performance, it was widely adopted as soon as it was open-sourced, and mainstream Internet companies now use it as a data pipeline or messaging system.

In Kafka, producers push data to brokers (the Kafka cluster), and consumers then pull it into individual data pipelines or other business tiers. Along the way, the data is persisted on the brokers' hard disks, and each data processing job can consume it independently. Kafka's topology is shown in the diagram (from InfoQ):

It is worth noting that Kafka's consumption model implements both the queue model and the publish-subscribe (pub-sub) model. Consumers in the same group consume disjoint sets of messages, so Kafka behaves like a high-performance message queue. In addition, multiple consumers in different groups can subscribe to the same data, which makes it easy to branch a new pipeline off an existing one.
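
The two consumption models can be illustrated with a small in-memory simulation. This is only a sketch, not Kafka client code: the `Topic` class, the round-robin `assign` function, and the group names `billing` and `audit` are all hypothetical, and real Kafka handles assignment and offset storage very differently.

```python
from collections import defaultdict


class Topic:
    """In-memory stand-in for a partitioned, append-only topic (hypothetical)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # (group, partition) -> next offset that group will read
        self.offsets = defaultdict(int)

    def produce(self, partition, message):
        self.partitions[partition].append(message)

    def poll(self, group, partition):
        """Return the messages this group has not yet read on one partition."""
        start = self.offsets[(group, partition)]
        messages = self.partitions[partition][start:]
        self.offsets[(group, partition)] = len(self.partitions[partition])
        return messages


def assign(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


def consume_group(topic, group, consumers):
    """Each consumer polls only the partitions assigned to it."""
    result = {}
    for consumer, parts in assign(len(topic.partitions), consumers).items():
        result[consumer] = [m for p in parts for m in topic.poll(group, p)]
    return result


topic = Topic(num_partitions=4)
for i in range(8):
    topic.produce(i % 4, f"msg-{i}")

# Queue model: two consumers in one group split the messages between them.
billing = consume_group(topic, "billing", ["c1", "c2"])
# Pub-sub model: a second group independently sees every message.
audit = consume_group(topic, "audit", ["a1"])
```

Because offsets are tracked per group, adding the `audit` group does not affect what `billing` reads, which is the property that makes branching a new pipeline cheap.
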
Kafka was designed from the start around the data volumes that the back-end services of a top-tier Internet company produce, so its performance is excellent:
Throughput is extremely high: a single machine supports QPS on the order of 100k.
Data is persisted to hard disk with O(1) time complexity, so even TB-scale data can be accessed with constant-time performance.
Kafka is a distributed system that supports horizontal scaling without downtime.
Out of the box, Kafka can serve as high-performance message middleware, but it is more often used in heterogeneous data processing platforms. In the Alliance message push platform, Kafka serves as the data bus. It is directly responsible for persisting the massive logs sent by the mobile SDK and for decoupling them from back-end data processing. Real-time computing, offline computing, and other business-specific data services all use Kafka as their data source. Real-time computing uses Spark to calculate metrics that developers care about, such as delivery rate and open rate. Offline computing uses MapReduce (MR) jobs to compute user tags and device attributes.

To monitor Kafka in production, you can use Yahoo's Kafka Manager, which is built on the Play Framework. If you only care about a few core metrics, such as how much data has accumulated in Kafka, you can also use the tools that ship with Kafka. Here is an example of checking the backlog of a Kafka queue:

As shown in the figure, Kafka's ConsumerOffsetChecker is run with the group ID, topic, and ZooKeeper connection specified, and the Lag column in the output gives the number of backlogged messages in each partition.
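
To make the Lag column concrete, here is a minimal sketch of how such a number can be derived, assuming lag is the partition's log end offset minus the offset the consumer group has committed. The `PartitionOffsets` type and the sample values are hypothetical and are not real tool output.

```python
from typing import NamedTuple


class PartitionOffsets(NamedTuple):
    partition: int
    committed_offset: int  # last position the consumer group has committed
    log_end_offset: int    # position of the next message the broker will write


def partition_lag(p: PartitionOffsets) -> int:
    """Messages written to the partition but not yet consumed by the group."""
    return max(p.log_end_offset - p.committed_offset, 0)


def total_lag(partitions) -> int:
    """Backlog summed over all partitions of a topic."""
    return sum(partition_lag(p) for p in partitions)


# Hypothetical values for illustration only.
SAMPLE = [
    PartitionOffsets(partition=0, committed_offset=1200, log_end_offset=1500),
    PartitionOffsets(partition=1, committed_offset=980, log_end_offset=980),
    PartitionOffsets(partition=2, committed_offset=40, log_end_offset=45),
]

if __name__ == "__main__":
    for p in SAMPLE:
        print(f"partition {p.partition}: lag {partition_lag(p)}")
    print("total lag:", total_lag(SAMPLE))
```

A lag that keeps growing on one partition while the others stay near zero usually points at a slow or stuck consumer for that partition rather than at overall under-capacity.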

Note in particular that the unit of data in Kafka is the message, and when a distributed service consumes messages, Kafka does not guarantee their ordering across partitions. If ordering matters, you need to implement a consistent hashing scheme that routes the messages whose order must be preserved to the same partition according to a specified key. Developers are welcome to follow our microblog "Friends of the league push" for more product news and technical topics.
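
The key-based routing described above can be sketched as follows. This is a simplified hash-modulo scheme rather than true consistent hashing: it keeps a key on the same partition only while the partition count stays fixed. The function names and the `device-*` keys are hypothetical, and `zlib.crc32` is an illustrative choice, not the hash that Kafka's own partitioner uses.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    zlib.crc32 is used because Python's built-in hash() for str is
    randomized per process, which would break routing across producers.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical event stream: per-device order must be preserved.
EVENTS = [
    ("device-a", "open"),
    ("device-b", "open"),
    ("device-a", "click"),
    ("device-a", "close"),
    ("device-b", "close"),
]

if __name__ == "__main__":
    for index, partition in enumerate(route(EVENTS, num_partitions=4)):
        print(index, partition)
```

Since every event for a given key lands in one partition, and a partition is read in append order, a consumer sees each device's events in the order they were produced, even though there is no ordering between different devices.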

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
