Kafka: A Powerful Tool for Big Data Processing

Source: Internet
Author: User
Tags: message queue
The log service of the Alliance message push platform currently receives more than two billion requests per day, and the daily average is expected to exceed six billion by the end of the year. At this scale, one big data tool has to be mentioned: Kafka. What is Kafka? The author of the novel "The Metamorphosis"? Not in this case. The Kafka discussed here is a very popular piece of open source software; if you have followed technology trends over the past two years, you will have noticed that Kafka comes up often at major technical conferences.

More precisely, Kafka is a distributed, high-throughput, persistent log service that LinkedIn developed in 2010. In the development team's own words: "We designed Kafka to be able to act as a unified platform for handling the real-time data feeds a large company might have." At first, Kafka was used only to collect and transport massive volumes of logs inside LinkedIn. Thanks to its well-designed production and consumption model and its strong performance, it was widely adopted as soon as it was open-sourced, and mainstream Internet companies now use it as a data pipeline or messaging system.

In Kafka, producers push data to brokers (the Kafka cluster), and consumers then pull it into individual data pipelines or other business tiers. Along the way the data is persisted on the brokers' disks, so each downstream process can consume it independently. Kafka's topology is shown in the diagram (from InfoQ):


Notably, Kafka's consumption model implements both the queue model and the publish-subscribe (pub-sub) model. Consumers in the same group consume mutually exclusive subsets of the data, so Kafka behaves like a high-performance message queue. At the same time, multiple consumer groups can subscribe to the same data, which makes it easy to branch a new pipeline off an existing one.
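The grouping behavior described above can be sketched in a few lines. This is a toy model rather than Kafka's own assignment logic: the function name `assign_partitions`, the group and consumer names, and the round-robin strategy are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_partitions(partitions, groups):
    """Return {group: {consumer: [partitions]}} using round-robin assignment.

    Within a group each partition goes to exactly one consumer (queue
    semantics); every group receives the full partition set (pub-sub
    semantics).
    """
    assignment = {}
    for group, consumers in groups.items():
        per_consumer = defaultdict(list)
        for i, partition in enumerate(partitions):
            owner = consumers[i % len(consumers)]
            per_consumer[owner].append(partition)
        assignment[group] = dict(per_consumer)
    return assignment


partitions = [0, 1, 2, 3]
groups = {
    "realtime": ["rt-1", "rt-2"],  # two consumers split the partitions
    "offline": ["batch-1"],        # a separate group sees every partition
}
plan = assign_partitions(partitions, groups)
```

Within "realtime" no partition is handed to both consumers, while "offline" independently covers all four partitions. That is the sense in which one topic can act as a queue and as a pub-sub channel at the same time.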
Kafka was designed from the start around the data volumes that the backend services of a top-tier Internet company can generate, so its performance is excellent:
  • Throughput is extremely high: a single machine supports on the order of 100k QPS.
  • Data is persisted to disk with O(1) time complexity, so access time stays constant even with terabytes of stored data.
  • Kafka is a distributed system that supports scaling out without downtime.
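One way to picture the constant-time claim is an append-only log in which every record is addressed by its offset. The sketch below is a deliberately simplified in-memory model: the class name `PartitionLog` is hypothetical, and a Python list stands in for Kafka's on-disk files.

```python
class PartitionLog:
    """Toy append-only log; a list stands in for the on-disk files."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Append at the tail and return the new record's offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; nothing is deleted."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for line in ["open", "click", "close"]:
    log.append(line)

# Two independent readers keep their own offsets over the same data.
realtime_offset, offline_offset = 2, 0
realtime_batch = log.read(realtime_offset)
offline_batch = log.read(offline_offset)
```

Writes only ever touch the tail, and the cost of a read depends on how many records are returned, not on how much data the log already holds. Because reading does not remove anything, each consumer can move through the log at its own pace.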
Out of the box, Kafka can serve as high-performance message middleware, but it is also widely used as part of heterogeneous data processing platforms. In the Alliance message push platform, Kafka serves as the data bus: it persists the massive volume of logs coming from the mobile SDK and decouples them from the back-end data processing. Real-time computing, offline computing, and other business-specific data services all use Kafka as their data source. Real-time computation uses Spark to calculate the metrics developers care about, such as delivery rate and open rate. Offline computation uses MapReduce (MR) jobs to compute user tags and device attributes.


To monitor Kafka in production, you can use Yahoo's Kafka Manager, which is built on the Play Framework. If you only care about a few core metrics, such as the message backlog in Kafka, Kafka's bundled tools are enough. Here is an example of checking the backlog of a Kafka queue:

As shown in the figure, Kafka's ConsumerOffsetChecker is run with a group ID, a topic, and a ZooKeeper connection string; the Lag column in the output gives the number of backlogged messages in each partition.
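The Lag value itself is simple arithmetic: the partition's log size minus the offset the consumer group has committed. A minimal sketch follows, assuming made-up per-partition numbers; they are not taken from the original figure, and the function name `partition_lag` is hypothetical.

```python
def partition_lag(log_size, committed_offset):
    """Messages written to the partition but not yet consumed by the group."""
    return max(log_size - committed_offset, 0)


# Hypothetical snapshot: partition id -> (log size, group's committed offset).
snapshot = {0: (1500, 1500), 1: (1620, 1400), 2: (1580, 1575)}

lags = {pid: partition_lag(size, offset)
        for pid, (size, offset) in snapshot.items()}
total_lag = sum(lags.values())
```

A lag that keeps growing on one partition usually means the consumer responsible for it cannot keep up, which is why this single column is often enough for basic alerting.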

Note that the unit of data in Kafka is the message, and when a distributed service consumes messages, Kafka does not guarantee their order across partitions; order is preserved only within a single partition. If ordering matters, route every message that must stay in order to the same partition by a specified key, for example with a consistent hashing scheme.

Developers are welcome to follow our microblog, "Friends of the league push", for product updates and technology sharing.
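To make the ordering point above concrete, here is a minimal sketch of key-based routing. It uses a plain hash-modulo mapping with `zlib.crc32` as a stand-in hash, not a true consistent hashing ring and not Kafka's own partitioner; the function name `partition_for` and the device keys are assumptions for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition; the same key always lands on the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("device-42", "sent"), ("device-7", "sent"), ("device-42", "opened")]
num_partitions = 4

# Append each event to the partition chosen by its key, in arrival order.
partitions = {p: [] for p in range(num_partitions)}
for key, payload in events:
    partitions[partition_for(key, num_partitions)].append((key, payload))
```

Every event for "device-42" ends up in one partition in the order it was produced, so a single consumer of that partition sees "sent" before "opened". With hash-modulo, changing the partition count remaps most keys; a consistent hashing scheme reduces that reshuffling.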
