Kafka Basic Principles and Simple Java Usage

Apache Kafka Learning (I): Kafka Fundamentals

1. What is Kafka

Kafka is a messaging system written in Scala. It was originally developed at LinkedIn as the foundation for LinkedIn's activity stream and operational data processing pipeline. It is now used by many different kinds of companies as a data pipeline and messaging system.

Kafka is a distributed, publish/subscribe messaging system.

Kafka uses ZooKeeper as its distributed coordination framework, which ties together message production, message storage, and message consumption. With the help of ZooKeeper, Kafka's components (producers, consumers, and brokers) can establish the subscription relationship between producers and consumers and balance load between them.

2. Characteristics of Kafka

(1) Message persistence with O(1) time complexity: access performance stays constant even for terabytes of data or more.

(2) High throughput. Even on very inexpensive commodity hardware, a single machine can handle 100K messages per second.

(3) Supports partitioning messages across Kafka servers and distributed consumption, while guaranteeing the order of messages stored and delivered within each partition.

(4) Supports both offline data processing and real-time (online) data processing.

(5) Scale out: supports online horizontal scaling. Machines can be added without downtime.

(6) Supports periodic deletion of data. Data can be deleted by age or by file size.

(7) Consumers use the pull model to consume data, and the consumption state is tracked by the consumer, which reduces the burden on the broker.

3. Kafka Architecture

(1) Broker: Similar to the broker concept in RabbitMQ. A Kafka server is a broker, and a Kafka cluster contains one or more brokers. The broker persists data to the corresponding partition, so there is no cache pressure.

(2) Topic: Every message has a category, and this category is called a topic. A Kafka topic can be thought of as a RabbitMQ queue: messages of the same category are sent to the same topic and are then consumed by that topic's consumers. A topic is a logical concept; its physical realization is the partition.

(3) Partition: A partition is a physical concept. Each topic contains one or more partitions, and each partition is an ordered queue. A message sent to a topic passes through a partitioning algorithm (customizable) that determines which partition stores it. Each record is assigned a sequential id, the offset. Note: Kafka only guarantees that messages are delivered to consumers in order within a single partition; it does not guarantee ordering across a whole topic (multiple partitions).

(4) Replication: Backup copies. Replication is per partition rather than per topic. Each partition has its own replicas, which are distributed across different brokers.

(5) Offset: Kafka's storage files are named by offset, which makes lookup easy. For example, to find the message at position 2049, you only need to locate the file 2048.log. The first offset is of course 00000000000.log. Note: the offsets within each partition form an ordered sequence starting from 0, and partitions do not affect one another.

(6) Producer: The message producer.

(7) Consumer: The message consumer. Consumers pull messages from the broker and maintain their own consumption state, so the business focus in a Kafka system is generally on the consumer, unlike RabbitMQ, where the broker does most of the work.

(8) Consumer Group: Each consumer belongs to a specific consumer group (you can specify a group name for each consumer; a consumer without one belongs to the default group). Each topic can be subscribed to by multiple groups, and each group can contain multiple consumers. A message sent to a topic is consumed by only one consumer in each group, and the consumers in a group share the topic's messages among themselves to achieve load balancing.

(9) Record: A message. Each record consists of a key, a value, and a timestamp.

Note: Within a group, the messages of one partition can be consumed by only one consumer, but one consumer can consume messages from several partitions at the same time. When a group has more consumers than there are partitions, the surplus consumers sit idle. In other words, with only one partition, only one consumer in a group does any work no matter how many you start. The number of partitions determines how far a topic's load can be balanced within a group: with 4 partitions, for example, at most 4 consumers in the same group can share the consumption.
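The rule in the note above can be illustrated with a small sketch. This is not Kafka's actual assignment code: the class GroupAssignmentSketch and its round-robin method assign are hypothetical names used only for illustration. Each partition is handed to exactly one consumer in the group, and consumers beyond the partition count receive nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Kafka's real partition assignor.
class GroupAssignmentSketch {

    // Distributes partitions 0..partitionCount-1 round-robin over the
    // consumers of a single group. Each partition gets exactly one owner.
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment; // nobody to assign to
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two consumers share four partitions.
        System.out.println(assign(Arrays.asList("c1", "c2"), 4));
        // Five consumers, four partitions: the fifth consumer stays idle.
        System.out.println(assign(Arrays.asList("c1", "c2", "c3", "c4", "c5"), 4));
    }
}
```

With 4 partitions and 5 consumers, the list for c5 comes back empty, which is the idle surplus consumer the note describes.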

Kafka internal structure diagram (image from the Internet)

Kafka topology diagram (image from the Internet)

4. Topic and partition file storage

4.1. Relationship between topics and partitions

A topic can logically be regarded as a queue. Every message must specify its topic, which can be understood simply as stating which queue the message goes into. So that Kafka's throughput can scale linearly, a topic is physically divided into one or more partitions, and each partition corresponds physically to a folder that stores all of that partition's messages and index files. If you create two topics, topic1 and topic2, with 13 and 19 partitions respectively, a total of 32 folders are created across the cluster. A partition is named with the topic name plus a sequence number: the first partition's number is 0, and the largest number is the partition count minus 1.
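The naming rule can be sketched in a few lines of Java. The class PartitionDirsSketch and the method partitionDirs are hypothetical, and the hyphen between topic name and partition number is an assumption based on the topic-N folder layout Kafka uses on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the partition folder names described above.
class PartitionDirsSketch {

    // Folder name = topic name + "-" + partition number, numbered from 0
    // up to (partitions - 1). The "-" separator is an assumption.
    static List<String> partitionDirs(String topic, int partitions) {
        List<String> dirs = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            dirs.add(topic + "-" + p);
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>(partitionDirs("topic1", 13));
        all.addAll(partitionDirs("topic2", 19));
        System.out.println(all.size());   // 13 + 19 folders in total
        System.out.println(all.get(0));   // first folder of topic1
        System.out.println(all.get(31));  // last folder of topic2
    }
}
```

For the example in the text this yields 32 names, from topic1-0 through topic2-18.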

4.2. Characteristics of partition file storage

(1) Each partition directory is equivalent to one huge file that is split evenly into multiple segment data files of equal size. The number of messages in each segment file is not necessarily equal, however. This property makes it easy to delete old segment files quickly.

(2) Each partition only needs to support sequential reads and writes. The lifecycle of a segment file is determined by server-side configuration parameters.

(3) Segment file composition: a segment consists of two parts, an index file (suffix ".index") and a data file (suffix ".log"). The two files correspond one to one and always appear as a pair.

(4) Segment file naming rule: the first segment of a partition starts from 0, and each subsequent segment file is named after the offset of the last message in the previous segment file. The largest possible value is that of a 64-bit long, which has 19 decimal digits, and unused leading positions are padded with 0 (the example file names later in this article are 20 characters wide).
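As a minimal sketch of the naming rule, the zero padding can be produced with String.format. The class SegmentNameSketch and the method segmentFileName are hypothetical names; the width of 20 is an assumption taken from the example file names used in section 4.3 of this article, such as 00000000000000368769.log.

```java
// Hypothetical helper illustrating the zero-padded segment file names.
class SegmentNameSketch {

    // Pads the offset with leading zeros to 20 characters (width assumed
    // from the example names in this article) and appends the suffix.
    static String segmentFileName(long baseOffset, String suffix) {
        return String.format("%020d", baseOffset) + suffix;
    }

    public static void main(String[] args) {
        // The index and data file of a segment share the same numeric name.
        System.out.println(segmentFileName(0L, ".index"));
        System.out.println(segmentFileName(368769L, ".index"));
        System.out.println(segmentFileName(368769L, ".log"));
    }
}
```

Because every name has the same width, sorting the file names as plain strings gives the same order as sorting them by offset.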

Segment file diagram (image from the Internet)

Taking one pair of segment files from Figure 2 above as an example, the physical structure of the corresponding index and log files is shown below:

Index and log file correspondence diagram (image from the Internet)

Take the index entry (3, 497) as an example: it refers to the 3rd message in the data file (the 368772nd message in the partition as a whole), and the physical offset of that message within the file is 497.

4.3. How to find a message by offset in a partition

For example, to read the message with offset=368776, you go through the following two steps.

(1) Step one: find the segment file

In the illustration above, 00000000000000000000.index is the first file, and its starting offset is 0. The second file, 00000000000000368769.index, starts at message 368770 = 368769 + 1. Likewise, the third file, 00000000000000737337.index, starts at message 737338 = 737337 + 1. Because the files are named and ordered by offset, a binary search over the file list quickly locates the right file. For offset=368776 this is 00000000000000368769.index|log.
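Step one can be sketched as a binary search over the sorted file names. This is only an illustration under the article's own convention (a file named N, other than the first, starts at message N + 1); the class SegmentLookupSketch and the method findSegment are hypothetical, and the array is assumed to be non-empty.

```java
// Hypothetical sketch of step one: pick the segment file for a given offset.
class SegmentLookupSketch {

    // baseNames holds the numeric part of each segment file name, sorted
    // ascending and assumed non-empty. Following the article's convention,
    // we want the last name strictly below the target offset, falling back
    // to the first file when no such name exists.
    static long findSegment(long[] baseNames, long offset) {
        int lo = 0;
        int hi = baseNames.length - 1;
        int found = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (baseNames[mid] < offset) {
                found = mid;      // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;     // this file starts after the offset
            }
        }
        return baseNames[found];
    }

    public static void main(String[] args) {
        // The three file names from the example above.
        long[] names = {0L, 368769L, 737337L};
        long segment = findSegment(names, 368776L);
        System.out.println(String.format("%020d", segment) + ".index|log");
    }
}
```

For offset=368776 the search settles on 368769, the same file pair named in the text.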

(2) Step two: find the message within the segment file

Having located the segment file in step one, for offset=368776 look up the index entry in 00000000000000368769.index to obtain the physical position in 00000000000000368769.log, then scan 00000000000000368769.log sequentially from that position until offset=368776 is reached.

4.4. Partition distribution rules in a Kafka cluster

First, let's look at a command that creates a topic on Linux:

bin/kafka-topics.sh --create --zookeeper ip1:2181,ip2:2181,ip3:2181,ip4:2181 --replication-factor 2 --partitions 4 --topic test

This command creates a topic named test on a Kafka cluster of four brokers, with 4 partitions and a replication factor of 2 (this is easy to get confused about: a replication factor of 2 means the leader and the follower together add up to 2). At this point there are 8 partition replicas in total across the four machines, as shown in the figure.

Kafka cluster partition distribution, Figure 1 (image from the Internet)

When 2 new nodes are added to the cluster and the number of partitions is increased to 6, the distribution is as follows:

Kafka cluster partition distribution, Figure 2 (image from the Internet)

In a Kafka cluster, every broker has an equal chance of being assigned leader partitions.

In the broker partition diagram above, an arrow points to a replica. Take partition-0 as an example: partition-0 on broker1 is the leader, and partition-0 on broker2 is the replica. Each broker (in broker id order) is assigned a leader partition in turn, and the next broker holds the replica. The assignment cycles round in this way, and additional replicas follow the same rule.

Replica allocation algorithm:

(1) Sort all n brokers and the i partitions to be assigned.

(2) Assign the i-th partition to the (i mod n)-th broker.

(3) Assign the j-th replica of the i-th partition to the ((i + j) mod n)-th broker.

For example, in Figure 2 the third partition, partition-2, is assigned to broker3 (3 mod 6 = 3), and the replica of partition-2 is assigned to broker4 ((3 + 1) mod 6 = 4).
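The three rules above can be written out as a short sketch. It implements only the simplified rule as stated in this article and should not be read as Kafka's actual assignment code; the class ReplicaAssignmentSketch and the method replicasFor are hypothetical. One assumption to note: the code indexes the sorted broker list from 0 using the partition number (partition-2 has index 2), which gives the same placement as the article's 1-based arithmetic for brokers numbered from 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the simplified rule stated above.
class ReplicaAssignmentSketch {

    // brokers: broker ids sorted ascending; partition: 0-based partition number.
    // Replica 0 is the leader at index (partition mod n); replica j is placed
    // at index ((partition + j) mod n) of the sorted broker list.
    static List<Integer> replicasFor(List<Integer> brokers, int partition, int replicationFactor) {
        int n = brokers.size();
        List<Integer> replicas = new ArrayList<>();
        for (int j = 0; j < replicationFactor; j++) {
            replicas.add(brokers.get((partition + j) % n));
        }
        return replicas;
    }

    public static void main(String[] args) {
        // Six brokers and six partitions with two replicas each, as in Figure 2.
        List<Integer> brokers = Arrays.asList(1, 2, 3, 4, 5, 6);
        for (int p = 0; p < 6; p++) {
            System.out.println("partition-" + p + " -> brokers " + replicasFor(brokers, p, 2));
        }
    }
}
```

For partition-2 this returns broker 3 as the leader and broker 4 as the replica, matching the example in the text, and the last partition wraps around to the first broker for its replica.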

4.5. Kafka file storage characteristics

(1) Kafka splits the large file of a topic partition into many small segment files. With small segments it is easy to periodically clear or delete files that have already been consumed, which reduces disk usage. You can configure the segment file size and the message expiration time that drive this periodic deletion.

(2) Index information makes it possible to locate a message quickly.

(3) Because all index metadata is mapped into memory, disk I/O on the segment files can be avoided.

(4) Sparse storage in the index file greatly reduces the space taken up by index metadata.

4.6. Relationship between consumers and partitions

With multiple partitions and multiple consumers:

(1) Having more consumers than partitions is wasteful, because Kafka's design does not allow concurrent consumption of a single partition, so the number of consumers should not exceed the number of partitions.

(2) if
