Kafka Guide


Speaking of message systems, Kafka is currently the most popular one. Our company also plans to use Kafka for unified collection of business logs, so here I share the concrete configuration and usage based on my own practice. The Kafka version used is 0.10.0.1.

Update record: 2016.08.15 first draft.

As a big-data component of the cloud-computing stack, Kafka is a distributed, partitioned, replicated messaging system. It has the basic features of a message queue as well as its own characteristics: messages are organized by topic; a producer publishes messages to a topic; a consumer fetches messages from a topic; it runs as a cluster, and each server in the cluster is called a broker; clients and servers communicate over TCP.

In a Kafka cluster there is no "central master" node: all servers are peers, so you can add and remove servers without any configuration changes, and producers and consumers of the same messages can cope with broker restarts and machines going up or down at any time.

For each topic, Kafka maintains partitions. Each partition consists of an ordered, immutable sequence of messages that are appended to it sequentially. Each message in a partition has a sequential number called the offset, which uniquely identifies the message within the partition.

There are usually two modes of publishing messages: queuing and publish-subscribe. In queue mode, several consumers read from the server at the same time and each message is read by only one of them; in publish-subscribe mode, a message is broadcast to all consumers. More commonly, each topic has a number of consumer groups. Each group is a logical "subscriber", and each group consists of several consumers for fault tolerance and better stability. This is effectively the publish-subscribe model, except that the subscriber is a group rather than a single consumer.

Through partitioning, Kafka provides better ordering and load balancing when multiple consumer groups consume concurrently. Within a group, each partition is assigned to exactly one consumer, so a partition is consumed by only one consumer of that group and its messages are consumed in order. Because there are multiple partitions, load can still be balanced across the consumers. Note that the number of consumers in a group cannot exceed the number of partitions; the number of partitions determines how much concurrent consumption is possible. A small sketch of this behaviour is shown below.
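To make this concrete, here is a minimal sketch using only the console tools shipped with Kafka; it assumes the stand-alone setup described below is already running, and the topic name grouptest and group id test-group are made up for illustration.

# create a topic with 2 partitions (hypothetical name "grouptest")
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic grouptest

# put both console consumers into the same group via a small config file
echo "group.id=test-group" > /tmp/group.properties

# run each consumer in its own terminal; with 2 partitions and 2 consumers
# in one group, each consumer receives the messages of one partition
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic grouptest --consumer.config /tmp/group.properties

# produce a few messages and watch them split across the two consumers
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic grouptest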

Kafka only guarantees message ordering within a partition; ordering across partitions is not guaranteed, which already satisfies most applications. If you need a total order over all messages in a topic, the topic can only have one partition, and of course only one consumer can consume it.

Stand-alone configuration

Follow these steps (from the official website tutorial)

1. Download Kafka: wget http://apache.01link.hk/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz or wget http://ftp.cuhk.edu.hk/pub/packages/apache.org/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz (use whichever mirror is faster). Unpack it with tar -xzf kafka_2.11-0.10.0.0.tgz and enter the folder with cd kafka_2.11-0.10.0.0/

2. Start the services. Start ZooKeeper with bin/zookeeper-server-start.sh config/zookeeper.properties & (the & sends it to the background so you can keep working), then start Kafka with bin/kafka-server-start.sh config/server.properties &

3. Create a topic called dawang with only one partition and one replica: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic dawang. List topics with bin/kafka-topics.sh --list --zookeeper localhost:2181. You can also configure the broker to create topics automatically.

4. Send messages. Kafka ships with a simple command-line producer that reads messages from a file or from standard input and sends them to the server; by default each line is sent as a separate message. Send messages with bin/kafka-console-producer.sh --broker-list localhost:9092 --topic dawang (then type your content, press Enter to send, and Ctrl+C to exit).

5. Start a consumer, which reads messages and prints them to standard output: bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic dawang --from-beginning. Run the consumer command line in one terminal and the producer command line in another; you can then type messages in one terminal and see them appear in the other. Both commands have optional parameters; run them without any arguments to see the help information.

6. Set up a multi-broker cluster. Here we start a cluster of 3 brokers, all on the local machine.

First copy the configuration files: cp config/server.properties config/server-1.properties and cp config/server.properties config/server-2.properties

The two new files need the following changes:

config/server-1.properties:
broker.id=1
listeners=PLAINTEXT://:9093
log.dir=/tmp/kafka-logs-1

config/server-2.properties:
broker.id=2
listeners=PLAINTEXT://:9094
log.dir=/tmp/kafka-logs-2

Here we changed the broker id, the port, and the log directory so they do not clash with the broker that is already running. Then start these two brokers:

bin/kafka-server-start.sh config/server-1.properties &
bin/kafka-server-start.sh config/server-2.properties &

Then create a topic with a replication factor of 3

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic oh3topic

You can use the describe command to display topic details

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic oh3topic
Topic:oh3topic PartitionCount:1 ReplicationFactor:3 Configs:
    Topic: oh3topic Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2

A quick explanation: leader is the node responsible for all reads and writes of the given partition; each node becomes the leader for a randomly chosen share of the partitions. replicas lists the nodes that hold a copy of this partition's log, whether or not they are alive. isr is the set of in-sync replicas, i.e. the replicas that are currently alive and caught up with the leader.

We can also look at the topic we created earlier:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic dawang
Topic:dawang PartitionCount:1 ReplicationFactor:1 Configs:
    Topic: dawang Partition: 0 Leader: 0 Replicas: 0 Isr: 0

Finally, we can produce and consume messages the same way as before, for example:

# produce
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic oh3topic
# consume
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic oh3topic

Open two terminals so you can produce messages in one while consuming them in the other.

Notes

If you want to use a custom port, listeners in server.properties must be configured with an IP address (or a host name that clients can resolve); if it is set to localhost or an unresolvable server host name, the Java clients will throw exceptions when connecting. A hedged example is sketched below.
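As a sketch (the IP 10.1.1.164 and port 13647 are the ones used in the cluster example later in this guide; replace them with your own), the relevant server.properties line would look like this. Normally you edit the file directly; the heredoc is only for illustration:

# bind the broker to a concrete IP so remote Java clients can reach it
cat >> config/server.properties <<'EOF'
listeners=PLAINTEXT://10.1.1.164:13647
EOF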

# create a topic
bin/kafka-topics.sh --create --zookeeper bi03:2181 --replication-factor 1 --partitions 1 --topic logs
# produce messages
bin/kafka-console-producer.sh --broker-list localhost:13647 --topic logs
# consume messages
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic logs

If ZooKeeper logs something like "fsync-ing the write ahead log in SyncThread:1 took 2243ms which will adversely effect operation" (see the ZooKeeper troubleshooting guide), the fsync during follower-leader synchronization took too long and caused timeouts. Increase tickTime, or the initLimit and syncLimit values, in the cluster configuration; see the sketch below.
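A sketch of such a change (the values are illustrative, not tuned recommendations; normally you edit config/zookeeper.properties directly):

# relax the ZooKeeper timing limits (illustrative values)
cat >> config/zookeeper.properties <<'EOF'
tickTime=3000
initLimit=10
syncLimit=5
EOF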

Kafka uses ZooKeeper to manage and coordinate brokers. Each Kafka broker coordinates with the other brokers through ZooKeeper. When a new broker joins or an existing broker fails, the ZooKeeper service notifies the producers and consumers, which then start coordinating their work with the remaining brokers accordingly.

Install Java

First install Java on both machines:

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer

Update hosts

Two machines are used as an example here (three would be better in theory: a ZooKeeper ensemble goes down once more than half of its nodes are down, so an odd number of nodes works best). Configure /etc/hosts on each machine as:

127.0.0.1 localhost
10.1.1.164 bi03
10.1.1.44 bi02

Modify the ZooKeeper config file

Modify config/zookeeper.properties to:

dataDir=/data/home/logger/kafka_2.11-0.10.0.0/zookeeper-logs/
clientPort=2181
# maxClientCnxns=0
tickTime=2000
initLimit=5
syncLimit=2
server.1=bi03:13645:13646
server.2=bi02:13645:13646

The meaning of the parameters: initLimit: a ZooKeeper cluster contains multiple servers, one of which is the leader and the rest are followers; initLimit sets the maximum time a follower may take to connect to the leader during initialization, measured in ticks. Set to 5, it means a limit of 5 * tickTime, i.e. 5 * 2000 = 10000 ms = 10 s. syncLimit: the maximum time allowed for a request and its answer between the leader and a follower, again in ticks; set to 2, it means 2 * tickTime, i.e. 4000 ms. server.X=A:B:C: X is a number identifying the server, A is the server's address, B is the port this server uses to exchange messages with the leader, and C is the port used for leader election.

Number the servers

On each machine, create a myid file in the dataDir directory:

# on server.1
echo 1 > myid
# on server.2
echo 2 > myid

Start ZooKeeper

Then start the Zookeeper service on each machine

bin/zookeeper-server-start.sh config/zookeeper.properties &

Each machine's ZooKeeper will print errors until all the nodes have started; this is normal.

If you don't want any output

nohup bin/zookeeper-server-start.sh config/zookeeper.properties &

Modify the Kafka configuration file

Modify config/server.properties; the parts to change are:

# allow topics to be deleted
delete.topic.enable=true
# must not be repeated across brokers
broker.id=0
# configure the host name of this machine
listeners=PLAINTEXT://bi03:13647
# the address and port reachable from outside, if external access is needed
advertised.listeners=PLAINTEXT://external.ip:8080
log.dirs=/data/home/logger/kafka_2.11-0.10.0.0/kafka-logs
num.partitions=2
zookeeper.connect=bi03:2181,bi02:2181

Start Kafka

Execute on each node:

bin/kafka-server-start.sh config/server.properties &

If you don't want any output

nohup bin/kafka-server-start.sh config/server.properties &

Verify the installation

Create a topic

bin/kafka-topics.sh --create --zookeeper bi03:2181,bi02:2181 --replication-factor 2 --partitions 1 --topic test

View cluster status

bin/kafka-topics.sh --describe --zookeeper bi03:2181,bi02:2181 --topic test

Produce messages; note that you produce to the listener port configured earlier, not to the ZooKeeper port:

bin/kafka-console-producer.sh --broker-list bi03:13647,bi02:13647 --topic test

Consume messages; note that here you use the ZooKeeper port, not the Kafka port:

bin/kafka-console-consumer.sh --zookeeper bi03:2181,bi02:2181 --from-beginning --topic test

Show Topic List

bin/kafka-topics.sh --zookeeper bi03:2181,bi02:2181 --list

Delete Topic

bin/kafka-topics.sh --zookeeper bi03:2181,bi02:2181 --delete --topic hello

Other configuration


Kafka is configured using key-value property files, such as config/server.properties; values can be read from a file or specified programmatically. The three most important properties are: broker.id: the broker's number, which must be unique; log.dirs: the directory where logs are saved, defaulting to /tmp/kafka-logs; zookeeper.connect: the ZooKeeper hosts. A minimal sketch follows.
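Putting the three together, a minimal broker configuration could look like the following sketch; the file name server-minimal.properties is made up, and the values are the defaults used elsewhere in this guide:

# minimal broker configuration sketch (all other settings fall back to defaults)
cat > config/server-minimal.properties <<'EOF'
broker.id=0
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181
EOF
bin/kafka-server-start.sh config/server-minimal.properties &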

Some other options I find useful:

auto.create.topics.enable: whether to allow automatic creation of topics; boolean, default true
auto.leader.rebalance.enable: whether to allow automatic leader rebalancing; boolean, default true
background.threads: number of background threads; int, default 10
compression.type: the compression type for a topic; string, one of gzip, snappy, lz4, uncompressed (no compression), or producer (keep whatever compression the producer used)
delete.topic.enable: whether to allow topic deletion; boolean, default false (mainly to control deletion through the admin interface)
leader.imbalance.check.interval.seconds: interval between leader-balance checks; long, default 300
leader.imbalance.per.broker.percentage: the allowed imbalance percentage per broker, above which rebalancing is triggered; int, default 10
log.flush.interval.messages: number of messages accumulated on a partition before data is flushed to disk; long, default 9223372036854775807
log.flush.interval.ms: how long a message may stay in memory before it is flushed to disk, in milliseconds; long; if not set, log.flush.scheduler.interval.ms is used, which defaults to 9223372036854775807

More configuration options can be found in the official documentation; the options above are broker-side, because I only use the broker's basic features.

All of the tools live under the bin/ folder; run any of them without arguments to get a list of all options. The following briefly describes some common commands.

Create and delete topics

You can create topics manually, or let nonexistent topics be created automatically when data first arrives for them; if you rely on automatic creation, you may need to tune the relevant defaults, as sketched below.
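A sketch of the broker-side settings that govern automatically created topics (values are examples, not recommendations; normally you edit these keys in server.properties rather than appending):

# defaults applied to topics created automatically on first use
cat >> config/server.properties <<'EOF'
auto.create.topics.enable=true
num.partitions=2
default.replication.factor=2
EOF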

Create topic

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y

replication-factor controls the number of replicas; 2-3 is recommended to balance fault tolerance and efficiency. partitions controls how many partitions the topic is split into; the partition count should preferably not exceed the number of servers, because partitions exist to increase parallelism and the number of servers bounds the achievable parallelism (with only 2 servers, 4 partitions are not much better than 2). Also note that a topic name cannot exceed 249 characters.

Modify Topic

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40

Note that even when the number of partitions is increased, existing data stays where it is; Kafka does not redistribute it automatically. You can verify the new partition count as shown below.
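A quick way to check the result, reusing the describe command from earlier (zk_host:port/chroot and my_topic_name are the same placeholders as above):

# confirm the topic now reports 40 partitions, and see where they ended up
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --describe --topic my_topic_name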

Add configuration

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --config x=y

Remove configuration

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --delete-config x

Delete Topic

bin/kafka-topics.sh --zookeeper zk_host:port/chroot --delete --topic my_topic_name

This requires delete.topic.enable=true. Note that the current version of Kafka does not support reducing the number of partitions of a topic.

Graceful shutdown

Kafka automatically detects broker failures and elects a new leader for the affected partitions. But when a broker has to be stopped for a configuration change, we should use a graceful (controlled) shutdown. The benefits are: all logs are synced to disk before exiting, which avoids log recovery on restart and shortens the restart time; and partitions whose leader is on this machine are migrated to other nodes first, which shortens the unavailable window.

This requires controlled.shutdown.enable=true to be set; a sketch follows.
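A minimal sketch of a controlled restart of one broker (appending via a heredoc is only for illustration; normally you edit server.properties directly):

# enable controlled shutdown on this broker
cat >> config/server.properties <<'EOF'
controlled.shutdown.enable=true
EOF

# later: stop the broker gracefully, then bring it back up
bin/kafka-server-stop.sh
bin/kafka-server-start.sh config/server.properties &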

The node that has just restarted is not the leader of any partition, so the preferred replicas need to be re-elected:

bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot

Alternatively, set auto.leader.rebalance.enable=true so this happens automatically.

You can then stop a broker with the script bin/kafka-server-stop.sh

Note that if auto.leader.rebalance.enable=true is not set in the configuration file, you need to run the rebalance manually after a restart.

In-depth understanding

What follows are just a few excerpts; see the reference links for more (especially the article from the Meituan tech blog).

File system

Kafka relies heavily on the file system to store and cache messages. The data ultimately lands on disk, but don't worry: much of the time disk speed depends entirely on how the disk is used. A well-designed disk access pattern can be nearly as fast as memory.

So unlike the traditional design that caches data in memory and flushes it to disk later, Kafka writes data directly to the file system's log. This avoids the drawbacks of keeping data in the JVM: Java objects carry a large memory overhead, and garbage collection becomes difficult as the heap grows. By relying on the file system, the cache survives a restart without needing to be rebuilt, and the logic for maintaining data consistency is simplified.

For a message system mainly used for log processing, data persistence can be done simply by appending data to a file and reading from it. The advantage is that both reads and writes are O(1), and reads do not block writes or other operations. The performance benefit is obvious because performance is decoupled from the total data size.

Since disk space is effectively unlimited (relative to memory), you can build a message system offering features that ordinary messaging systems cannot, without losing performance. For example, a typical message system deletes each message right after it is consumed, whereas Kafka can retain messages for a period of time (such as a week), which gives consumers great mobility and flexibility; a retention sketch follows.
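As a sketch, the week-long retention mentioned above corresponds to broker settings like the following (168 hours happens to be Kafka's default; the size cap is an illustrative value, and normally you edit these keys in server.properties rather than appending):

# keep messages for one week, whether or not they have been consumed
cat >> config/server.properties <<'EOF'
log.retention.hours=168
# optionally also cap retention by size per partition
log.retention.bytes=1073741824
EOF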

Transaction definition

The transaction guarantees for data transfer usually come at three levels:

At most once: the message is not sent repeatedly; it is transmitted at most once, but it may not be transmitted at all.
At least once: the message is never lost; it is transmitted at least once, but it may also be transmitted repeatedly.
Exactly once: no loss and no duplication; each message is transmitted once and only once, which is what everyone wants.

Kafka's mechanism is a bit like git: there is a commit concept, and once a message is committed and the broker is alive, the data is not lost. If a network error occurs while a producer publishes a message, the producer cannot be sure whether the error happened before or after the commit. This is not common, but it must be taken into account; the current Kafka version has not solved this problem, and future versions are trying to.

Not every situation needs the strongest "exactly once" guarantee, so Kafka lets the producer choose its level flexibly. For example, the producer can require an acknowledgement only after the message has been fully committed, or send completely asynchronously without waiting for any acknowledgement, or wait only until the leader declares it has received the message (without waiting for the followers). A sketch of these levels follows.
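In the Java producer these levels map onto the acks setting (acks=0: no acknowledgement at all; acks=1: wait only for the leader; acks=all: wait for the full in-sync replica set). The sketch below writes one such config and, assuming your console producer supports loading it via --producer.config (recent versions do), passes it along; the file name is made up:

# wait only for the partition leader to acknowledge each send
cat > /tmp/acks-leader.properties <<'EOF'
acks=1
EOF
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic dawang --producer.config /tmp/acks-leader.properties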

Now consider the issue from the consumer side. All replicas have the same log file and the same offsets, and the consumer keeps track of the offset of the messages it has consumed. If a consumer crashes, another consumer will take over consuming the messages and needs to continue from a suitable offset. There are several choices:

The consumer can read the message, write the offset to the log first, and only then process the message. It may then crash after storing the offset but before processing; the new consumer continues from that offset, so some messages are never processed. This is the "at most once" case above.

The consumer can read the message, process it, and record the offset last. If it crashes before recording the offset, the new consumer will re-consume some messages. This is the "at least once" case.

"Exactly once" can be achieved by splitting the commit into two phases: commit once after the offset is saved, and commit again after the message is processed successfully. But there is a simpler approach: store the offset of a message together with the result of processing it. For example, when processing messages with a Hadoop ETL job, both the processed result and the offset are kept in HDFS, which guarantees that the message and its offset are handled at the same time.

Performance optimization

Kafka has put great effort into efficiency. One of its main use cases is processing web-site activity logs, which can be very large, and each page view produces several writes. On the read side, assuming each message is consumed only once, the read volume is also very large, so Kafka tries to make reading as lightweight as possible.

Even with linear reads and writes, two things hurt disk performance: too many small I/O operations and too many byte copies. The I/O problem occurs both between the client and the server and in the persistence operations inside the server.

Message sets (message set)

To avoid these problems, Kafka introduces the concept of a message set, which groups messages together as a unit of processing. Handling messages as message sets rather than one at a time improves performance: the producer sends a message set to the server instead of individual messages; the server appends the whole message set to the log file in one go, reducing small I/O operations; and the consumer also requests whole message sets at a time. A hedged sketch of the producer-side batching knobs follows.
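On the producer side this batching is controlled by settings such as batch.size and linger.ms in the Java producer. The sketch below uses illustrative values, not recommendations, and again assumes the console producer can load a properties file via --producer.config:

# batch up to 64 KB per partition and wait up to 50 ms for a batch to fill
cat > /tmp/batching.properties <<'EOF'
batch.size=65536
linger.ms=50
EOF
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic dawang --producer.config /tmp/batching.properties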

Another optimization concerns byte copies. This is not a problem under low load, but it has a significant impact under high load. To avoid it, Kafka uses a standard binary message format that is shared between the producer, the broker, and the consumer, so data can move between them without any transformation.

Zero copy

The message log maintained by the broker is just a set of directory files; message sets are written to the log in a fixed format that producers and consumers share. This enables an important Kafka optimization: message delivery over the network. Modern Unix operating systems provide a highly efficient system call for sending data from the page cache to a socket, which on Linux is sendfile.

To appreciate the benefit of sendfile, consider the usual data path for sending a file to a socket: the operating system reads the data from the file into the page cache in kernel space; the application copies the data from the page cache into its own user-space buffer; the application copies the data into a socket buffer in kernel space; the operating system copies the data from the socket buffer to the NIC buffer, from which it is sent over the network.

This is clearly inefficient: four copies and two system calls. sendfile sends data straight from the page cache to the NIC buffer, avoiding the duplicate copies and greatly improving performance.

In a scenario with many consumers, the data is copied into the page cache only once and reused for every consumption, rather than being copied for each read. Messages can therefore be pushed out at close to the network bandwidth, and you see almost no read activity at the disk level because the data is served directly from the page cache to the network.

Data compression

Most of the time the bottleneck is neither CPU nor disk but network bandwidth, especially for applications that transfer large volumes of data between data centers. Users can of course compress their own messages without any Kafka support, but this yields a poor compression ratio, because compressing many messages together achieves far better results than compressing each message separately. A compression sketch is given below.
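To try this with the console tools, the producer can compress each batch before sending, and a topic can also be given a compression.type override; the commands below are a sketch reusing the dawang topic from the stand-alone example:

# produce with producer-side (end-to-end) compression
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic dawang --compression-codec snappy

# or set a compression type on the topic itself
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic dawang --config compression.type=snappy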

Kafka therefore uses end-to-end compression: thanks to the message-set concept, the client's messages can be compressed together and sent to the server, written to the log file in compressed form, and delivered to the consumer still compressed. Messages travel from producer to consumer compressed and are only decompressed when the consumer uses them, hence "end-to-end compression". Kafka supports the gzip and snappy compression protocols.

Reference links

Kafka learning notes, part 6 (server.properties configuration in practice)
Apache Kafka quick start
A classic introductory Kafka tutorial
How Apache Kafka works
Building an Apache Kafka 0.9.0.1 cluster environment
Setting up a Kafka cluster
Kafka file storage mechanism (Meituan tech blog)
Kafka principles and design/implementation ideas
Kafka design principles
Kafka cluster operations guide
What is the actual role of ZooKeeper in Kafka?

Original article: http://wdxtub.com/2016/08/15/kafka-guide/
