Getting started with Apache Kafka, with a Java program example

Source: Internet
Author: User

This article is translated from excerpts of the official English documentation and serves as my own notes and record. My knowledge is limited, so corrections are welcome. The version used is kafka_2.10-0.10.0.0.
  

First, the basic concepts
    • Topic: Kafka maintains feeds of messages in categories called topics.

    • Producer: We'll call processes that publish messages to a Kafka topic producers.

    • Consumer: We'll call processes that subscribe to topics and process the feed of published messages consumers.

    • Broker: Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.

      Producers send messages over the network to the Kafka cluster, which in turn serves them to consumers.

    • Partition: A partition is a physical concept, and each topic contains one or more partitions.

Topics and logs

A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log, as shown in the picture below.
  
Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log. Each message in a partition is assigned a sequential ID number called the offset, which uniquely identifies the message within that partition.
Kafka retains all published messages, whether or not they have been consumed, for a configurable period of time. For example, if log retention is set to two days, a message is available for consumption for two days after it is published, and is then discarded to free up space.
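
The retention rule above can be modelled in a few lines of plain Java. This is only an in-memory sketch under simplifying assumptions, not Kafka's implementation (Kafka deletes whole log segments on disk); `RetentionSketch`, `publish` and `expire` are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory model of time-based retention; not part of Kafka.
final class RetentionSketch {
    private static final class Entry {
        final long timestampMs;
        final String message;

        Entry(long timestampMs, String message) {
            this.timestampMs = timestampMs;
            this.message = message;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<Entry>();
    private final long retentionMs;

    RetentionSketch(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    // Appends a message stamped with the given publication time.
    void publish(long nowMs, String message) {
        entries.addLast(new Entry(nowMs, message));
    }

    // Drops every message older than the retention period, consumed or not.
    void expire(long nowMs) {
        while (!entries.isEmpty() && nowMs - entries.peekFirst().timestampMs > retentionMs) {
            entries.removeFirst();
        }
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        RetentionSketch log = new RetentionSketch(2 * day);
        log.publish(0, "old");
        log.publish(day, "newer");
        log.expire(2 * day + 1); // "old" is now past the two-day window
        System.out.println("messages retained: " + log.size());
    }
}
```

Note that `expire` never asks whether a message was read: retention depends only on age, which is the point the paragraph above makes.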

In fact, the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the "offset". This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads messages, but because the consumer controls the position, it can consume messages in any order it likes. For example, a consumer can reset to an older offset to reprocess messages.
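
The consumer-controlled offset described above can be illustrated without a running cluster. The following is a minimal sketch, assuming a single partition held in memory; `PartitionLog`, `append`, `read` and `endOffset` are hypothetical names and not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a single partition; not part of the Kafka client API.
final class PartitionLog {
    private final List<String> messages = new ArrayList<String>();

    // Appends a message and returns the offset assigned to it.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Reads the message at the given offset; reading never modifies the log.
    String read(long offset) {
        return messages.get((int) offset);
    }

    // Offset that the next appended message will receive.
    long endOffset() {
        return messages.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        log.append("a");
        log.append("b");
        log.append("c");

        // The consumer, not the log, keeps track of its own position.
        long offset = 0;
        while (offset < log.endOffset()) {
            System.out.println(offset + ": " + log.read(offset));
            offset++; // normal consumption advances linearly
        }

        offset = 1; // rewind to an older offset to reprocess
        System.out.println("replayed: " + log.read(offset));
    }
}
```

Because the position lives in the consumer, a second reader with its own `offset` variable could traverse the same log without affecting the first.
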
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use the command-line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, partitions act as the unit of parallelism: they are independent of one another and can be processed in parallel.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. For fault tolerance, each partition is replicated across a configurable number of servers.
 
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers". The leader handles all read and write requests for the partition, while the followers passively replicate the leader. If the leader fails, one of the followers automatically becomes the new leader.
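
The failover behaviour described above can be sketched as a toy model. It assumes, purely for illustration, that the next live replica in the list becomes the leader; the real election is coordinated by the cluster and is not shown here. `ReplicaSetSketch`, `leader` and `fail` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one partition's replicas; not Kafka's election logic.
final class ReplicaSetSketch {
    // Broker ids of live replicas; by assumption the first one is the leader.
    private final List<Integer> liveReplicas;

    ReplicaSetSketch(List<Integer> brokerIds) {
        this.liveReplicas = new ArrayList<Integer>(brokerIds);
    }

    // Returns the broker id that currently handles reads and writes.
    int leader() {
        if (liveReplicas.isEmpty()) {
            throw new IllegalStateException("no live replica left");
        }
        return liveReplicas.get(0);
    }

    // Marks a broker as failed; if it was the leader, the next replica takes over.
    void fail(int brokerId) {
        liveReplicas.remove(Integer.valueOf(brokerId));
    }

    public static void main(String[] args) {
        ReplicaSetSketch partition = new ReplicaSetSketch(Arrays.asList(1, 2, 3));
        System.out.println("leader: " + partition.leader());
        partition.fail(1); // simulate the leader going down
        System.out.println("leader after failure: " + partition.leader());
    }
}
```

With three replicas the model still has a leader after two failures, and only the third failure leaves the partition unavailable.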
 

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which partition within the topic each message is assigned to. This can be done in a simple round-robin fashion to balance load, or according to some semantic partition function (for example, based on a key in the message).
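
The two partitioning strategies mentioned above can be sketched as follows. This is an illustrative model only: `PartitionChoiceSketch`, `roundRobin` and `byKey` are hypothetical names, and `String.hashCode` is a stand-in for the hash of the serialized key that Kafka's own partitioner computes.

```java
// Hypothetical partition-selection helpers; not the Kafka Partitioner interface.
final class PartitionChoiceSketch {
    private int next = 0;

    // Round-robin: cycles through the partitions to spread load evenly.
    int roundRobin(int numPartitions) {
        int partition = next % numPartitions;
        next = (next + 1) % numPartitions;
        return partition;
    }

    // Key-based: the same key always maps to the same partition.
    // String.hashCode is an assumption made for this sketch only.
    static int byKey(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        PartitionChoiceSketch chooser = new PartitionChoiceSketch();
        for (int i = 0; i < 5; i++) {
            System.out.println("round-robin -> partition " + chooser.roundRobin(4));
        }
        System.out.println("key user-42 -> partition " + byKey("user-42", 4));
        System.out.println("key user-42 -> partition " + byKey("user-42", 4));
    }
}
```

Round-robin gives the best balance, while key-based selection keeps all messages for one key in a single partition and therefore in order.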

Consumers

Each consumer labels itself with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, this works like publish-subscribe, and all messages are broadcast to all consumers.
More commonly, however, topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance.
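
The queue and publish-subscribe behaviours described above both follow from one rule: each message goes to exactly one instance per group. A minimal sketch of that rule, assuming a simplified choice of instance (message number modulo group size) rather than Kafka's partition-based assignment; `GroupDeliverySketch`, `join` and `deliver` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of delivery to consumer groups; not the Kafka client API.
final class GroupDeliverySketch {
    // Group name -> consumer instances in that group, in join order.
    private final Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();

    void join(String group, String instance) {
        List<String> members = groups.get(group);
        if (members == null) {
            members = new ArrayList<String>();
            groups.put(group, members);
        }
        members.add(instance);
    }

    // Returns the instances that receive the given message: exactly one per group.
    List<String> deliver(int messageNumber) {
        List<String> receivers = new ArrayList<String>();
        for (List<String> members : groups.values()) {
            receivers.add(members.get(messageNumber % members.size()));
        }
        return receivers;
    }

    public static void main(String[] args) {
        // One shared group: behaves like a queue.
        GroupDeliverySketch queue = new GroupDeliverySketch();
        queue.join("g", "c1");
        queue.join("g", "c2");
        System.out.println("queue:   " + queue.deliver(0) + " " + queue.deliver(1));

        // One group per instance: behaves like publish-subscribe.
        GroupDeliverySketch pubSub = new GroupDeliverySketch();
        pubSub.join("g1", "c1");
        pubSub.join("g2", "c2");
        System.out.println("pub-sub: " + pubSub.deliver(0) + " " + pubSub.deliver(1));
    }
}
```

In the queue case each message reaches a single consumer, while in the publish-subscribe case every message reaches both.
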
Kafka also has stronger ordering guarantees than a traditional messaging system.
A traditional queue retains messages in order on the server, and if multiple consumers consume from the queue, the server hands out messages in the order they are stored. However, although the server hands out messages in order, they are delivered asynchronously to consumers, so they may arrive out of order at different consumers. This effectively means that the ordering of messages is lost in the presence of parallel consumption. Messaging systems often work around this with a notion of an "exclusive consumer" that allows only one process to consume from a queue, but of course this means there is no parallelism in processing.
Kafka does this better. By having a notion of parallelism (the partition) within topics, Kafka can provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions of a topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. This ensures that the consumer is the only reader of that partition and consumes the data in order. Note, however, that a consumer group cannot usefully have more consumer instances than partitions; any extra instance would have no partition to read.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic. If you require a total order over all messages, this can be achieved with a topic that has only one partition, though this means only one consumer process per consumer group.
 

Guarantees

Kafka gives the following guarantees:

    • Messages sent by a producer to a particular topic partition are appended in the order they are sent.
    • A consumer instance sees messages in the order they are stored in the log.
    • For a topic with replication factor N, up to N-1 server failures are tolerated without losing any messages committed to the log.
       
Second, program examples

Now for the important part: if the concepts above are hard to follow, reading a program is the most direct way to understand them.
Suppose we have a topic called foo with 4 partitions, and two consumer groups, GroupA and GroupB.

GroupA has 2 consumers and GroupB has 4 consumers.
Our producer writes messages evenly across the 4 partitions. Producer example:

package part;

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // "all" blocks on the full commit of the record: the slowest but most durable setting.
        props.put("acks", "all");
        // If a request fails the producer can retry automatically, but with retries set to 0 it will not.
        props.put("retries", 0);
        // The producer maintains buffers of unsent records for each partition.
        props.put("batch.size", 16384);
        // Records are sent immediately by default; this is the delay in milliseconds.
        props.put("linger.ms", 1);
        // Producer buffer size. When the buffer is exhausted, additional send calls block;
        // once max.block.ms is exceeded a TimeoutException is thrown.
        props.put("buffer.memory", 33554432);
        // key.serializer and value.serializer instruct how to turn the key and value objects
        // the user provides with their ProducerRecord into bytes.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Create the Kafka producer.
        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        // Main methods of the producer:
        // close(): closes this producer.
        // close(long timeout, TimeUnit timeUnit): waits up to timeout for the producer to
        //     complete the sending of all incomplete requests.
        // flush(): all buffered records are sent immediately.
        // The loop bound was lost in the original listing; 100 is inferred from the
        // 25 messages per GroupB consumer mentioned later in the article.
        for (int i = 0; i < 100; i++)
            // Write evenly across the 4 partitions.
            producer.send(new ProducerRecord<String, String>("foo", i % 4,
                    Integer.toString(i), Integer.toString(i)));
        producer.close();
    }
}

Consumers

package part;

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TestConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        System.out.println("This was the group part test 1");
        // Consumer group ID: use GroupA or GroupB here.
        props.put("group.id", "GroupA");
        props.put("enable.auto.commit", "true");
        // The original value was lost in extraction; 1000 ms is an assumed placeholder.
        props.put("auto.commit.interval.ms", "1000");
        // Session timeout, which bounds the time spent processing between calls to poll.
        props.put("session.timeout.ms", "30000");
        // Maximum number of records per poll (disabled; the original value was lost in extraction).
        // props.put("max.poll.records", "...");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        // Subscribe to the list of topics.
        consumer.subscribe(Arrays.asList("foo"));
        while (true) {
            // The original timeout was lost in extraction; 100 ms is an assumed placeholder.
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records)
                // Normally records should be handed to a thread pool rather than processed here.
                System.out.printf("offset = %d, key = %s, value = %s",
                        record.offset(), record.key(), record.value() + "\n");
        }
    }
}

If both GroupA and GroupB start normally, the 4 consumers in GroupB share the producer's messages evenly (25 messages each here, one partition per consumer), and the 2 consumers in GroupA process 50 messages each, with each consumer handling 2 partitions. If one consumer in GroupA dies, the other one processes all the messages. If one consumer in GroupB dies, one of the remaining consumers takes over its partition.
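
The numbers above can be checked with a small model of partition assignment. This is a sketch only: it spreads partitions over consumers in round-robin order, which is an assumption and not Kafka's actual assignor, and it assumes 100 messages in total (the figure implied by 25 messages per GroupB consumer). `GroupAssignmentSketch`, `assign` and `messagesPerConsumer` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of partition assignment within one consumer group.
final class GroupAssignmentSketch {

    // Spreads partitions 0..numPartitions-1 over the live consumers in round-robin order.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<String, List<Integer>>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<Integer>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    // Counts messages per consumer when message i is written to partition i % numPartitions.
    static Map<String, Integer> messagesPerConsumer(int totalMessages, int numPartitions,
                                                    List<String> consumers) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String consumer : consumers) {
            counts.put(consumer, 0);
        }
        for (int i = 0; i < totalMessages; i++) {
            int partition = i % numPartitions;
            String owner = consumers.get(partition % consumers.size());
            counts.put(owner, counts.get(owner) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> groupA = Arrays.asList("a1", "a2");
        List<String> groupB = Arrays.asList("b1", "b2", "b3", "b4");
        System.out.println("GroupA partitions: " + assign(4, groupA));
        System.out.println("GroupA messages:   " + messagesPerConsumer(100, 4, groupA));
        System.out.println("GroupB partitions: " + assign(4, groupB));
        System.out.println("GroupB messages:   " + messagesPerConsumer(100, 4, groupB));
        // b4 fails: the group rebalances over the three surviving consumers.
        System.out.println("GroupB after b4 fails: " + assign(4, Arrays.asList("b1", "b2", "b3")));
    }
}
```

After the simulated failure, one surviving consumer owns two partitions while the others keep one each, which matches the takeover described above.
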
The following command changes the number of partitions of a topic (the count can only be increased).

bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic foo --partitions 4

Third, multi-broker cluster

This works somewhat like the ZooKeeper mechanism: a leader and several followers are established. The main purpose is again scalability and fault tolerance, so that when any single node has a problem the system keeps running correctly and stably. Even if the leader fails, a new leader can be elected. Only a brief explanation is given here.

In the official example, a pseudo-cluster is set up locally by copying the original configuration file and editing the copies.

> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties

config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2

Here the broker.id property is the unique and permanent name of each node in the cluster. Normally each machine runs one broker; the port and log directory have to be overridden only because this pseudo-cluster runs all brokers on the same machine.
Let's start these two brokers to form the pseudo-cluster. After a simulated leader failure (a forced kill), the cluster keeps working properly.
Start:

> bin/kafka-server-start.sh config/server-1.properties &
...
> bin/kafka-server-start.sh config/server-2.properties &

Fourth, typical application scenarios
    1. Monitoring: Hosts send metrics about system and application health through Kafka, which are then collected and processed to build monitoring dashboards and send alerts. In addition, LinkedIn uses Apache Samza to implement a rich call-graph analysis system that processes events in real time.
    2. Traditional messaging: Applications use Kafka as a traditional messaging system to implement standard queuing and publish-subscribe messaging, for example for search and content feeds.
    3. Analytics: To better understand user behavior and improve the user experience, LinkedIn sends information such as which pages users view and what they click to the Kafka cluster in each data center, then analyzes it and generates daily reports with Hadoop.
    4. As a component of a distributed application or platform (log): Products such as the big data warehouse solution Pinot use Kafka as a core component (a distributed log), and the distributed database Espresso uses it as its internal replication and change propagation layer.

Original English documentation: http://kafka.apache.org/documentation.html#quickstart
