Kafka Consumer API Example
1. Automatic offset commit
Reference: http://blog.csdn.net/xianzhen376/article/details/51167333
Properties props = new Properties();
/* The address of the Kafka service; not all brokers in the cluster need to be listed */
props.put("bootstrap.servers", "localhost:9092");
/* Specify the consumer group */
props.put("group.id", "test");
/* Whether to auto-commit the offset */
props.put("enable.auto.commit", "true");
/* Interval between automatic offset commits */
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
/* Deserializer class for the key */
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
/* Deserializer class for the value */
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
/* Create the consumer */
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
/* Subscribe to topics; more than one can be subscribed */
consumer.subscribe(Arrays.asList("foo", "bar"));
/* Read data; the poll timeout is 100 ms */
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n",
                record.offset(), record.key(), record.value());
}
Description
1. bootstrap.servers is only the entry point for connecting to Kafka; naming a single broker in the cluster is enough;
2. Once the consumer has established a connection to the Kafka cluster, it tells the cluster it is still alive via heartbeats. If no heartbeat reaches the server within session.timeout.ms, the server considers the heartbeat lost and triggers a rebalance.
2. Manually controlling the offset
If the consumer needs to process the data after fetching it, and the offset should only be committed once that processing is complete, then the program has to control the offset commit itself. For example:
Suppose that after being consumed, the data must be persisted to a DB. With automatic offset commits, the offset is committed as soon as the data is read from the Kafka cluster; if the persistence step then fails, that data is lost. So we need to control the offset commit ourselves.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
/* Turn off the auto-commit option */
props.put("enable.auto.commit", "false");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    /* Once enough data has accumulated, write it to the DB and commit the offset synchronously */
    if (buffer.size() >= minBatchSize) {
        insertIntoDB(buffer);
        consumer.commitSync();
        buffer.clear();
    }
}
It is also possible to control the commit more finely, committing a specific offset for a specific partition:
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            /* Synchronously commit a specific offset for this partition */
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
Note: the committed offset is the maximum offset of the records received, plus 1.
3. Subscribing to specific partitions
A consumer can also subscribe to specific partitions directly, at the cost of losing partition load balancing. A few scenarios where you might consume this way:
1. The consumer only needs the data of the partitions stored on its local disk;
2. The program itself, or an external framework, implements its own load balancing and failure handling. For example, YARN/Mesos can step in and start a new consumer when one dies.
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
Description
1. Even with a consumer group configured, no load balancing is performed in this mode;
2. Topic subscription (subscribe) and partition assignment (assign) cannot be mixed in the same consumer.
4. Storing offsets externally
Consumers can also store offsets outside Kafka, in a location of their own choosing. The main point of this design is to let the consumer store the data and its offset atomically, which avoids the duplicate-consumption problem mentioned above. To illustrate:
Subscribe to a specific partition, and when storing the fetched records, store each record's offset alongside it, guaranteeing that data and offset are written atomically. Then, even if storage is interrupted by an exception, every record that made it into storage has its offset recorded next to it, so nothing is lost and nothing is re-read from the server.
How to implement this:
1. Disable automatic offset commits: enable.auto.commit=false;
2. Take the offset from each ConsumerRecord and save it;
3. When the consumer restarts, call seek(TopicPartition, long) to restore the consumption position (see the sketch below).
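Putting the three steps together, a minimal sketch, reusing the props from the earlier examples. OffsetStore and its loadOffset/saveAtomically methods are hypothetical placeholders for your own storage layer (for instance, a DB transaction that writes the record and its offset together); they are not part of the Kafka API.
props.put("enable.auto.commit", "false");                        // step 1: disable auto-commit
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
TopicPartition partition = new TopicPartition("foo", 0);
consumer.assign(Arrays.asList(partition));
/* Step 3: on restart, resume from the externally stored offset */
consumer.seek(partition, OffsetStore.loadOffset(partition));     // hypothetical store
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        /* Step 2: write the record and its next offset in one atomic operation */
        OffsetStore.saveAtomically(record.value(), record.offset() + 1);  // hypothetical
    }
}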
If partition assignment is also managed by hand, this approach works beautifully. If partitions are assigned automatically, though, you have to think carefully about what happens when a rebalance occurs: if, during an upgrade for example, a partition drifts to a consumer that does not have the latest offset, you are in for a bad day.
In this case:
1. The original consumer must listen for the partition-revocation event and commit its offset when the partition is revoked. Interface: ConsumerRebalanceListener.onPartitionsRevoked(Collection);
2. The new consumer must listen for the partition-assignment event and look up the offset already consumed for that partition. Interface: ConsumerRebalanceListener.onPartitionsAssigned(Collection);
3. When the consumer receives a rebalance event, it should also flush any cached data that has not yet been processed or persisted (both callbacks are sketched below).
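A minimal sketch of wiring up both callbacks, reusing the hypothetical OffsetStore from above; the listener is passed as the second argument to subscribe():
consumer.subscribe(Arrays.asList("foo"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        /* About to lose these partitions: flush pending work, then record the positions */
        for (TopicPartition partition : partitions)
            OffsetStore.saveOffset(partition, consumer.position(partition));  // hypothetical
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        /* Newly assigned partitions: resume each one from its stored offset */
        for (TopicPartition partition : partitions)
            consumer.seek(partition, OffsetStore.loadOffset(partition));      // hypothetical
    }
});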
5. Controlling the consumption position
In most cases, the consumption position on the server side simply advances intermittently as the client commits. But Kafka also allows the consumer to set its own starting position, which makes it possible to:
1. Re-consume data that has already been consumed;
2. Skip over data entirely.
Some scenarios:
1. The data is time-sensitive and the consumer only needs the most recent window of it, so it can jump ahead;
2. In the self-stored-offset scenario above, a restart has to resume consumption from the stored position.
The interface was already mentioned above: seek(TopicPartition, long); a sketch closes this section.
Damn it, why not just call it a pointer? This section is basically redundant.
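Redundant or not, a minimal sketch of moving the position, assuming partition0 and partition1 are assigned as in section 3 (note that in clients newer than the 0.9 era, seekToBeginning/seekToEnd take a Collection instead of varargs):
/* Jump to an absolute offset on one partition */
consumer.seek(partition0, 1234L);
/* Or jump straight to either end of the log */
consumer.seekToBeginning(partition0, partition1);  // re-consume from the start
consumer.seekToEnd(partition0, partition1);        // skip ahead to the newest data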
6. Consumption flow control
When a consumer consumes multiple partitions at the same time, by default the partitions all have the same priority and are consumed equally. Kafka provides a mechanism to pause the consumption of certain partitions while the contents of other partitions are fetched. Example scenarios:
1. In stream processing, a consumer consumes two topics at once and joins their data, but the two topics produce data at very different rates. The consumer needs control logic: read the slow topic first, and only read the fast one after catching up on the slow one.
2. Several topics are consumed simultaneously, but at consumer startup one of them already has a large backlog of data; the other topics can then be given priority.
The control mechanism: pause consumption of a partition, resume it when the time comes, and then poll again. Interfaces: pause(TopicPartition...), resume(TopicPartition...). A sketch follows.
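A minimal sketch of scenario 1, assuming the consumer is assigned one slow and one fast partition, and that slowStillBehind() is a hypothetical stand-in for your own catch-up check:
TopicPartition fast = new TopicPartition("bar", 0);    // hypothetical fast topic
consumer.pause(fast);                                  // paused partitions return no records
while (slowStillBehind()) {                            // hypothetical catch-up check
    ConsumerRecords<String, String> records = consumer.poll(100);
    // ... process records from the slow topic only ...
}
consumer.resume(fast);                                 // fetched again from the next poll onward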
7. Multi-threaded processing model
The KafkaConsumer interface is not thread-safe. If multiple threads share one consumer (multiplexing its I/O), the threads have to handle synchronization themselves.
The only way to terminate a consumer immediately is to call wakeup(), which makes the processing thread throw a WakeupException. Here is the code:
public class KafkaConsumerRunner implements Runnable {
    /* Note that these two are class member variables */
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final KafkaConsumer consumer;

    public void run() {
        try {
            consumer.subscribe(Arrays.asList("topic"));
            while (!closed.get()) {
                ConsumerRecords records = consumer.poll(10000);
                // Handle new records
            }
        } catch (WakeupException e) {
            // Ignore exception if closing
            if (!closed.get()) throw e;
        } finally {
            consumer.close();
        }
    }

    // Shutdown hook which can be called from a separate thread
    public void shutdown() {
        closed.set(true);
        consumer.wakeup();
    }
}
Description
1. KafkaConsumerRunner is a Runnable; the multi-threaded driver code is left to the reader (one possibility is sketched below);
2. An external thread stops the KafkaConsumerRunner thread via shutdown();
3. The point here is multiple threads consuming the same topic, not multiple threads consuming the same partition.
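For point 1, one possible driver (my sketch, not from the original post), assuming KafkaConsumerRunner gains a constructor that creates its consumer:
KafkaConsumerRunner runner = new KafkaConsumerRunner();
ExecutorService executor = Executors.newSingleThreadExecutor();
executor.submit(runner);
/* ... later, from another thread ... */
runner.shutdown();                                // sets the flag and calls consumer.wakeup()
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);  // throws InterruptedException; handle or declare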
Comparing the two models:
Consumer single-threaded model
Pros: easy to implement;
Pros: no coordination between threads is needed, so it is usually faster than the model below;
Pros: data within a single partition is processed in order;
Cons: more TCP connections, though this hardly matters; Kafka is fully confident in its own server;
Cons: too many requests may cost the server some throughput;
Cons: the number of consumers is limited by the number of partitions, one consumer per partition;
Consumer multi-threaded model
Pros: one consumer can use any number of threads; the thread count is not limited by the partition count;
Cons: if ordering matters, you have to add your own control logic;
Cons: if you commit offsets manually, you have to add your own control logic;
A viable approach: allocate separate storage for each partition and hash the fetched data by the partition it came from. This solves both in-order consumption and offset commits; a sketch follows.
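A minimal sketch of that hand-off, as I read it (not from the official docs): a single polling thread hashes records into one queue per partition, and one worker per partition drains its own queue in order.
Map<TopicPartition, BlockingQueue<ConsumerRecord<String, String>>> queues = new HashMap<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (TopicPartition partition : records.partitions()) {
        queues.computeIfAbsent(partition, p -> new LinkedBlockingQueue<>())
              .addAll(records.records(partition));   // order within a partition is preserved
    }
    /* Worker threads (not shown) each take() from exactly one queue, so both
       per-partition ordering and per-partition offset tracking still hold. */
}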
Postscript
Frankly, parts of what the official docs say left me confused:
When comparing the two threading models, there must be some hidden premises.
1. In the single-threaded model with multiple partitions, the premise should be that each consumer independently consumes one partition;
2. In the multi-threaded model, a single consumer consumes a whole topic. If multiple threads consume the same partition at the same time, they share one connection, and each thread has to be synchronized;
3. For the per-partition client-side storage proposed for the multi-threaded model, how exactly is each partition's data supposed to be saved?