Starting from here, we move on to analyzing the consumer. Like the producer, the consumer comes in both an old Scala version and a new Java version; we will only analyze the new Java version.
Before diving into the analysis, let's take a look at the basic usage of the consumer:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar")); // Core function 1: subscribe to topics
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100); // Core function 2: long poll, one pull returns multiple messages
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s",
            record.offset(), record.key(), record.value());
}
Consumer is not thread-safe
As we mentioned earlier, KafkaProducer is thread-safe, and multiple threads can share one producer instance. But the consumer is not.
In almost every public method of KafkaConsumer, we see this pattern:
public ConsumerRecords<K, V> poll(long timeout) {
    acquire(); // acquire/release here is not a lock for multi-threaded access; quite the opposite: it exists to detect multi-threaded calls. If a concurrent call is detected, an exception is thrown directly.
    ...
    release();
}
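The idea behind acquire/release can be sketched as follows. This is a simplified re-implementation for illustration, not the actual KafkaConsumer source: the real client uses a similar compare-and-set on the calling thread's id, but the class and constant names below are ours.

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a single-threaded-access guard in the spirit of
// KafkaConsumer's acquire()/release(). Not the real source code.
class SingleThreadGuard {
    private static final long NO_CURRENT_THREAD = -1L;
    private final AtomicLong currentThread = new AtomicLong(NO_CURRENT_THREAD);

    void acquire() {
        long threadId = Thread.currentThread().getId();
        // CAS succeeds only if no other thread is currently inside;
        // otherwise fail fast instead of blocking.
        if (threadId != currentThread.get()
                && !currentThread.compareAndSet(NO_CURRENT_THREAD, threadId)) {
            throw new ConcurrentModificationException(
                "consumer is not safe for multi-threaded access");
        }
    }

    void release() {
        currentThread.set(NO_CURRENT_THREAD);
    }
}
```

Note that this is "anti-locking": instead of serializing concurrent callers, the guard makes the second thread fail immediately, which is exactly the behavior the prose above describes.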
Consumer group – load-balancing mode vs. pub/sub mode
Every consumer instance must be given a group.id at initialization time. This group.id determines whether multiple consumers consuming the same topic share the messages among themselves or each receive a full broadcast.
Assume multiple consumers subscribe to the same topic, and this topic has multiple partitions.
Load-balancing mode: multiple consumers belong to the same group, and the messages of the topic's partitions are distributed among these consumers.
Pub/sub mode: multiple consumers belong to different groups; in that case, all messages of the topic are broadcast to every group.
Partition auto-assignment vs. manual assignment
In the load-balancing mode above, we call the subscribe function and specify only the topic, not the partitions; in this case, partitions are automatically assigned among all consumers in the group.
The other way is to force a given consumer to consume specific partitions of specific topics, using the assign function.
public void subscribe(List<String> topics) {
    subscribe(topics, new NoOpConsumerRebalanceListener());
}

public void assign(List<TopicPartition> partitions) {
    ...
}
One key point: these two modes are mutually exclusive. If you use subscribe, you cannot use assign, and vice versa.
In the code, the two modes are stored in two different variables:
public class SubscriptionState {
    ...
    private final Set<String> subscription;            // corresponds to subscribe mode
    private final Set<TopicPartition> userAssignment;  // corresponds to assign mode
}
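The mutual-exclusion check can be modeled with a few lines. This is a simplified sketch, not the real SubscriptionState: partitions are represented as plain strings here, and the real client's check and error message differ in detail, though it likewise throws IllegalStateException when the two modes are mixed.

```java
import java.util.*;

// Simplified model of the subscribe-vs-assign mutual exclusion.
// Illustrative only; not the actual SubscriptionState source.
class SubscriptionStateSketch {
    private final Set<String> subscription = new HashSet<>();    // subscribe mode
    private final Set<String> userAssignment = new HashSet<>();  // assign mode (partitions simplified to strings)

    void subscribe(List<String> topics) {
        if (!userAssignment.isEmpty())
            throw new IllegalStateException(
                "Subscription to topics and manual partition assignment are mutually exclusive");
        subscription.addAll(topics);
    }

    void assign(List<String> partitions) {
        if (!subscription.isEmpty())
            throw new IllegalStateException(
                "Subscription to topics and manual partition assignment are mutually exclusive");
        userAssignment.addAll(partitions);
    }
}
```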
Likewise, when subscribe or assign is called, the code performs the corresponding check; if it finds the two modes being mixed, it throws an exception.
Consumer acknowledgement – consume offset vs. committed offset
As we mentioned earlier, "consumer acknowledgement" is a problem every message middleware has to solve: after fetching a message and processing it, the consumer sends an ACK (or confirm) to the middleware.
This results in two consumption positions, or two offset values: one is the consume offset of the currently fetched messages; the other is the committed offset, which is advanced only after processing is complete and the ACK has been sent.
Obviously, in asynchronous mode, the committed offset lags behind the consume offset.
A key point here: if the consumer crashes, it will re-consume from the committed offset position, not from the consume offset position. This means messages may be consumed more than once.
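The replay window can be made concrete with a toy calculation. The numbers and names below are purely illustrative, not real client state: the consumer has fetched up to the consume offset but only committed up to the committed offset, so after a crash it resumes from the committed offset and replays the gap.

```java
import java.util.*;

// Toy illustration of which offsets get re-consumed after a crash:
// everything in [committedOffset, consumeOffset) is replayed.
class OffsetReplaySim {
    static List<Long> replayedAfterCrash(long consumeOffset, long committedOffset) {
        List<Long> replayed = new ArrayList<>();
        for (long o = committedOffset; o < consumeOffset; o++) {
            replayed.add(o); // this offset will be delivered a second time
        }
        return replayed;
    }
}
```

For example, with consume offset 10 and committed offset 5, offsets 5 through 9 are delivered again after restart.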
In the 0.9 client, there are three ACK strategies:
Strategy 1: automatic, periodic ACK, as shown in the demo above:
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
Strategy 2: consumer.commitSync(): manual, synchronous ACK. After each message (or batch) is processed, call commitSync once.
Strategy 3: consumer.commitAsync(): manual, asynchronous ACK.
Exactly once – saving the offset yourself
As we said earlier, Kafka only guarantees that messages are not lost, i.e., at least once; it does not guarantee that messages are not duplicated.
Duplicate sends: the client cannot solve this on its own; the server would have to deduplicate, which is too expensive.
Duplicate consumption: with commitSync() above, we could call commitSync once after processing each batch of messages. Does that solve "duplicate consumption"? Consider the following code:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        insertIntoDb(buffer);   // finish processing: save to DB
        consumer.commitSync();  // synchronously send the ACK
        buffer.clear();
    }
}
The answer is no, because insertIntoDb and commitSync above are not atomic: if the data has been processed but the process dies before commitSync completes, then after a restart the same messages will still be consumed again.
So what is the solution?
The answer is to save the committed offset yourself, instead of relying on the Kafka cluster to save it, and to make processing the message and saving the offset a single atomic operation.
Kafka's official documentation lists the following two usage scenarios for saving the offset yourself:
Relational databases, accessed through transactions (if the consumer crashes and restarts, messages are not consumed again): "If the results of the consumption are being stored in a relational database, storing the offset in the database as well can allow committing both the results and offset in a single transaction. Thus either the transaction will succeed and the offset will be updated based on what was consumed, or the result will not be stored and the offset won't be updated."
Search engine (store the offset together with the data, built into the index): "If the results are being stored in a local store it is possible to store the offset there as well. For example a search index could be built by subscribing to a particular partition and storing both the offset and the indexed data together. If this is done in a way that is atomic, it is often possible to have it be the case that even if a crash occurs that causes unsynced data to be lost, whatever is left has the corresponding offset stored as well. This means that the indexing process that comes back having lost recent updates just resumes indexing from what it has, ensuring that no updates are lost."
At the same time, the documentation says that to save the offset yourself, you need to do the following:
1. Configure enable.auto.commit=false to disable automatic ACK.
2. Use the offset provided with each ConsumerRecord to save your position: each time messages are fetched, save the corresponding offset.
3. On restart, restore the consumer's position with seek(TopicPartition, long): at the next startup, use the consumer.seek function to locate the saved offset and resume consuming from there.
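The database scenario above can be sketched with a toy in-memory "DB" that commits the processed records and the new offset in one atomic step. All class and method names here are illustrative, standing in for a real transactional store; on restart, the real code would call consumer.seek(partition, committedOffset) with the value read back from the DB.

```java
import java.util.*;

// Toy model of the "store results and offset in one transaction" approach.
// Illustrative only: a real implementation would use an actual DB transaction.
class OffsetStoreDemo {
    static class Db {
        final List<String> rows = new ArrayList<>();
        long committedOffset = 0;

        // Both writes happen together; if the process crashes before this
        // call, neither the rows nor the offset are persisted.
        synchronized void commitBatch(List<String> batch, long nextOffset) {
            rows.addAll(batch);
            committedOffset = nextOffset;
        }
    }

    // On restart, this is the position to pass to consumer.seek(...).
    static long resumePosition(Db db) {
        return db.committedOffset;
    }
}
```

Because the rows and the offset move together, the "processed but not acknowledged" window that broke the commitSync version above no longer exists.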
Through this approach, we achieve "exactly once" on the consumer side: on the consumer, messages are neither lost nor duplicated.
Going one step further and considering producer + consumer together: with exactly once on the consumer side, plus deduplication in the DB, even if the sender produces "duplicate sends", there is no problem.