Author: xumingming | May be reproduced, provided that the original source, the author information, and this copyright notice are indicated with a hyperlink
The question asked most often about the new Twitter Storm feature, transactional topology, is: how does Storm know that a bolt has finished processing all of its tuples? There is a lot of work behind that, and fortunately Storm already provides a bolt that takes care of it for us: CoordinatedBolt. What matters even more is that CoordinatedBolt is itself built on Storm's primitives, spout and bolt, which means that even if the author had not provided it, we could implement it ourselves. Let's take a look at how this class works.
Although CoordinatedBolt plays a very useful role, its principle is actually not complicated. It is currently used in two scenarios:
- DRPC
- Transactional topology
Before looking at how CoordinatedBolt works, let's first be clear about what "finished" means: what exactly has finished?
In fact, CoordinatedBolt is not completely non-intrusive for business code. To use the features it provides, you must make sure that the first field of every tuple emitted by each of your bolts is the request-id. "Finished" then means that the current bolt has done all the work it has to do for the current request-id. This request-id represents a DRPC request in DRPC, and a batch in a transactional topology.
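To make the request-id convention concrete, here is a minimal sketch in plain Python. It does not use Storm's API; the function name `group_by_request_id` and the sample tuples are made up for illustration. It only shows why putting the request-id first lets a coordinating layer tell which request (or batch) each tuple belongs to.

```python
from collections import defaultdict


def group_by_request_id(tuples):
    """Group tuples by their first field, treated here as the request-id."""
    by_request = defaultdict(list)
    for tup in tuples:
        request_id, *payload = tup  # first field is the request-id by convention
        by_request[request_id].append(tuple(payload))
    return dict(by_request)


# Hypothetical tuples emitted by a bolt: (request-id, word, count)
emitted = [
    ("req-1", "storm", 2),
    ("req-2", "bolt", 1),
    ("req-1", "tuple", 5),
]
grouped = group_by_request_id(emitted)
print(grouped)
```

Everything a bolt does for "req-1" can now be tracked separately from "req-2", which is what "finished for the current request-id" relies on.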
CoordinatedBolt works like this:
- Every bolt the user puts into a DRPC or transactional topology is wrapped in a CoordinatedBolt. In other words, what actually runs inside the topology is not the user's original bolts but a set of CoordinatedBolts, each acting as a proxy for the bolt it wraps.
- With this proxy layer in place, CoordinatedBolt can do its job.
- It maintains the following information itself:
  - Which upstream tasks will send tuples to me? (This can be learned from the grouping information supplied when the topology is built.)
  - Which downstream tasks will I send tuples to? (This can also be learned from the grouping information.)
- Each time the real bolt emits a tuple, its CoordinatedBolt records which task the tuple was sent to.
- After all of its tuples have been sent (how it knows it is done is explained later), it uses emitDirect on a separate, special stream to tell every task it sent tuples to how many tuples it sent to that task.
- Once a bolt has received this count from all of its upstream tasks, it compares the total with the number of tuples it has actually received. If the numbers match, it has received every tuple, and it is finished.
- Having finished, it repeats the steps above to notify its downstream, which in turn notifies its own downstream, and so on.
- To summarize: how does each bolt know that it has finished processing? Its upstream tells it. So as long as a bolt has an upstream, it can find out when it is done.
- There is always one bolt with no upstream: the top bolt. How does this bolt know it is done? It relies on Storm's ack system: as soon as it acks the tuple sent by its upstream (a non-CoordinatedBolt; in DRPC this is PrepareRequest), it has finished processing that tuple. In other words, the top bolt only has to process a single tuple, whereas its downstream bolts process many.
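The counting steps above can be sketched as a small simulation. This is plain Python, not Storm code: the class `CoordinatedTask` and its methods are hypothetical names, messages are delivered by direct method calls, and the top task's `finish` call stands in for the ack. As a simplifying assumption, every task reports a count to each of its downstream tasks (possibly 0), so a receiver can tell when all upstream reports have arrived.

```python
from collections import defaultdict


class CoordinatedTask:
    """Toy stand-in for one task of a wrapped bolt; not Storm's real class."""

    def __init__(self, task_id, upstream_ids=(), downstreams=()):
        self.task_id = task_id
        self.upstream_ids = set(upstream_ids)  # tasks that will send tuples to me
        self.downstreams = list(downstreams)   # tasks I may send tuples to
        self.sent = defaultdict(int)           # (request_id, target id) -> tuples sent
        self.received = defaultdict(int)       # request_id -> tuples received
        self.reports = defaultdict(dict)       # request_id -> {upstream id: count}
        self.finished = set()                  # request ids this task has completed

    def emit(self, request_id, target):
        """Send one tuple to `target` and record where it went."""
        self.sent[(request_id, target.task_id)] += 1
        target.on_tuple(request_id)

    def on_tuple(self, request_id):
        """Count the incoming tuple, then forward one tuple to each downstream."""
        self.received[request_id] += 1
        for target in self.downstreams:
            self.emit(request_id, target)
        self._check_finished(request_id)

    def on_count_report(self, request_id, upstream_id, count):
        """Handle a count message arriving on the separate coordination stream."""
        self.reports[request_id][upstream_id] = count
        self._check_finished(request_id)

    def _check_finished(self, request_id):
        reports = self.reports[request_id]
        if request_id in self.finished or set(reports) != self.upstream_ids:
            return  # already done, or some upstream has not reported yet
        if sum(reports.values()) == self.received[request_id]:
            self.finish(request_id)

    def finish(self, request_id):
        """Mark the request done and tell each downstream how many tuples it got."""
        self.finished.add(request_id)
        for target in self.downstreams:
            count = self.sent[(request_id, target.task_id)]
            target.on_count_report(request_id, self.task_id, count)


# Topology: top (0) -> mid1 (1), mid2 (2) -> sink (3)
sink = CoordinatedTask(3, upstream_ids=[1, 2])
mid1 = CoordinatedTask(1, upstream_ids=[0], downstreams=[sink])
mid2 = CoordinatedTask(2, upstream_ids=[0], downstreams=[sink])
top = CoordinatedTask(0, downstreams=[mid1, mid2])

for target in (mid1, mid1, mid1, mid2):
    top.emit("req-1", target)
done_before = "req-1" in sink.finished  # no counts reported yet

top.finish("req-1")  # stands in for the top bolt acking its input tuple
done_ids = sorted(t.task_id for t in (top, mid1, mid2, sink) if "req-1" in t.finished)
print(done_before, done_ids)  # prints: False [0, 1, 2, 3]
```

Note that the sink only finishes after both mid1 and mid2 have reported, even though all four tuples reached it earlier. The check runs on every tuple and every report, so the order in which they arrive does not matter in this sketch.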
The specific principle is shown in the figure:
When discussing what "finished" means, we said that using CoordinatedBolt is intrusive for business code: you must put the request-id in the first field of every tuple you emit, otherwise CoordinatedBolt cannot track it. A more elegant approach would be the way the network protocol stack handles IP and TCP. The IP layer wraps the TCP packet in an outer layer carrying the information the IP layer needs, and does not require IP-layer fields to be mixed into the TCP packet. When sending data, the TCP layer only fills in its own TCP fields, and the IP-layer information is added automatically once the data reaches the IP layer. Likewise, the IP layer automatically strips its own information before passing the packet up to the TCP layer, so TCP only ever sees the fields of its own layer, with no intrusion. The author has introduced some improvements for this problem here.
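The layered alternative described above can be sketched as follows. This is only an illustration of the wrapping idea in plain Python, not how Storm or CoordinatedBolt actually behaves; `wrap`, `unwrap`, `user_bolt`, and `coordinated_execute` are hypothetical names.

```python
def wrap(request_id, user_tuple):
    """Coordination layer adds its own header outside the user's tuple."""
    return {"request_id": request_id, "payload": user_tuple}


def unwrap(envelope):
    """Strip the coordination header before handing the tuple to user code."""
    return envelope["request_id"], envelope["payload"]


def user_bolt(word, count):
    """User logic sees only its own fields, never the request-id."""
    return (word.upper(), count * 2)


def coordinated_execute(envelope):
    """Unwrap, run the user's bolt, and re-wrap the result with the same id."""
    request_id, payload = unwrap(envelope)
    result = user_bolt(*payload)
    return wrap(request_id, result)


out = coordinated_execute(wrap("req-1", ("storm", 2)))
print(out)
```

Here the request-id travels in the outer envelope, so the user's tuples would not need to reserve their first field for it.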
Storm Primer (11): Twitter Storm source code analysis, CoordinatedBolt