Storm Common Patterns: Stream Join (Aggregation)
Topology
1. Define two spouts, genderSpout and ageSpout.
Their output fields are ("id", "gender") and ("id", "age"); the final join result should be ("id", "gender", "age").
2. When creating the SingleJoinBolt, the outFields must be passed as a constructor parameter to tell the bolt which fields the join result should contain.
Both spouts are connected with fieldsGrouping on ("id"), which guarantees that tuples with the same id are sent to the same task.
public class SingleJoinExample {
    public static void main(String[] args) {
        FeederSpout genderSpout = new FeederSpout(new Fields("id", "gender"));
        FeederSpout ageSpout = new FeederSpout(new Fields("id", "age"));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("gender", genderSpout);
        builder.setSpout("age", ageSpout);
        builder.setBolt("join", new SingleJoinBolt(new Fields("gender", "age")))
               .fieldsGrouping("gender", new Fields("id"))
               .fieldsGrouping("age", new Fields("id"));
    }
}
SingleJoinBolt
Because the bolt cannot guarantee that all tuples for a given id arrive at the same time, it must first cache received tuples in memory; only once all tuples for an id have arrived can it perform the join.
After the join, the tuples can be deleted from the cache. However, if some tuple for an id is lost, the other tuples for that id would stay cached forever.
To solve this, a timeout is set on the cached data: expired entries are deleted, and fail notifications are sent for the tuples they contain.
TimeCacheMap is a good fit for this scenario:
TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>
List<Object>: the join key fields. In the example above this is just "id"; it is a list in order to support multi-field joins.
Map<GlobalStreamId, Tuple>: records which tuple came from which stream.
In this example, the following two key/value entries are extracted from the TimeCacheMap bucket and then joined:
{id, {ageStream, (id, age)}}
{id, {genderStream, (id, gender)}}
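The shape of this cache and the join it enables can be sketched with plain Java collections (a minimal sketch: stream ids are plain strings here rather than Storm's GlobalStreamId, and tuples are plain lists):

```java
import java.util.*;

public class JoinSketch {
    public static void main(String[] args) {
        // Cache keyed by the join fields (here just the id), one inner map per id.
        Map<List<Object>, Map<String, List<Object>>> pending = new HashMap<>();

        List<Object> id = Collections.singletonList(1);
        pending.computeIfAbsent(id, k -> new HashMap<>())
               .put("genderStream", Arrays.asList(1, "male"));
        pending.get(id).put("ageStream", Arrays.asList(1, 25));

        // Once tuples from both streams are present, build (id, gender, age).
        Map<String, List<Object>> parts = pending.remove(id);
        List<Object> joined = Arrays.asList(
                id.get(0),
                parts.get("genderStream").get(1),
                parts.get("ageStream").get(1));
        System.out.println(joined); // [1, male, 25]
    }
}
```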
1. Prepare
The prepare logic is usually trivial, but here it is fairly involved.
a. Set the timeout and the expire callback.
The timeout is taken from Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, which defaults to 30 s and can be adjusted per scenario.
The spouts should try to preserve a consistent emit order so that tuples with the same id arrive within a short interval; in this example, that means sorting by id before emitting.
Otherwise, if ("id", "gender") is emitted first and ("id", "age") much later, the gender tuple may expire from the cache before its matching age tuple ever arrives.
The expire callback is set so that a fail notification is sent for every tuple that times out.
private class ExpireCallback
        implements TimeCacheMap.ExpiredCallback<List<Object>, Map<GlobalStreamId, Tuple>> {
    @Override
    public void expire(List<Object> id, Map<GlobalStreamId, Tuple> tuples) {
        for (Tuple tuple : tuples.values()) {
            _collector.fail(tuple);
        }
    }
}
b. Work out _idFields (the fields common to all sources, usable as the join key) and _fieldLocations (which source stream each outfield comes from, e.g. gender belongs to genderStream).
Use context.getThisSources() to retrieve the list of source streams, and getComponentOutputFields to obtain each source's output fields.
_idFields: the logic is simple. For each source's field set, call retainAll (set intersection) against the accumulated idFields; what remains is the set of fields common to all spouts.
_fieldLocations: match _outFields against each source's fields and record the mapping found.
In fact, I think this preparation work could be avoided by passing the mapping in as a parameter, which would be much less troublesome.
For example, the parameter could be changed to ("id", {"gender", genderStream}, {"age", ageStream}).
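The retainAll step can be illustrated in isolation (a minimal sketch using the field names from the example above):

```java
import java.util.*;

public class IdFieldsSketch {
    public static void main(String[] args) {
        Set<String> idFields = new HashSet<>(Arrays.asList("id", "gender"));
        Set<String> ageFields = new HashSet<>(Arrays.asList("id", "age"));
        // retainAll keeps only the intersection: the fields common to every source.
        idFields.retainAll(ageFields);
        System.out.println(idFields); // [id]
    }
}
```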
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    _fieldLocations = new HashMap<String, GlobalStreamId>();
    _collector = collector;
    int timeout = ((Number) conf.get(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS)).intValue();
    _pending = new TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>(timeout, new ExpireCallback());
    _numSources = context.getThisSources().size();
    Set<String> idFields = null;
    for (GlobalStreamId source : context.getThisSources().keySet()) {
        Fields fields = context.getComponentOutputFields(source.get_componentId(), source.get_streamId());
        Set<String> setFields = new HashSet<String>(fields.toList());
        if (idFields == null) idFields = setFields;
        else idFields.retainAll(setFields);
        for (String outfield : _outFields) {
            for (String sourcefield : fields) {
                if (outfield.equals(sourcefield)) {
                    _fieldLocations.put(outfield, source);
                }
            }
        }
    }
    _idFields = new Fields(new ArrayList<String>(idFields));
    if (_fieldLocations.size() != _outFields.size()) {
        throw new RuntimeException("Cannot find all outfields among sources");
    }
}
2. Execute
a. Extract the _idFields values and the stream id from the tuple.
If this id is not yet in _pending (the TimeCacheMap), create a new HashMap for it and put it into the bucket.
b. Retrieve the Map<GlobalStreamId, Tuple> of parts for this id, and check that the new tuple is valid (a second tuple with the same id from the same stream is an error).
Put the new tuple into the parts map for this id: parts.put(streamId, tuple).
c. Check whether the size of parts equals the number of spout sources (2 in this example), i.e. whether tuples from both genderStream and ageStream have been received.
If so, remove this id's cached data from _pending (the TimeCacheMap), since the join can proceed and there is no need to keep waiting.
Then extract the values from each stream's tuple according to _outFields and _fieldLocations.
Finally, emit the join result (gender, age), anchored on the input tuples (id, gender) and (id, age):
_collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);
Then ack all the tuples.
@Override
public void execute(Tuple tuple) {
    List<Object> id = tuple.select(_idFields);
    GlobalStreamId streamId =
            new GlobalStreamId(tuple.getSourceComponent(), tuple.getSourceStreamId());
    if (!_pending.containsKey(id)) {
        _pending.put(id, new HashMap<GlobalStreamId, Tuple>());
    }
    Map<GlobalStreamId, Tuple> parts = _pending.get(id);
    if (parts.containsKey(streamId)) {
        throw new RuntimeException("Received same side of single join twice");
    }
    parts.put(streamId, tuple);
    if (parts.size() == _numSources) {
        _pending.remove(id);
        List<Object> joinResult = new ArrayList<Object>();
        for (String outField : _outFields) {
            GlobalStreamId loc = _fieldLocations.get(outField);
            joinResult.add(parts.get(loc).getValueByField(outField));
        }
        _collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);
        for (Tuple part : parts.values()) {
            _collector.ack(part);
        }
    }
}
Storm Common Patterns: TimeCacheMap
What problem does it solve?
It is often necessary to cache key-value pairs in memory, for example as a fast lookup table.
However, memory is limited, so you only want to keep the most recent entries; expired key-value pairs can be deleted. TimeCacheMap solves exactly this problem: it caches a map (a set of key-value pairs) for a bounded period of time.
1. Constructor parameters
TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback)
First, expirationSecs specifies how long an entry lives before it expires.
Then, numBuckets determines the time granularity. For example, with expirationSecs = 60 s and numBuckets = 10, each bucket roughly represents a 6 s time window, and expired data is cleaned up once per window (precisely, the cleaner sleeps expirationSecs/(numBuckets-1) between rotations, as discussed below).
Finally, ExpiredCallback<K, V> callback: when a timeout occurs, you may need to perform some action on the expired keys and values; define this callback for that purpose, for example to send fail notifications.
2. Data members
Core structure: the bucket list is a LinkedList, and each bucket is a HashMap<K, V>.
private LinkedList<HashMap<K, V>> _buckets;
Auxiliary members: a lock object and the periodic cleaner thread.
private final Object _lock = new Object();
private Thread _cleaner;
3. Constructor
The core of the constructor is starting the _cleaner daemon thread.
The _cleaner logic is actually very simple:
periodically remove the last bucket and add a new empty bucket at the head of the bucket list; if a callback is defined, invoke it for every expired key-value pair.
Thread safety is also taken care of here: the rotation is performed while holding synchronized(_lock).
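The rotation can be sketched standalone (a minimal sketch with plain Java collections and no threading, assuming the same removeLast/addFirst scheme as the real cleaner):

```java
import java.util.*;

public class RotationSketch {
    public static void main(String[] args) {
        LinkedList<HashMap<String, Integer>> buckets = new LinkedList<>();
        for (int i = 0; i < 3; i++) buckets.add(new HashMap<>());
        buckets.getFirst().put("k", 42); // cache one entry in the newest bucket

        // Each cleaner tick: drop the oldest bucket, add a fresh one at the head.
        for (int tick = 0; tick < 3; tick++) {
            HashMap<String, Integer> dead = buckets.removeLast();
            buckets.addFirst(new HashMap<>());
            if (!dead.isEmpty()) {
                System.out.println("expired: " + dead); // entry expires on the 3rd tick
            }
        }
    }
}
```

With 3 buckets, the entry survives two rotations and is handed to the (here simulated) expire callback on the third, matching the "numBuckets rotations until expiry" analysis below.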
The only subtle point is the sleep time,
i.e. ensuring that data is deleted after roughly the configured expirationSecs.
Definition: sleepTime = expirationMillis / (numBuckets - 1)
a. If a key is put just after the cleaner has removed the last bucket and added a new first bucket, it must survive a full numBuckets rotations before it expires, so its lifetime is
expirationSecs / (numBuckets - 1) * numBuckets = expirationSecs * (1 + 1 / (numBuckets - 1))
i.e. waiting the full numBuckets sleep intervals makes the lifetime slightly longer than expirationSecs.
b. If the cleaner performs a rotation immediately after the put, the key's lifetime is
expirationSecs / (numBuckets - 1) * numBuckets - expirationSecs / (numBuckets - 1) = expirationSecs
i.e. one sleepTime less than case a, exactly expirationSecs.
Therefore this scheme guarantees that data is deleted within the interval [expirationSecs, expirationSecs * (1 + 1 / (numBuckets - 1))].
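Plugging in the example numbers from above (expirationSecs = 60, numBuckets = 10) gives the bounds numerically:

```java
import java.util.Locale;

public class ExpiryBounds {
    public static void main(String[] args) {
        int expirationSecs = 60;
        int numBuckets = 10;
        // Cleaner sleeps expirationSecs / (numBuckets - 1) between rotations.
        double sleepTime = (double) expirationSecs / (numBuckets - 1);
        double lower = sleepTime * numBuckets - sleepTime; // case b: exactly expirationSecs
        double upper = sleepTime * numBuckets;             // case a: slightly longer
        System.out.printf(Locale.ROOT,
                "sleepTime=%.2f lower=%.2f upper=%.2f%n", sleepTime, lower, upper);
    }
}
```

So an entry lives between 60 s and about 66.67 s: more buckets mean a tighter bound at the cost of more frequent rotations.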
public TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback) {
    if (numBuckets < 2) {
        throw new IllegalArgumentException("numBuckets must be >= 2");
    }
    _buckets = new LinkedList<HashMap<K, V>>();
    for (int i = 0; i < numBuckets; i++) {
        _buckets.add(new HashMap<K, V>());
    }

    _callback = callback;
    final long expirationMillis = expirationSecs * 1000L;
    final long sleepTime = expirationMillis / (numBuckets - 1);
    _cleaner = new Thread(new Runnable() {
        public void run() {
            try {
                while (true) {
                    Map<K, V> dead = null;
                    Time.sleep(sleepTime);
                    synchronized (_lock) {
                        dead = _buckets.removeLast();
                        _buckets.addFirst(new HashMap<K, V>());
                    }
                    if (_callback != null) {
                        for (Entry<K, V> entry : dead.entrySet()) {
                            _callback.expire(entry.getKey(), entry.getValue());
                        }
                    }
                }
            } catch (InterruptedException ex) {
            }
        }
    });
    _cleaner.setDaemon(true);
    _cleaner.start();
}
4. Other operations
First, all operations use synchronized(_lock) to guarantee mutual exclusion between threads.
Second, every operation has complexity O(numBuckets), because each bucket is a HashMap and per-bucket access is O(1).
The most important operation is put: the new key-value pair is put only into the first (i.e. newest) bucket, and any cached entry with the same key is removed from the older buckets.
public void put(K key, V value) {
    synchronized (_lock) {
        Iterator<HashMap<K, V>> it = _buckets.iterator();
        HashMap<K, V> bucket = it.next();
        bucket.put(key, value);
        while (it.hasNext()) {
            bucket = it.next();
            bucket.remove(key);
        }
    }
}
The following operations are also supported:
public boolean containsKey(K key)
public V get(K key)
public Object remove(K key)
public int size() // sums the sizes of all buckets' HashMaps
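A sketch of how size() sums across buckets (standalone, using the same LinkedList-of-HashMap layout; the real implementation would also hold _lock):

```java
import java.util.*;

public class BucketSizeSketch {
    public static void main(String[] args) {
        LinkedList<HashMap<String, Integer>> buckets = new LinkedList<>();
        for (int i = 0; i < 3; i++) buckets.add(new HashMap<>());
        buckets.getFirst().put("a", 1);
        buckets.getLast().put("b", 2);

        // size() walks every bucket and accumulates each HashMap's size -- O(numBuckets).
        int size = 0;
        for (HashMap<String, Integer> bucket : buckets) size += bucket.size();
        System.out.println(size); // 2
    }
}
```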