Storm starter - SingleJoinExample


Storm common patterns - stream aggregation (join)

Topology

1. Define two spouts, genderSpout and ageSpout.
Their fields are ("id", "gender") and ("id", "age"); the final join result should be ("id", "gender", "age").

2. When constructing SingleJoinBolt, the outFields must be passed as a parameter to tell the bolt which fields to include in the join result.
Both spouts are connected with fieldsGrouping on ("id"), which ensures that tuples with the same id are sent to the same task.

public class SingleJoinExample {
    public static void main(String[] args) {
        FeederSpout genderSpout = new FeederSpout(new Fields("id", "gender"));
        FeederSpout ageSpout = new FeederSpout(new Fields("id", "age"));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("gender", genderSpout);
        builder.setSpout("age", ageSpout);
        builder.setBolt("join", new SingleJoinBolt(new Fields("gender", "age")))
                .fieldsGrouping("gender", new Fields("id"))
                .fieldsGrouping("age", new Fields("id"));
    }
}
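The storm-starter demo then runs this topology on a local in-process cluster and feeds the two spouts. A minimal sketch of the submission part (the topology name "join-example" is just illustrative):

Config conf = new Config();
conf.setDebug(true);

// Run the topology in-process for testing.
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("join-example", conf, builder.createTopology());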

SingleJoinBolt

Because the bolt cannot guarantee that it receives all tuples for an id at the same time, it must first cache the received tuples in memory; only after all tuples for that id have arrived can it perform the join.

After the join, the tuples can be deleted from the cache. However, if some tuple for an id is lost, the other tuples for that id would stay cached forever.

To solve this problem, set a timeout on the cached data, delete the data when it expires, and send fail notifications for those tuples.

It is easy to see that TimeCacheMap fits this scenario:

TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>

List<Object>: the fields joined on. In this example it is just "id"; it is a List in order to support joins on multiple fields.

Map<GlobalStreamId, Tuple>: records which tuple came from which stream.

In this example, the following two K/V entries are extracted from the TimeCacheMap bucket and then joined:

{id, {ageStream, (id, age)}}

{id, {genderStream, (id, gender)}}

 

1. Prepare

The prepare logic is usually very simple, but here it is fairly involved...

A. Set the timeout and the ExpireCallback.

The timeout is set to Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, which defaults to 30 s and can be adjusted for the scenario.

We should try to keep the emit order consistent across the different spouts, so that tuples with the same id are received within a short interval; in this example, each spout should emit sorted by id.

Otherwise, if ("id", "gender") is emitted first and ("id", "age") is emitted last, the cached gender tuple may expire before its matching age tuple ever arrives.
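For illustration, a sketch of feeding the two FeederSpout instances from the topology code above grouped by id, so that both sides of the join are emitted back to back (the specific ids and values are made up for the example):

// Feed both spouts in the same id order, keeping the two sides close together in time.
for (int id = 0; id < 10; id++) {
    genderSpout.feed(new Values(id, id % 2 == 0 ? "male" : "female"));
    ageSpout.feed(new Values(id, 20 + id));
}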

The ExpireCallback is set so that it fails all tuples that have timed out.

    private class ExpireCallback implements TimeCacheMap.ExpiredCallback<List<Object>, Map<GlobalStreamId, Tuple>> {
        @Override
        public void expire(List<Object> id, Map<GlobalStreamId, Tuple> tuples) {
            for (Tuple tuple : tuples.values()) {
                _collector.fail(tuple);
            }
        }
    }

B. Work out _idFields (the fields shared by all sources, which can therefore serve as the join key) and _fieldLocations (which spout stream each outField comes from; for example, gender comes from the gender stream).

Use context.getThisSources() to retrieve the list of spout sources, and getComponentOutputFields to obtain each source's field list.

_idFields: the logic is simple. Each source's fields are intersected with idFields using retainAll (keeping the common part of the Set), which leaves the fields shared by all spouts.

_fieldLocations: match _outFields against each spout's fields, and record which stream each matched output field comes from.

In fact, I think this preparation work could simply be specified via parameters at construction time, which would avoid all this trouble.

For example, the parameters could be something like ("id", {"gender", genderStream}, {"age", ageStream}).
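A minimal sketch of what such a parameterized constructor might look like. This is hypothetical, not part of storm-starter; the parameter names and the explicit field-to-stream map are made up for illustration:

// Hypothetical alternative: pass the id fields and the outField-to-stream mapping explicitly,
// so prepare() would not have to infer them from the TopologyContext.
public SingleJoinBolt(Fields idFields, Map<String, GlobalStreamId> fieldLocations) {
    _idFields = idFields;                                               // e.g. new Fields("id")
    _outFields = new Fields(new ArrayList<String>(fieldLocations.keySet()));
    _fieldLocations = fieldLocations;                                   // e.g. "gender" -> gender stream, "age" -> age stream
}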

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _fieldLocations = new HashMap<String, GlobalStreamId>();
        _collector = collector;
        int timeout = ((Number) conf.get(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS)).intValue();
        _pending = new TimeCacheMap<List<Object>, Map<GlobalStreamId, Tuple>>(timeout, new ExpireCallback());
        _numSources = context.getThisSources().size();
        Set<String> idFields = null;
        for (GlobalStreamId source : context.getThisSources().keySet()) {
            Fields fields = context.getComponentOutputFields(source.get_componentId(), source.get_streamId());
            Set<String> setFields = new HashSet<String>(fields.toList());
            if (idFields == null) idFields = setFields;
            else idFields.retainAll(setFields);

            for (String outfield : _outFields) {
                for (String sourcefield : fields) {
                    if (outfield.equals(sourcefield)) {
                        _fieldLocations.put(outfield, source);
                    }
                }
            }
        }
        _idFields = new Fields(new ArrayList<String>(idFields));

        if (_fieldLocations.size() != _outFields.size()) {
            throw new RuntimeException("Cannot find all outfields among sources");
        }
    }

2. execute

A. Extract the id (_idFields) and the streamId from the tuple.
If the id is not yet in _pending (the TimeCacheMap), create a new HashMap for it and put it into the bucket.

B. Retrieve the Map<GlobalStreamId, Tuple> parts corresponding to this id, and check whether the received tuple is a duplicate (a tuple with the same id emitted from the same stream).

Then put the new tuple into the parts map for this id: parts.put(streamId, tuple);

C. Check whether the size of parts equals the number of spout sources; when it does (2 in this example), tuples from both the gender stream and the age stream have been received.

Remove the cached data for this id from _pending (the TimeCacheMap), because the join can now be done and there is no need to keep waiting.

Then extract the value for each output field from the corresponding stream's tuple, according to _outFields and _fieldLocations.

The final emitted result is (gender, age), built from the cached (id, gender) and (id, age) tuples.

The emit is anchored to all the input tuples: _collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);

Finally, ack all the tuples.

    @Override
    public void execute(Tuple tuple) {
        List<Object> id = tuple.select(_idFields);
        GlobalStreamId streamId = new GlobalStreamId(tuple.getSourceComponent(), tuple.getSourceStreamId());
        if (!_pending.containsKey(id)) {
            _pending.put(id, new HashMap<GlobalStreamId, Tuple>());
        }
        Map<GlobalStreamId, Tuple> parts = _pending.get(id);
        if (parts.containsKey(streamId)) throw new RuntimeException("Received same side of single join twice");
        parts.put(streamId, tuple);
        if (parts.size() == _numSources) {
            _pending.remove(id);
            List<Object> joinResult = new ArrayList<Object>();
            for (String outField : _outFields) {
                GlobalStreamId loc = _fieldLocations.get(outField);
                joinResult.add(parts.get(loc).getValueByField(outField));
            }
            _collector.emit(new ArrayList<Tuple>(parts.values()), joinResult);

            for (Tuple part : parts.values()) {
                _collector.ack(part);
            }
        }
    }

 

TimeCacheMap

Storm common patterns - TimeCacheMap

What problem does it solve?

It is often necessary to cache key-value pairs in memory, for example as a fast lookup table.

However, memory is limited, so you only want to keep the most recent entries and delete expired key-value pairs. TimeCacheMap solves exactly this: it caches a map (a set of K/V pairs) for a certain period of time.

1. Constructor parameters

TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback)

First, expirationSecs specifies how long an entry lives before it expires.

Then, numBuckets specifies the time granularity. For example, if expirationSecs is 60 s and numBuckets is 10, each bucket covers roughly a 6 s time window, and expired data is cleaned out roughly every 6 s (the exact interval is discussed under sleepTime below).

Finally, ExpiredCallback<K, V> callback: when entries time out you may want to do something with the expired K and V, such as sending a fail notification; this callback lets you define that behavior.
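For illustration, a minimal usage sketch of these three parameters (the String/Integer key and value types and the println are just for the example):

// Cache entries for 60 seconds, spread over 10 buckets; log expired entries.
TimeCacheMap<String, Integer> cache = new TimeCacheMap<String, Integer>(
        60, 10,
        new TimeCacheMap.ExpiredCallback<String, Integer>() {
            public void expire(String key, Integer value) {
                System.out.println("expired: " + key + " -> " + value);
            }
        });
cache.put("a", 1);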

2. Data members

Core structure: the bucket list is implemented with a LinkedList, and each bucket is a HashMap<K, V>.

private LinkedList<HashMap<K, V>> _buckets; 

Auxiliary members: the lock object and the periodic cleaner thread.

private final Object _lock = new Object();
private Thread _cleaner;

3. Constructor

In fact, the core of the constructor is starting the _cleaner daemon thread.

The _cleaner logic is actually very simple:

it periodically deletes the last bucket and adds a new bucket at the head of the bucket list; if a callback is defined, the callback is invoked for every timed-out K/V pair.

Thread safety is also taken into account: the operation is performed while holding synchronized(_lock).

The only thing that really needs discussion is sleepTime, i.e. whether data is actually deleted after the configured expirationSecs.

Definition: sleepTime = expirationMillis / (numBuckets - 1)

A. If the cleaner has just deleted the last bucket and added a new first bucket when K, V is put, the entry's lifetime is

expirationSecs / (numBuckets - 1) * numBuckets = expirationSecs * (1 + 1 / (numBuckets - 1))

because it has to survive a full numBuckets sleepTime intervals, so the lifetime is slightly greater than expirationSecs.

B. If the cleaner starts a clean pass right after K, V is put, the entry's lifetime is

expirationSecs / (numBuckets - 1) * numBuckets - expirationSecs / (numBuckets - 1) = expirationSecs

This case waits one sleepTime less than case A, and the lifetime is exactly expirationSecs.

Therefore, this approach guarantees that the data is deleted within the interval [B, A], i.e. between expirationSecs and expirationSecs * (1 + 1 / (numBuckets - 1)).
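A quick worked example of these bounds, using the numbers from the earlier example (plain arithmetic, not part of TimeCacheMap):

public class ExpirationBoundsExample {
    public static void main(String[] args) {
        int expirationSecs = 60, numBuckets = 10;
        long sleepMs = expirationSecs * 1000L / (numBuckets - 1); // ~6667 ms between cleanup passes
        long upperBoundMs = sleepMs * numBuckets;                 // case A: ~66.7 s
        long lowerBoundMs = upperBoundMs - sleepMs;               // case B: ~60 s (exactly expirationSecs, up to rounding)
        System.out.println(lowerBoundMs + " ms <= lifetime <= " + upperBoundMs + " ms");
    }
}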

    public TimeCacheMap(int expirationSecs, int numBuckets, ExpiredCallback<K, V> callback) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("numBuckets must be >= 2");
        }
        _buckets = new LinkedList<HashMap<K, V>>();
        for (int i = 0; i < numBuckets; i++) {
            _buckets.add(new HashMap<K, V>());
        }
        _callback = callback;
        final long expirationMillis = expirationSecs * 1000L;
        final long sleepTime = expirationMillis / (numBuckets - 1);
        _cleaner = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Map<K, V> dead = null;
                        Time.sleep(sleepTime);
                        synchronized (_lock) {
                            dead = _buckets.removeLast();
                            _buckets.addFirst(new HashMap<K, V>());
                        }
                        if (_callback != null) {
                            for (Entry<K, V> entry : dead.entrySet()) {
                                _callback.expire(entry.getKey(), entry.getValue());
                            }
                        }
                    }
                } catch (InterruptedException ex) {
                }
            }
        });
        _cleaner.setDaemon(true);
        _cleaner.start();
    }

4. Other operations

First, all operations use synchronized(_lock) to ensure mutual exclusion between threads.

Second, the complexity of every operation is O(numBuckets), because each bucket is a HashMap and the per-bucket work is an O(1) operation.

 

The most important operation is put. The new K, V is put only into the first (that is, the newest) bucket, and any cached entry with the same key in the older buckets is removed.

public void put(K key, V value) {
    synchronized (_lock) {
        Iterator<HashMap<K, V>> it = _buckets.iterator();
        HashMap<K, V> bucket = it.next();
        bucket.put(key, value);
        while (it.hasNext()) {
            bucket = it.next();
            bucket.remove(key);
        }
    }
}

The following operations are also supported,

public boolean containsKey(K key)

public V get(K key)

public Object remove(K key)

public int size() // sums the sizes of all buckets' HashMaps
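For reference, a sketch of how containsKey and get work given the bucket structure above (written to be consistent with the data members shown earlier, not quoted from the source):

// Scan the buckets from newest to oldest under the lock; O(numBuckets) overall,
// since each per-bucket lookup is an O(1) HashMap operation.
public boolean containsKey(K key) {
    synchronized (_lock) {
        for (HashMap<K, V> bucket : _buckets) {
            if (bucket.containsKey(key)) return true;
        }
        return false;
    }
}

public V get(K key) {
    synchronized (_lock) {
        for (HashMap<K, V> bucket : _buckets) {
            if (bucket.containsKey(key)) return bucket.get(key);
        }
        return null;
    }
}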
