storm-kafka Source Code Reading (4): Custom Scheme


KafkaSpout requires a Scheme implementation to turn raw Kafka messages into tuples. storm-kafka ships with StringScheme, StringKeyValueScheme, and so on.

These Scheme implementations are responsible for parsing the required data out of the raw message bytes.

public interface Scheme extends Serializable {
    public List<Object> deserialize(byte[] ser);
    public Fields getOutputFields();
}

An implementation provides the deserialization method and the names of the output fields. Let's look at the simple StringScheme implementation:

public class StringScheme implements Scheme {
    public static final String STRING_SCHEME_KEY = "str";

    public List<Object> deserialize(byte[] bytes) {
        return new Values(deserializeString(bytes));
    }

    public static String deserializeString(byte[] string) {
        try {
            return new String(string, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public Fields getOutputFields() {
        return new Fields(STRING_SCHEME_KEY);
    }
}

It simply returns the message as a String, and the spout later emits it under a single field named "str". So when StringScheme is used, a bolt can call tuple.getStringByField("str") to obtain the value. You may wonder why the spout is configured with new SchemeAsMultiScheme(new StringScheme()) instead. Let's look at the SchemeAsMultiScheme code:
public class SchemeAsMultiScheme implements MultiScheme {
    public final Scheme scheme;

    public SchemeAsMultiScheme(Scheme scheme) {
        this.scheme = scheme;
    }

    @Override
    public Iterable<List<Object>> deserialize(final byte[] ser) {
        List<Object> o = scheme.deserialize(ser);
        if (o == null) return null;
        else return Arrays.asList(o);
    }

    @Override
    public Fields getOutputFields() {
        return scheme.getOutputFields();
    }
}

public interface MultiScheme extends Serializable {
    public Iterable<List<Object>> deserialize(byte[] ser);
    public Fields getOutputFields();
}

It simply delegates to the wrapped Scheme and packs the result into a single-element list. In my view the extra wrapping is not strictly necessary, but storm-kafka expects a MultiScheme by default; the scheme is invoked when KafkaUtils parses a message:

public static Iterable<List<Object>> generateTuples(KafkaConfig kafkaConfig, Message msg) {
    Iterable<List<Object>> tups;
    ByteBuffer payload = msg.payload();
    if (payload == null) {
        return null;
    }
    ByteBuffer key = msg.key();
    if (key != null && kafkaConfig.scheme instanceof KeyValueSchemeAsMultiScheme) {
        tups = ((KeyValueSchemeAsMultiScheme) kafkaConfig.scheme)
                .deserializeKeyAndValue(Utils.toByteArray(key), Utils.toByteArray(payload));
    } else {
        tups = kafkaConfig.scheme.deserialize(Utils.toByteArray(payload));
    }
    return tups;
}

So if you have no special requirements, the schemes that ship with storm-kafka are sufficient.
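For reference, here is a minimal wiring sketch (not from the original article; the ZooKeeper address, topic, zkRoot and spout id are placeholder values) showing how the default StringScheme is typically plugged into a KafkaSpout:

// Minimal sketch assuming the storm-kafka 0.9.x API; all connection values are placeholders.
BrokerHosts hosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "my-topic", "/kafka-spout", "my-spout-id");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);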


Example

Kafka carries a wide variety of message formats, and the fields we want to emit vary just as much, so we often have to write our own Scheme. Two examples follow.


Example 1

First: by default only a single field is emitted. What if I need to emit more? Here I emit two fields. People have already done this on the Internet by adding the Kafka offset to the emitted tuple. The analysis goes as follows:

// Returns false if it's reached the end of the current batch
public EmitState next(SpoutOutputCollector collector) {
    if (_waitingToEmit.isEmpty()) {
        fill();
    }
    while (true) {
        MessageAndRealOffset toEmit = _waitingToEmit.pollFirst();
        if (toEmit == null) {
            return EmitState.NO_EMITTED;
        }
        Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);
        if (tups != null) {
            for (List<Object> tup : tups) {
                collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
            }
            break;
        } else {
            ack(toEmit.offset);
        }
    }
    if (!_waitingToEmit.isEmpty()) {
        return EmitState.EMITTED_MORE_LEFT;
    } else {
        return EmitState.EMITTED_END;
    }
}

From the above we can see that the offset is already carried in the KafkaMessageId used as the messageId when the tuple is emitted, so you might expect the receiving bolts to get the offset from the messageId. But look at the backtype.storm.daemon.executor code:

(log-message "Opening spout " component-id ":" (keys task-datas))
(doseq [[task-id task-data] task-datas
        :let [^ISpout spout-obj (:object task-data)
              tasks-fn (:tasks-fn task-data)
              send-spout-msg (fn [out-stream-id values message-id out-task-id]
                               (.increment emitted-count)
                               (let [out-tasks (if out-task-id
                                                 (tasks-fn out-task-id out-stream-id values)
                                                 (tasks-fn out-stream-id values))
                                     rooted? (and message-id has-ackers?)
                                     root-id (if rooted? (MessageId/generateId rand))
                                     out-ids (fast-list-for [t out-tasks]
                                               (if rooted? (MessageId/generateId rand)))]
                                 ...))]]
  ...)


This code shows that the tuple's root id is generated randomly; it has no relationship at all to new KafkaMessageId(_partition, toEmit.offset). So the offset has to be added to the emitted tuple manually, which means implementing a Scheme of our own. The code is as follows:

public class KafkaOffsetWrapperScheme implements Scheme {
    public static final String SCHEME_OFFSET_KEY = "offset";

    private String _offsetTupleKeyName;
    private Scheme _localScheme;

    public KafkaOffsetWrapperScheme() {
        _localScheme = new StringScheme();
        _offsetTupleKeyName = SCHEME_OFFSET_KEY;
    }

    public KafkaOffsetWrapperScheme(Scheme localScheme, String offsetTupleKeyName) {
        _localScheme = localScheme;
        _offsetTupleKeyName = offsetTupleKeyName;
    }

    public KafkaOffsetWrapperScheme(Scheme localScheme) {
        this(localScheme, SCHEME_OFFSET_KEY);
    }

    public List<Object> deserialize(byte[] bytes) {
        return _localScheme.deserialize(bytes);
    }

    public Fields getOutputFields() {
        List<String> outputFields = _localScheme.getOutputFields().toList();
        outputFields.add(_offsetTupleKeyName);
        return new Fields(outputFields);
    }
}



Here the scheme outputs two fields: one is "str", whose deserialization is handled by StringScheme (or any other scheme you plug in); the other is "offset". But how does the offset actually get into the emitted tuple? We can find the emit call in PartitionManager:

public EmitState next(SpoutOutputCollector collector) {
    if (_waitingToEmit.isEmpty()) {
        fill();
    }
    while (true) {
        MessageAndRealOffset toEmit = _waitingToEmit.pollFirst();
        if (toEmit == null) {
            return EmitState.NO_EMITTED;
        }
        Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);
        if (tups != null) {
            for (List<Object> tup : tups) {
                tup.add(toEmit.offset);
                collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));
            }
            break;
        } else {
            ack(toEmit.offset);
        }
    }
    if (!_waitingToEmit.isEmpty()) {
        return EmitState.EMITTED_MORE_LEFT;
    } else {
        return EmitState.EMITTED_END;
    }
}


The tuples themselves come from KafkaUtils.generateTuples(kafkaConfig, msg), which calls our scheme's deserialize:

public static Iterable<List<Object>> generateTuples(KafkaConfig kafkaConfig, Message msg) {
    Iterable<List<Object>> tups;
    ByteBuffer payload = msg.payload();
    if (payload == null) {
        return null;
    }
    ByteBuffer key = msg.key();
    if (key != null && kafkaConfig.scheme instanceof KeyValueSchemeAsMultiScheme) {
        tups = ((KeyValueSchemeAsMultiScheme) kafkaConfig.scheme)
                .deserializeKeyAndValue(Utils.toByteArray(key), Utils.toByteArray(payload));
    } else {
        tups = kafkaConfig.scheme.deserialize(Utils.toByteArray(payload));
    }
    return tups;
}


At this point the offset has been added to the emitted tuple. In a bolt you can read it with tuple.getValue(1) or tuple.getValueByField("offset").
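As an illustration (this bolt is my own sketch, not from the original article, and assumes the default field names "str" and "offset" used above), a bolt could read both values like this:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Illustrative bolt: reads the message body and the Kafka offset emitted by the wrapper scheme.
public class OffsetAwareBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String message = tuple.getStringByField("str");
        long offset = (Long) tuple.getValueByField("offset");
        // ... process the message together with its offset ...
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing is emitted downstream in this sketch
    }
}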

The only remaining step is to set the scheme to KafkaOffsetWrapperScheme when constructing the SpoutConfig.
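A hedged sketch of that wiring, reusing the placeholder spoutConfig from the earlier sketch:

// Hypothetical wiring: the custom scheme is wrapped in SchemeAsMultiScheme just like the built-in one.
spoutConfig.scheme = new SchemeAsMultiScheme(new KafkaOffsetWrapperScheme(new StringScheme()));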

Example 2

Second: the messages stored in Kafka may be in another format such as Thrift, Avro, or Protobuf. In that case you have to implement the deserialization yourself.

Avro is used as the example here (this is not an Avro tutorial; look up the basics yourself).

Suppose the messages stored in Kafka are Avro-encoded with the following Avro schema:

{"Namespace": "example. avro "," type ":" record "," name ":" User "," fields ": [{" name ":" name "," type ": "string" },{ "name": "favorite_number", "type": ["int", "null"] },{ "name": "favorite_color ", "type": ["string", "null"]}

Then we need to implement the Scheme interface.

public class AvroMessageScheme implements Scheme {
    private final static Logger logger = LoggerFactory.getLogger(AvroMessageScheme.class);

    private GenericRecord e2;
    private AvroRecord avroRecord;

    public AvroMessageScheme() {
    }

    @Override
    public List<Object> deserialize(byte[] bytes) {
        e2 = null;
        avroRecord = null;
        try {
            InputStream is = Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream("examples.avsc");
            Schema schema = new Schema.Parser().parse(is);
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
            Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            e2 = datumReader.read(null, decoder);
            avroRecord = new AvroRecord(e2);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new Values(avroRecord);
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("msg");
    }
}


AvroRecord here is a plain POJO. You could also emit the fields as strings directly, which would be more efficient. The AvroRecord POJO is as follows:

public class AvroRecord implements Serializable {
    private static final Logger logger = LoggerFactory.getLogger(AvroRecord.class);

    private String name;
    private int favorite_number;
    private String favorite_color;

    public AvroRecord(GenericRecord gr) {
        try {
            this.name = String.valueOf(gr.get("name"));
            this.favorite_number = Integer.parseInt(gr.get("favorite_number").toString());
            this.favorite_color = gr.get("favorite_color").toString();
        } catch (Exception e) {
            logger.error("read AvroRecord error!");
        }
    }

    @Override
    public String toString() {
        return "AvroRecord{" +
                "name='" + name + '\'' +
                ", favorite_number=" + favorite_number +
                ", favorite_color='" + favorite_color + '\'' +
                '}';
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getFavorite_color() {
        return favorite_color;
    }

    public void setFavorite_color(String favorite_color) {
        this.favorite_color = favorite_color;
    }

    public int getFavorite_number() {
        return favorite_number;
    }

    public void setFavorite_number(int favorite_number) {
        this.favorite_number = favorite_number;
    }
}

I have not tested this example myself, so use it with caution.
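Since the scheme above is untested, here is a hypothetical local check (my own sketch, not from the original article): it serializes a GenericRecord with Avro's binary encoder and feeds the bytes straight to deserialize, assuming examples.avsc is on the classpath as in the scheme.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroMessageSchemeCheck {
    public static void main(String[] args) throws Exception {
        // Load the same schema file the scheme reads from the classpath.
        InputStream is = Thread.currentThread().getContextClassLoader()
                .getResourceAsStream("examples.avsc");
        Schema schema = new Schema.Parser().parse(is);

        // Build a sample record matching the User schema above.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("favorite_number", 7);
        user.put("favorite_color", "blue");

        // Binary-encode it the way a producer would before writing to Kafka.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Feed the bytes to the scheme and print the resulting AvroRecord.
        List<Object> tuple = new AvroMessageScheme().deserialize(out.toByteArray());
        System.out.println(tuple.get(0));
    }
}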


Reference

https://blog.deck36.de/no-more-over-counting-making-counters-in-apache-storm-idempotent-using-redis-hyperloglog/
