Iterative case of Flink stream processing

Source: Internet
Author: User
Tags: apache flink

Flink's iteration support has so far centered on batch processing: earlier we discussed bulk iterations and incremental iterations, which belong primarily to the batch (DataSet) API and for which Flink provides targeted optimizations. But Flink also supports iterations in stream processing (DataStream). In this article we analyze iterations in stream processing; we will see that they resemble batch iterations, but the differences are also quite pronounced.

An iterative stream program allows you to define a step function and embed it in an IterativeStream. Because a stream program may never terminate, stream processing, unlike the iteration mechanism in batch processing, has no way to set a maximum number of iterations. Instead, you can specify a maximum time interval to wait for feedback input; if no feedback element arrives within that interval, the iteration terminates. By applying a split or filter transformation, you specify which part of the stream is fed back to the iteration head and which part is forwarded downstream. Here we use filter as an example to show the API usage pattern of an iterative stream program.

First, build an IterativeStream from the input stream. This is the starting point of the iteration, often called the iteration head:

IterativeStream<Integer> iteration = inputStream.iterate();

Next, we specify a series of transformations to express the logic executed during the iteration (here simply a map transformation as an example); the UDF accepted by the map API is the step function mentioned above:

DataStream<Integer> iteratedStream = iteration.map(/* this is executed many times */);

Then, since an iteration needs data fed back to the iteration head, we filter out of the iterated stream the partial stream of eligible elements, which we call the feedback stream:

DataStream<Integer> feedbackStream = iteratedStream.filter(/* one part of the stream */);

Feeding this stream back to the iteration head completes the logic of the iteration and lets us "close" the loop. You close an iteration by calling the closeWith instance method of IterativeStream (this can also be described as defining the iteration tail). The data stream passed to closeWith is fed back to the iteration head:

iteration.closeWith(feedbackStream);

In addition, a common pattern is to filter out the portion of the stream that should continue downstream. This filter in effect defines the logical condition for "terminating the iteration": eligible elements are forwarded downstream instead of being used for the next iteration:

DataStream<Integer> outputStream = iteratedStream.filter(/* some other part of the stream */);
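To build intuition for this feedback pattern without a Flink cluster, here is a plain-Java simulation (entirely hypothetical, not the Flink API) in which a queue plays the role of the iteration head: elements repeatedly pass through a step function, and a condition routes each element either back into the queue (the feedback stream) or downstream (the output stream).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of the iterate/filter/closeWith pattern; names are invented.
public class FeedbackLoopSketch {
    public static List<Integer> run(List<Integer> input) {
        Deque<Integer> feedback = new ArrayDeque<>(input); // plays the iteration head
        List<Integer> downstream = new ArrayList<>();
        while (!feedback.isEmpty()) {
            int iterated = feedback.poll() + 1;   // the "step function" (a map)
            if (iterated < 10) {
                feedback.add(iterated);           // feedbackStream -> closeWith
            } else {
                downstream.add(iterated);         // outputStream -> forwarded downstream
            }
        }
        return downstream;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(3, 7))); // prints [10, 10]
    }
}
```

Each element keeps looping until the termination condition (here, reaching 10) routes it downstream, mirroring how closeWith feeds one branch back while the other branch leaves the loop.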

As with our analysis of batch iterations, we take a concrete problem as the entry point to see how iterations in stream processing differ from those in batch processing.

First, the problem to be solved: produce a data stream of two-tuples (both fields are positive integers generated within a certain range, serving as the two initial values of a Fibonacci sequence), then iterate on each two-tuple to generate its Fibonacci sequence until a produced value exceeds a given threshold, at which point the iteration stops and the number of iterations is output.
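Before turning to the Flink program, the per-tuple logic can be checked in isolation with a few lines of plain Java (a hypothetical helper, not part of the Flink example). The threshold value is an assumption here; 100 is consistent with the sample output shown later.

```java
// Plain-Java check of the per-element iteration logic: given two seed values,
// count how many Fibonacci steps are needed until the latest value reaches
// the threshold. Assumes both seeds start below the threshold.
public class FibonacciIterationCount {
    static final int OVERFLOW_THRESHOLD = 100; // assumed threshold, not from the source

    static int countIterations(int first, int second) {
        int f2 = first, f3 = second, n = 0;
        while (f2 < OVERFLOW_THRESHOLD && f3 < OVERFLOW_THRESHOLD) {
            int next = f2 + f3; // newest Fibonacci value
            f2 = f3;            // keep only the latest two values
            f3 = next;
            n++;                // one more iteration performed
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countIterations(7, 14)); // prints 5
    }
}
```

For the seed pair (7, 14) the pair evolves 7,14 → 14,21 → 21,35 → 35,56 → 56,91 → 91,147, so five iterations are needed before a value reaches 100.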

This case is adapted from an iteration example shipped with the Flink source code; it is small yet illustrates the problem well. However, the variable naming in the sample code is slightly confusing, so I adjust it a little to make the program easier to follow.

Viewed from the perspective of a single element (one two-tuple), the execution proceeds as follows:

n denotes the number of iterations, initialized to 0 (in the initial map transformation); m denotes the threshold used to determine when the iteration stops;

In addition, T followed by a field index, such as T2, denotes the tuple field at index 2 (that is, the third field, since indexing starts at zero). Note also that T keeps changing as the iteration proceeds.

Having analyzed the core process of the problem above, we now turn to building the iterative stream program.

First, we create the initial stream object inputStream via a source function:

DataStream<Tuple2<Integer, Integer>> inputStream = env.addSource(new RandomFibonacciSource());

The source function generates a sequence of two-tuples; the two field values of each tuple are randomly generated and serve as the initial values of the Fibonacci sequence:

private static class RandomFibonacciSource
        implements SourceFunction<Tuple2<Integer, Integer>> {

    private Random random = new Random();
    private volatile boolean isRunning = true;
    private int counter = 0;

    @Override
    public void run(SourceContext<Tuple2<Integer, Integer>> ctx) throws Exception {
        while (isRunning && counter < MAX_RANDOM_VALUE) {
            int first = random.nextInt(MAX_RANDOM_VALUE / 2 - 1) + 1;
            int second = random.nextInt(MAX_RANDOM_VALUE / 2 - 1) + 1;

            if (first > second) continue;

            ctx.collect(new Tuple2<Integer, Integer>(first, second));
            counter++;
            Thread.sleep(50L); // the exact interval is garbled in the source; 50 ms assumed
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}

To store the newly computed values of the Fibonacci sequence as well as the accumulated number of iterations, we convert the two-tuple data stream into a five-tuple data stream and create the iterative stream from it:

IterativeStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> iterativeStream =
        inputStream.map(new TupleTransformMapFunction()).iterate(5000);

Note that the parameter 5000 passed to the iterate API in the snippet above does not mean 5,000 iterations; it is the maximum time interval to wait for feedback input, here 5 seconds. A stream is considered unbounded, so it is impossible to specify a maximum number of iterations as in batch iterations. Instead, you may specify a maximum wait interval: if no element arrives within that interval, the iteration terminates.

The tuple-transforming map function is implemented as follows:

private static class TupleTransformMapFunction extends
        RichMapFunction<Tuple2<Integer, Integer>,
                Tuple5<Integer, Integer, Integer, Integer, Integer>> {

    @Override
    public Tuple5<Integer, Integer, Integer, Integer, Integer> map(
            Tuple2<Integer, Integer> inputTuples) throws Exception {
        return new Tuple5<Integer, Integer, Integer, Integer, Integer>(
                inputTuples.f0, inputTuples.f1, inputTuples.f0, inputTuples.f1, 0);
    }
}

In the five-tuple above, the fields at indices 0 and 1 always hold the two originally generated values, while the last three fields change as the iteration proceeds.

After the IterativeStream is created, we apply the step function that computes the Fibonacci sequence to it, producing the Fibonacci stream fibonacciStream:

DataStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> fibonacciStream =
        iterativeStream.map(new FibonacciCalcStepFunction());

The name fibonacciStream here is merely a label; its data is not a literal Fibonacci sequence but the five-tuples described above.

The step function which is used to calculate the Fibonacci sequence is implemented as follows:

private static class FibonacciCalcStepFunction extends
        RichMapFunction<Tuple5<Integer, Integer, Integer, Integer, Integer>,
                Tuple5<Integer, Integer, Integer, Integer, Integer>> {

    @Override
    public Tuple5<Integer, Integer, Integer, Integer, Integer> map(
            Tuple5<Integer, Integer, Integer, Integer, Integer> inputTuple) throws Exception {
        return new Tuple5<Integer, Integer, Integer, Integer, Integer>(
                inputTuple.f0, inputTuple.f1, inputTuple.f3,
                inputTuple.f2 + inputTuple.f3, ++inputTuple.f4);
    }
}

As mentioned above, the last three fields change: before the new value is computed, the latest element of the sequence, held in f3, is preserved by moving it to f2; then the sum f2 + f3 produces the newest value, which is stored in f3; and f4, the iteration counter, is incremented.
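To make the field movement concrete, here is a small plain-Java trace (hypothetical, not part of the Flink program) of one five-tuple evolving under the step function for the seed pair (1, 2):

```java
import java.util.Arrays;

// Trace of the step function (f0, f1, f2, f3, n) -> (f0, f1, f3, f2 + f3, n + 1),
// modeling the five-tuple as a plain int array.
public class TupleStepTrace {
    static int[] step(int[] t) {
        return new int[]{t[0], t[1], t[3], t[2] + t[3], t[4] + 1};
    }

    public static void main(String[] args) {
        int[] t = {1, 2, 1, 2, 0}; // initial five-tuple for seeds (1, 2)
        for (int i = 0; i < 3; i++) {
            t = step(t);
            System.out.println(Arrays.toString(t));
        }
        // prints: [1, 2, 2, 3, 1], [1, 2, 3, 5, 2], [1, 2, 5, 8, 3]
    }
}
```

The first two fields never change; the middle two always hold the latest pair of Fibonacci values; the last field counts the iterations.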

As the number of iterations increases, the entire sequence is not preserved; only the original two values and the latest two values are kept. There is no need to keep the whole sequence, because the termination check only involves the latest two elements.

For each element above we compute the new value of the Fibonacci sequence, producing fibonacciStream, but we must still examine the latest two values to see whether they exceed the specified threshold. Tuples that exceed the threshold are output and no longer participate in the iteration. This yields two different branches, so we build a split stream for them:

SplitStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> branchedStream =
        fibonacciStream.split(new FibonacciOverflowSelector());

The selector that judges whether a tuple has exceeded the threshold and separates the branches is implemented as follows:

private static class FibonacciOverflowSelector implements
        OutputSelector<Tuple5<Integer, Integer, Integer, Integer, Integer>> {

    @Override
    public Iterable<String> select(
            Tuple5<Integer, Integer, Integer, Integer, Integer> inputTuple) {
        if (inputTuple.f2 < OVERFLOW_THRESHOLD && inputTuple.f3 < OVERFLOW_THRESHOLD) {
            return Collections.singleton(ITERATE_FLAG);
        }
        return Collections.singleton(OUTPUT_FLAG);
    }
}

In the select method, we mark the two branches with different constant identifiers: ITERATE_FLAG (continue iterating) and OUTPUT_FLAG (output directly).

Once the branched stream is produced, we can select the different branches for iteration or output. For the branch that needs to keep iterating, we feed it back to the iteration head via the iterative stream's closeWith method:

iterativeStream.closeWith(branchedStream.select(ITERATE_FLAG));

For tuples that no longer need to iterate, we let them flow directly downstream; here we simply "reshape" the stream and then print it:

DataStream<Tuple3<Integer, Integer, Integer>> outputStream = branchedStream
        .select(OUTPUT_FLAG).map(new BuildOutputTupleMapFunction());
outputStream.print();

The so-called reshaping simply reduces the earlier five-tuple back to a three-tuple, as follows:

private static class BuildOutputTupleMapFunction extends
        RichMapFunction<Tuple5<Integer, Integer, Integer, Integer, Integer>,
                Tuple3<Integer, Integer, Integer>> {

    @Override
    public Tuple3<Integer, Integer, Integer> map(
            Tuple5<Integer, Integer, Integer, Integer, Integer> inputTuple) throws Exception {
        return new Tuple3<Integer, Integer, Integer>(
                inputTuple.f0, inputTuple.f1, inputTuple.f4);
    }
}

Eventually we will get output similar to the following:

(7,14,5)
(18,37,3)
(3,46,3)
(23,32,3)
(31,43,2)
(13,45,2)
(37,42,2)
......

The first two integers are the two initial values of the Fibonacci sequence, and the third indicates how many iterations were needed before the latest two values of the sequence exceeded the threshold. For example, for (18, 37) the latest pair evolves 18,37 → 37,55 → 55,92 → 92,147; after 3 iterations the latest value 147 exceeds the threshold (the sample output is consistent with a threshold of 100), so (18,37,3) is emitted.

Finally, the complete skeleton of the program is as follows:

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment
            .getExecutionEnvironment().setBufferTimeout(1);

    DataStream<Tuple2<Integer, Integer>> inputStream =
            env.addSource(new RandomFibonacciSource());

    IterativeStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> iterativeStream =
            inputStream.map(new TupleTransformMapFunction()).iterate(5000);

    DataStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> fibonacciStream =
            iterativeStream.map(new FibonacciCalcStepFunction());

    SplitStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> branchedStream =
            fibonacciStream.split(new FibonacciOverflowSelector());

    iterativeStream.closeWith(branchedStream.select(ITERATE_FLAG));

    DataStream<Tuple3<Integer, Integer, Integer>> outputStream =
            branchedStream.select(OUTPUT_FLAG).map(new BuildOutputTupleMapFunction());

    outputStream.print();

    env.execute("Streaming Iteration Example");
}
