Apache Flink Source Code Analysis: stream-operator

Source: Internet
Author: User
<span id="Label3"></p><p>The previous article covered the <code>transformation</code> in Flink streaming. A <code>transformation</code> can be regarded as the <strong>static representation</strong> of the building blocks needed to write a Flink program and construct a stream-processing pipeline. This article explores the <strong>dynamic representation</strong> of a <code>transformation</code> in the Flink runtime: the <code>operator</code>. The mapping between the two is shown in the following figure:</p>
<blockquote> <p>For a detailed discussion, see the earlier article on core concepts in Flink.</p> </blockquote>StreamOperator<p><code>StreamOperator</code> is the ultimate base type of every <code>operator</code>. Operators are classified by their number of input streams:</p>
<ul>
<li>No input: StreamSource</li>
<li>Single-stream input: OneInputStreamOperator</li>
<li>Two-stream input: TwoInputStreamOperator</li>
</ul>
<p>The core abstract methods related to the life cycle are:</p>
<ul>
<li>setup: initializes the <code>operator</code></li>
<li>open: executed before any element is processed; it typically contains the initialization logic of the <code>operator</code></li>
<li>close: called after all elements have been passed to the <code>operator</code> for processing</li>
<li>dispose: executed in the final phase of the life cycle of the <code>operator</code>; mainly used to release resources</li>
</ul>
<blockquote> <p><code>StreamOperator</code> and its implementation classes also contain some logic related to state recovery. That logic is not the subject of this article and is not explored here.</p> </blockquote>
<p>Let's take a look at the class diagram for the entire package:</p>
<p><p>The analysis of each <code>operator</code> below generally follows the above classification and class 
hierarchy.</p></p>StreamSource<p>As the starting point of a stream-processing DAG, the source <code>operator</code> is clearly special compared with the other operators (as the class inheritance diagram shows).</p>
<p>It takes a <code>SourceFunction</code> instance. Its <code>chaining strategy</code> is <code>HEAD</code>, which means it cannot have a predecessor <code>operator</code> but can act as the predecessor of other operators (chaining strategies are discussed later in this article).</p>
<pre class="prettyprint"><code class="hljs java">this.chainingStrategy = ChainingStrategy.HEAD;</code></pre>
<p><p>The implementation of <code>StreamSource</code> is slightly more involved because it has to provide implementations of <code>SourceFunction.SourceContext</code>, the context interface of the <code>SourceFunction</code> discussed in an earlier article. 
There are three implementations, corresponding to the three time characteristics in Flink discussed earlier:</p></p>
<ul>
<li>NonTimestampContext: the <code>SourceContext</code> for <code>ProcessingTime</code>; it sets the timestamp to <strong>-1</strong> and does not emit a <code>watermark</code></li>
<li>AutomaticWatermarkContext: the <code>SourceContext</code> for <code>IngestionTime</code>, which provides an automatic <code>watermark</code> emission mechanism</li>
<li>ManualWatermarkContext: the <code>SourceContext</code> for <code>EventTime</code>, where the <code>watermark</code> is emitted manually</li>
</ul>
<p>This correspondence is also reflected in the implementation of the <code>run</code> method of <code>StreamSource</code>:</p>
<pre class="prettyprint"><code class="hljs java">switch (timeCharacteristic) {
    case EventTime:
        ctx = new ManualWatermarkContext&lt;&gt;(this, lockingObject, collector);
        break;
    case IngestionTime:
        ctx = new AutomaticWatermarkContext&lt;&gt;(this, lockingObject, collector,
                getRuntimeContext().getExecutionConfig().
                getAutoWatermarkInterval());
        break;
    case ProcessingTime:
        ctx = new NonTimestampContext&lt;&gt;(this, lockingObject, collector);
        break;
    default:
        throw new Exception(String.valueOf(timeCharacteristic));
}</code></pre>
<p>Inside this <code>run</code> method, the <code>run</code> method of the <code>SourceFunction</code> is called:</p>
<pre class="prettyprint"><code class="hljs java">try {
    userFunction.run(ctx);

    // if we get here, then the user function either exited after being done (finite source)
    // or the function was canceled or stopped.
    // For the finite source case, we should emit
    // a final watermark that indicates that we reached the end of event-time
    if (!isCanceledOrStopped()) {
        ctx.emitWatermark(Watermark.MAX_WATERMARK);
    }
} finally {
    // make sure the context is closed in any case
    ctx.close();
}</code></pre>
<p><code>StreamSource</code> uses a single property, <code>canceledOrStopped</code>, to control stopping of the <code>SourceFunction</code>.</p>
<p>The overall operating logic of <code>StreamSource</code> is thus expressed through the <code>run</code> method and the <code>cancel</code>/stop control logic.</p>NonTimestampContext<p><code>NonTimestampContext</code> ignores timestamps, so its implementation differs slightly in the following two methods:</p>
<pre class="prettyprint"><code class="hljs java">public void collectWithTimestamp(T element, long timestamp) {
    // ignore the timestamp
    collect(element);
}</code></pre>
<p>and</p>
<pre class="prettyprint"><pre class="prettyprint"><code class=" hljs cs">public void emitWatermark(Watermark mark) {
    owner.checkAsyncException();
    <span class="hljs-comment">// do nothing
    // else</span>
}</code></pre></pre>
<p>The first method ignores the timestamp, and the second does not emit a <code>watermark</code>.</p>ManualWatermarkContext<p>Nothing here needs special explanation.</p>AutomaticWatermarkContext<p>This class implements automatic <code>watermark</code> emission. Its constructor takes a <code>watermarkInterval</code> parameter that specifies the interval between automatically emitted watermarks. The mechanism is to create a separate emitter thread that fires at the specified interval:</p>
<pre class="prettyprint"><code class="hljs java">this.scheduleExecutor = Executors.newScheduledThreadPool(1);

this.watermarkTimer = scheduleExecutor.scheduleAtFixedRate(new Runnable() {
    @Override
    public void run() {
        final long currentTime = System.currentTimeMillis();

        if (currentTime &gt; nextWatermarkTime) {
            // align the watermarks across all machines. this will ensure that we
            // don't have watermarks that creep along at different intervals because
            <span class="hljs-comment">// the
            // machine clocks are out of sync</span>
            final long watermarkTime = currentTime - (currentTime % watermarkInterval);

            synchronized (lockingObjectParam) {
                if (currentTime &gt; nextWatermarkTime) {
                    outputParam.emitWatermark(new Watermark(watermarkTime));
                    nextWatermarkTime += watermarkInterval;
                }
            }
        }
    }
}, 0, watermarkInterval, TimeUnit.MILLISECONDS);</code></pre>
<p>In addition to this timer-based mechanism that fires at a fixed rate, the current time is also checked whenever the <code>collect</code> method is called, and a <code>watermark</code> is emitted if the emission condition is met.</p>
<p>Because the class implements <strong>automatic emission</strong> through the timer set up in the constructor, its <code>emitWatermark</code> method no longer needs to contain emission logic (the user program is not expected to call <code>emitWatermark</code>). Instead, the main job of that method in this class is to <strong>stop</strong> the automatic emission. The trigger for stopping is the signal for the last element (the timestamp of the last element is set to <code>Long.MAX_VALUE</code>): when <code>emitWatermark</code> receives this marker, it forwards it downstream and shuts down the timer thread.</p>OneInputStreamOperator<p><p>The interface for an <code>operator</code> with a single input stream; it extends <code>StreamOperator</code>. 
It adds two interface methods:</p></p>
<ul>
<li>processElement: processes an element that arrives at the <code>operator</code></li>
<li>processWatermark: processes a <code>Watermark</code></li>
</ul>TwoInputStreamOperator<p>The interface for an <code>operator</code> that takes two streams as input; it likewise extends <code>StreamOperator</code>. It adds several interface methods:</p>
<ul>
<li>processElement1: processes an element from the first input</li>
<li>processElement2: processes an element from the second input</li>
<li>processWatermark1: processes a <code>Watermark</code> from the first input</li>
<li>processWatermark2: processes a <code>Watermark</code> from the second input</li>
</ul>Auxiliary implementation classes<br>Output<p>An extension of <code>Collector</code> that adds the ability to emit a <code>Watermark</code>. The interface is primarily used by an <code>operator</code> to emit elements or a <code>Watermark</code>.</p>
<ul>
<li>emitWatermark: emits a <code>Watermark</code>, which is broadcast to every downstream <code>operator</code></li>
</ul>TimeCharacteristic<p><p>Flink distinguishes three kinds of time in time-related processing. 
The time type is defined by this enumeration in Flink:</p></p>
<ul>
<li>ProcessingTime</li>
<li>IngestionTime</li>
<li>EventTime</li>
</ul>
<blockquote> <p>These three kinds of time have been covered many times before, so they are not repeated here.</p> </blockquote>TimestampedCollector<p>A wrapper implementation of <code>Output</code> that sets the timestamp on the elements it emits.</p>AbstractStreamOperator<p>This abstract class provides basic support for implementing a concrete <code>operator</code>. Every built-in <code>operator</code> in Flink inherits from <code>AbstractStreamOperator</code>, directly or indirectly.</p>
<p>It contains three main groups of properties:</p>
<ul>
<li>Configuration properties</li>
<li>Runtime properties</li>
<li>Key-value state properties</li>
</ul>
<p>Most of its methods are helper methods; the one worth mentioning is <code>setup</code>. It shows how the identifier of every <code>operator</code> is generated:</p>
<pre class="prettyprint"><code class="hljs java">String operatorIdentifier = getClass().getSimpleName() + "_" + config.getVertexID() + "_" + runtimeContext.getIndexOfThisSubtask();</code></pre>
<p><p>The identifier is made up of three segments joined by "_". 
The three segments are the class name, the <code>vertex id</code>, and the index of the current <code>subtask</code>.</p></p>
<p>Then, based on this identifier, a <code>stateBackend</code> for storing state is created:</p>
<pre class="prettyprint"><code class="hljs java">stateBackend = container.createStateBackend(operatorIdentifier, keySerializer);</code></pre>
<p>The <code>stateBackend</code> is closed in the <code>dispose</code> method.</p>
<blockquote> <p><code>AbstractStreamOperator</code> has no substantive implementation of life-cycle methods such as open/close; their implementation is deferred to <code>AbstractUdfStreamOperator</code>, described below.</p> </blockquote>AbstractUdfStreamOperator<p>This class provides template implementations mainly for the life-cycle methods (open/close/dispose) of an <code>operator</code>. These implementations uniformly operate on the user-defined <code>Function</code> instance (abbreviated <code>udf</code>).</p>ChainingStrategy<p><p>This enumeration defines the <code>chain strategy</code> (chaining policy) of an <code>operator</code>. When an <code>operator</code> is chained to its predecessor <code>operator</code>, the two are executed in the same thread. The default value for a <code>StreamOperator</code> is <code>HEAD</code>, which means it has no predecessor <code>operator</code> but may itself become the predecessor of another <code>operator</code>. 
Most <code>StreamOperator</code> implementations override this value with <code>ALWAYS</code>, indicating that they will be chained to a predecessor <code>operator</code>.</p></p>
<p>It has three enumeration values:</p>
<ul>
<li>ALWAYS: as mentioned above, allows the current <code>operator</code> to be chained to its predecessor <code>operator</code>. For performance it is generally <strong>good practice</strong> to allow maximal chaining and to increase <code>operator</code> parallelism</li>
<li>NEVER: the <code>operator</code> is not chained to a predecessor <code>operator</code>, nor does it act as the predecessor of another <code>operator</code></li>
<li>HEAD: the <code>operator</code> has no predecessor <code>operator</code>, but can act as the <code>chain header</code> for another <code>operator</code></li>
</ul>Built-in operator implementations<br>StreamCounter<p>An element counter; nothing special.</p>StreamProject<p><p>Note that "project" here does not mean a project in the everyday sense of an undertaking, but a <code>projection</code>. It is analogous to the <code>SELECT</code> clause in SQL, so it lets you pick the set of <code>fields</code> you need. 
The fields are specified by a field index array passed to its constructor.</p></p>
<p>In the <code>processElement</code> method, it loops over all the required field indexes, extracts the corresponding fields from the element, puts them into an output tuple <code>outTuple</code>, and then emits it:</p>
<pre class="prettyprint"><code class="hljs java">public void processElement(StreamRecord&lt;IN&gt; element) throws Exception {
    for (int i = 0; i &lt; this.numFields; i++) {
        outTuple.setField(((Tuple) element.getValue()).getField(fields[i]), i);
    }
    output.collect(element.replace(outTuple));
}</code></pre>StreamFilter<p>The <code>filter operator</code>. Its processing logic is very simple: each element is tested with the user-defined <code>FilterFunction</code>, and if it satisfies the filter condition the element is emitted.</p>StreamMap<p>The <code>map operator</code>. It applies the supplied <code>MapFunction</code> to each element and emits the result.</p>StreamFlatMap<p>The <code>flatmap operator</code>, which takes a <code>FlatMapFunction</code>. It has one special feature: in its <code>open</code> method it initializes a <code>TimestampedCollector</code> and passes it to the <code>FlatMapFunction</code> as the <code>collector</code>. This <code>collector</code> is dedicated to the <code>userFunction</code> and sets timestamps on the elements it emits.</p>StreamGroupedFold<p><p>The <code>fold operator</code>. Executing the <code>fold</code> function depends on an initial value, <code>initialValue</code>, so <strong>state</strong> has to be kept, and that state is tied to a specific partition. 
Therefore, the <code>open</code> method has to obtain the <code>ValueState</code> associated with the partition:</p></p>
<pre class="prettyprint"><code class="hljs java">ValueStateDescriptor&lt;OUT&gt; stateId = new ValueStateDescriptor&lt;&gt;(STATE_NAME, outTypeSerializer, null);
values = getPartitionedState(stateId);</code></pre>
<p>The <code>processElement</code> method performs a sequence of steps: it reads the value from the <code>ValueState</code>, uses it as the "new" <strong>initial value</strong> when applying the <code>fold</code> function to the current element, updates the <code>ValueState</code> with the result, and then emits the result.</p>StreamGroupedReduce<p>The <code>operator</code> that performs a <code>reduce</code> per group.</p>
<p>Based on a specific state name:</p>
<pre class="prettyprint"><code class="hljs java">private static final String STATE_NAME = "_op_state";</code></pre>
<p>it builds the state ID:</p>
<pre class="prettyprint"><code class="hljs java">ValueStateDescriptor&lt;IN&gt; stateId = new ValueStateDescriptor&lt;&gt;(STATE_NAME, serializer, null);</code></pre>
<p>and then obtains the state:</p>
<pre class="prettyprint"><code class="hljs java">values = getPartitionedState(stateId);</code></pre>
<p>The two actions above are performed in the <code>open</code> method.</p>
<p>In the <code>processElement</code> method, there are two cases:</p>
<ul> <ul> <li>If a state value already exists, apply <code>reduce</code> to the current value and the previous state value and 
take the result, store it as the new state, and emit it</li> <li>If no state value exists yet, store the current value directly as the state and emit the current value</li> </ul> </ul>StreamSink<p>The <code>sink operator</code>, usually the last <code>operator</code> in a stream-processing pipeline. It takes a <code>SinkFunction</code> instance, and its <code>processElement</code> method calls the <code>invoke</code> method of that instance for each element in turn.</p>Summary<p>This article mainly discussed the implementation of the <code>operator</code>, the runtime form of a <code>transformation</code> in Flink streaming.</p></span>
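The two cases described for the processElement method of StreamGroupedReduce can be sketched outside of Flink as follows. This is only an illustration under stated assumptions, not Flink's code: the class GroupedReduceSketch, its HashMap-based state, and its emitted list are hypothetical stand-ins for the partitioned ValueState and the operator output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for a grouped-reduce operator; this is NOT Flink's StreamGroupedReduce.
final class GroupedReduceSketch<K, T> {
    // Assumption: a plain map plays the role of the partitioned ValueState (one value per key).
    private final Map<K, T> state = new HashMap<>();
    // Assumption: a list plays the role of the operator output.
    private final List<T> emitted = new ArrayList<>();
    private final BinaryOperator<T> reducer;

    GroupedReduceSketch(BinaryOperator<T> reducer) {
        this.reducer = reducer;
    }

    void processElement(K key, T value) {
        T previous = state.get(key);
        // Case 1: a state value exists, so reduce it with the current value.
        // Case 2: no state value yet, so the current value is used as is.
        T result = (previous != null) ? reducer.apply(previous, value) : value;
        state.put(key, result);   // update the state
        emitted.add(result);      // emit the result
    }

    List<T> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        GroupedReduceSketch<String, Integer> op = new GroupedReduceSketch<>(Integer::sum);
        op.processElement("a", 1);
        op.processElement("b", 10);
        op.processElement("a", 2);
        System.out.println(op.emitted());
    }
}
```

Every input element produces exactly one output element: the running reduction for its key. That matches the description above, where both cases end by emitting a value.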

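The alignment step in AutomaticWatermarkContext, currentTime - (currentTime % watermarkInterval), can be tried in isolation. The sketch below assumes nothing beyond that one expression from the article; the class WatermarkAlignmentSketch and the method alignedWatermark are hypothetical names introduced only for this illustration.

```java
// Hypothetical helper that isolates the alignment arithmetic; not part of Flink.
final class WatermarkAlignmentSketch {
    // Rounds a wall-clock time down to the nearest multiple of the interval,
    // so machines with slightly different clocks emit the same watermark value.
    static long alignedWatermark(long currentTimeMillis, long watermarkIntervalMillis) {
        return currentTimeMillis - (currentTimeMillis % watermarkIntervalMillis);
    }

    public static void main(String[] args) {
        long interval = 200L;
        // Two clocks that are 70 ms apart fall into the same 200 ms bucket.
        System.out.println(alignedWatermark(1_000_050L, interval));
        System.out.println(alignedWatermark(1_000_120L, interval));
        // A clock in the next bucket yields the next aligned value.
        System.out.println(alignedWatermark(1_000_210L, interval));
    }
}
```

Rounding down rather than emitting the raw clock value is what keeps watermarks from creeping along at different offsets on machines whose clocks are out of sync.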