1. The concept of watermark
A watermark is a mechanism for measuring the progress of event time; it can be thought of as a hidden attribute carried along with the data itself. Typically, event-time data carries a timestamp, such as 1472693399700 (2016-09-01 09:29:59.700), and the watermark for this record might be:
Watermark(1472693399700) = 1472693396700 (2016-09-01 09:29:56.700)
What does this record's watermark mean? It means: all data with a timestamp less than 1472693396700 (2016-09-01 09:29:56.700) has already arrived.
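As a minimal illustration of the example above (a fixed 3-second lag behind the event timestamp is assumed here purely for illustration; how the lag is actually chosen is described later):

val assumedLag = 3000L // 3 seconds, matching the example above

def watermarkFor(eventTimestamp: Long): Long =
  eventTimestamp - assumedLag

watermarkFor(1472693399700L) // = 1472693396700, i.e. 2016-09-01 09:29:56.700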
The blue dotted and solid lines in the figure represent the watermark time.

2. What is the use of watermarks?
Watermarks are used to handle out-of-order events; correct handling of out-of-order events is usually achieved by combining the watermark mechanism with windows.
We know that there is a delay between the moment an event is generated and the moment it flows through the source into an operator. Although data usually reaches the operator in the chronological order of the events, we cannot rule out out-of-order (or late) elements caused by the network, backpressure, and other factors.
For late elements, however, we cannot wait indefinitely; there must be a mechanism that guarantees that, after a certain time, the window is triggered and computed. That mechanism is the watermark.

3. How watermarks are assigned
Typically, watermarks should be generated immediately after receiving data from the source, but you can also apply a simple map or filter operation after the source and then generate the watermarks.
There are two main ways of generating watermarks:
(1): with periodic watermarks
(2): with punctuated watermarks
The first lets you define a maximum allowed out-of-orderness and is used more often; a brief sketch of the punctuated style is shown below.
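For completeness, here is a minimal sketch of the punctuated style, adapted from the example in the Flink documentation; the MyEvent type and its getCreationTime()/hasWatermarkMarker() methods are illustrative assumptions:

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class PunctuatedAssigner extends AssignerWithPunctuatedWatermarks[MyEvent] {

  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long =
    element.getCreationTime()

  // a watermark is emitted only when the element itself carries a watermark marker
  override def checkAndGetNextWatermark(lastElement: MyEvent, extractedTimestamp: Long): Watermark =
    if (lastElement.hasWatermarkMarker()) new Watermark(extractedTimestamp) else null
}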
We will focus on periodic watermarks. The following shows how to generate a periodic watermark:
/**
 * This generator generates watermarks assuming that elements arrive out of order,
 * but only to a certain degree. The latest elements for a certain timestamp t will
 * arrive at most n milliseconds after the earliest elements for timestamp t.
 */
class BoundedOutOfOrdernessGenerator extends AssignerWithPeriodicWatermarks[MyEvent] {

  val maxOutOfOrderness = 3500L // 3.5 seconds

  var currentMaxTimestamp: Long = _

  override def extractTimestamp(element: MyEvent, previousElementTimestamp: Long): Long = {
    val timestamp = element.getCreationTime()
    currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
    timestamp
  }

  override def getCurrentWatermark(): Watermark = {
    // return the watermark as the current highest timestamp minus the out-of-orderness bound
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
  }
}
The program contains an extractTimestamp method, which extracts the timestamp from the event time carried by the data itself, and a getCurrentWatermark method, which computes the watermark as currentMaxTimestamp - maxOutOfOrderness.
The concept is still somewhat abstract here, so below we use data flowing through a window to demonstrate concretely what timestamp and watermark each element carries, and when the window is triggered.

4. Code for generating and tracking watermarks
4.1 Program description
We receive data from a socket, extract the timestamp and generate the watermark immediately after a map, and then apply a window, to see how changes in the watermark and the event time cause the window to be triggered.

4.2 The code is as follows
import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object WatermarkTest {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage:\nSocketWatermarkTest <hostname> <port>")
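      // The original listing is truncated at this point. The rest of main is sketched
      // below based on the description in sections 4.1 and 4.3; the names hostName and
      // port, and the exact printed strings, are my assumptions rather than the original code.
      System.exit(1)
    }

    val hostName = args(0)
    val port = args(1).toInt

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val input = env.socketTextStream(hostName, port)

    // (2) split each line on "," into a (code, time) tuple
    val inputMap = input.map { line =>
      val arr = line.split(",")
      (arr(0), arr(1).toLong)
    }

    // (3) extract the timestamp and generate the watermark, printing
    // (code, time, formatted time, currentMaxTimestamp, formatted currentMaxTimestamp, watermark)
    val watermarkStream = inputMap.assignTimestampsAndWatermarks(
      new AssignerWithPeriodicWatermarks[(String, Long)] {

        val maxOutOfOrderness = 10000L // maximum allowed out-of-orderness: 10 seconds
        var currentMaxTimestamp = 0L
        // keep the last emitted watermark so it can be printed alongside each record;
        // the framework calls getCurrentWatermark periodically, after records are processed,
        // which is why each record prints the previous watermark (see the note in 4.3)
        var lastWatermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness)
        val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

        override def getCurrentWatermark(): Watermark = {
          lastWatermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness)
          lastWatermark
        }

        override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
          val timestamp = element._2
          currentMaxTimestamp = math.max(timestamp, currentMaxTimestamp)
          println("timestamp:" + element._1 + "," + element._2 + "|" + format.format(element._2) + "," +
            currentMaxTimestamp + "|" + format.format(currentMaxTimestamp) + "," + lastWatermark)
          timestamp
        }
      })

    // (4) a 3-second event-time tumbling window; on firing, output
    // (code, element count, earliest event time, latest event time, window start, window end)
    val window = watermarkStream
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.seconds(3)))
      .apply(new WindowFunction[(String, Long), String, String, TimeWindow] {
        override def apply(key: String, window: TimeWindow,
                           input: Iterable[(String, Long)], out: Collector[String]): Unit = {
          val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
          val times = input.map(_._2).toList.sorted
          out.collect(key + "," + input.size + "," + format.format(times.head) + "," +
            format.format(times.last) + "," + format.format(window.getStart) + "," +
            format.format(window.getEnd))
        }
      })

    window.print()

    env.execute("WatermarkTest")
  }
}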
4.3 Detailed description of the program
(1) Receive data from the socket.
(2) Split each line on the comma and map it into a tuple (code, time).
(3) Extract the timestamp and generate the watermark, printing (code, time, formatted time, currentMaxTimestamp, formatted currentMaxTimestamp, watermark time).
(4) A 3-second event-time tumbling window; when it fires, output (code, number of elements in the window, event time of the earliest element in the window, event time of the latest element in the window, window start time, window end time).
Note: the new AssignerWithPeriodicWatermarks[(String, Long)] contains two methods, one that extracts the timestamp and one that generates the watermark. When a record is processed, the timestamp is extracted first and the watermark is generated afterwards, so the watermark printed with each record is in fact the previous watermark. We will see this in the output below.
The generated job graph:

5. Tracking the watermark through the data
We focus on the watermark and timestamp values and observe, together with the data, when the window is triggered.
First, we open the socket and enter the first record:
000001,1461756862000
The output of the out file is as follows:
timestamp:000001,1461756862000|2016-04-27 19:34:22.000,1461756862000|2016-04-27 19:34:22.000,Watermark @-10000
Look at the watermark value here: -10000, that is, 0 - 10000. This confirms that the program extracts the timestamp first and generates the watermark afterwards, so the watermark printed with each record is really the previous watermark. For easier observation, I have summarized the output as follows:
At this point the watermark, as expected, lags currentMaxTimestamp by 10 seconds. We continue entering data:
At this point, the output reads as follows:
Again, the results are summarized in the following table:
We continue and enter another record:
The output is as follows:
The summary is as follows:
The window has still not been triggered; at this point the watermark is equal to the event time of the first record. So when exactly does the window fire? We enter another record:
Output:
Summary:
OK, the window still did not fire. By now the data we have sent has reached 2016-04-27 19:34:33.000, which is 11 seconds after the earliest record, yet the computation has not started. Does that mean the window only fires 13 (10 + 3) seconds after the first record? The answer is no.
We add one more second and enter:
Output:
Summary:
Let us pause here and note how windows are divided.
Flink divides windows along natural time boundaries. If the window size is 3 seconds, then within one minute the windows are divided as follows:
[00:00:00,00:00:03)
[00:00:03,00:00:06)
...
[00:00:57,00:01:00)
If the window size is 10 seconds, window is divided into the following form:
[00:00:00,00:00:10)
[00:00:10,00:00:20)
...
[00:00:50,00:01:00)
The window boundaries are determined by the system and are independent of the data itself; see the sketch below.
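To make the alignment concrete, here is a minimal sketch (my own illustration, not Flink's internal code) of how the start of such a naturally aligned window can be computed, assuming a window offset of zero:

// aligned window start for a zero offset: round the timestamp down to a multiple of the window size
def windowStartFor(timestamp: Long, windowSizeMillis: Long): Long =
  timestamp - (timestamp % windowSizeMillis)

// e.g. an element with event time 2016-04-27 19:34:22.000 and a 3-second window
// falls into [19:34:21.000, 19:34:24.000)
val start = windowStartFor(1461756862000L, 3000L) // 1461756861000 = 19:34:21.000
val end   = start + 3000L                         // 1461756864000 = 19:34:24.000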
Each incoming record is assigned to a window according to its own event time. If a window contains data, it fires once watermark time >= window_end_time; what ultimately decides whether a window fires is therefore the window_end_time of the window that the record's event time falls into, compared against the current watermark.
In the test above, when the last record arrived, the watermark had risen to 19:34:24, which is exactly the window_end_time of the window containing the first record, so that window fired.
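As a quick sanity check of this explanation, the numbers from the test can be plugged in directly (again only my illustration, using the 10-second bound from the test program):

// first record: 000001,1461756862000 -> event time 2016-04-27 19:34:22.000
val firstEventTime    = 1461756862000L
val windowSize        = 3000L   // 3-second tumbling window
val maxOutOfOrderness = 10000L  // 10-second bound used in the test program

val windowEnd = firstEventTime - (firstEventTime % windowSize) + windowSize
// windowEnd = 1461756864000 = 19:34:24.000

// the window fires once watermark >= windowEnd; since the watermark is
// currentMaxTimestamp - maxOutOfOrderness, some record must carry an event time of at least
// windowEnd + maxOutOfOrderness = 1461756874000 = 19:34:34.000 -- exactly the record
// that finally triggered the first window above
val triggeringEventTime = windowEnd + maxOutOfOrderness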
To verify the window's triggering mechanism, we continue to enter the data:
Output:
Summary:
At this point, the watermark has reached the event time of the second record, but the window has not been triggered, because the watermark has not yet reached the end time of the second record's window. The window for the second record is:
[19:34:24,19:34:27)
That is, the second window fires only once the watermark reaches 19:34:27, which, given the 10-second bound, requires a record with an event time of at least 19:34:37. We continue entering data:
Output:
Summary:
At this point we can see that a window fires only when both of the following conditions are met:
1. watermark time >= window_end_time
2. there is data in [window_start_time, window_end_time)
The window is triggered only when both conditions hold; they can be written as the small predicate below.
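Expressed as code, the firing condition described above amounts to a predicate like the following (an illustration only, not Flink's internal trigger implementation):

// the window fires only when the watermark has passed the window end
// and the window actually contains data
def windowShouldFire(watermarkTime: Long, windowEndTime: Long, elementsInWindow: Int): Boolean =
  watermarkTime >= windowEndTime && elementsInWindow > 0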
Moreover, it should be emphasized that the watermark is a global value, not a per-key value, so even when data arrives for a different key, the watermark still advances. For example:
Input:
000002,1461756879000
Output:
timestamp:000002,1461756879000|2016-04-27 19:34:39.000,1461756879000|2016-04-27 19:34:39.000,Watermark @ 1461756867000
We can see that currentMaxTimestamp also increased.

6. Handling out-of-order data with watermark + window
In the tests above, the data was entered in increasing chronological order. Now we enter some out-of-order (late) data and see how the watermark mechanism, combined with windows, handles the disorder.
Input:
Output:
Summary:
As you can see, although we entered a record with event time 19:34:31, currentMaxTimestamp and the watermark did not change. Recall the conditions given above:
1. watermark time >= window_end_time
2. there is data in [window_start_time, window_end_time)
Since the watermark time (19:34:29) < window_end_time (19:34:33), the window cannot be triggered.
So if we now enter a record with event time 19:34:43, the watermark rises to 19:34:33 and the window should fire. Let us try:
Input:
Output:
Here we can see that the window contains two records, 19:34:31 and 19:34:32, but not the 19:34:33 record, because windows are left-closed, right-open intervals and the 19:34:33 record belongs to the window [19:34:33,19:34:36).
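Plugging these numbers into the same window-alignment logic as before (again just an illustration of mine):

// the late records 19:34:31.000 and 19:34:32.000 both fall into [19:34:30.000, 19:34:33.000)
val lateEventTime     = 1461756871000L // 2016-04-27 19:34:31.000
val windowSize        = 3000L
val maxOutOfOrderness = 10000L

val windowStart = lateEventTime - (lateEventTime % windowSize) // 1461756870000 = 19:34:30.000
val windowEnd   = windowStart + windowSize                     // 1461756873000 = 19:34:33.000

// the watermark reaches 19:34:33.000 only when some record carries an event time of at least
// windowEnd + maxOutOfOrderness = 1461756883000 = 19:34:43.000, which is why the
// 19:34:43 record above is what finally fires this window
val triggeringEventTime = windowEnd + maxOutOfOrderness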
The results above show that, for out-of-order data, Flink can handle disorder within a certain range through the watermark mechanism combined with window operations. So how does Flink deal with data that is late by too much?

7. Handling late elements
We now enter some severely out-of-order data (concretely, records whose event time < watermark time) to test this:
Input:
Output:
We can see that the record we entered has event time 19:34:32, while the watermark has already reached 19:34:33. Since event time < watermark time, this single record immediately triggers a window on its own.

8. Summary
8.1 How does Flink handle out-of-order data?
With the watermark + window mechanism.
Within a window, the input can be sorted by event time, so that the data is processed exactly in event-time order, which handles the out-of-order data.
8.2 When does Flink trigger a window?
1. event time < watermark time (for data that is late by too much)
or
1. watermark time >= window_end_time (for out-of-order and normal data)
2. there is data in [window_start_time, window_end_time)
8.3 How should the maximum out-of-orderness be set?
This has to be decided based on your business and your data. If maxOutOfOrderness is set too small while the data is heavily out of order or late (because of the network and other factors), then many windows end up being triggered by single late records, and correctness suffers too much.
Finally, we look at the relationship between Watermark, Event time, and window through a graph:
9. References
http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/
https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/event_time.html
https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/event_timestamps_watermarks.html
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
http://www.cnblogs.com/fxjwind/p/5627187.html