SASE+: An Agile Language for Kleene Closure over Event Streams
This article is not a sentence-by-sentence translation of the original paper; it extracts the essentials and adds my own understanding. Text in [] is my commentary. Corrections are welcome.
SASE+ is a complex event processing language that supports Kleene closure over event streams. Regular expression matching has been studied extensively, but Kleene closure patterns applied to stream processing have three distinguishing features: relevant event definition, event selection strategy, and termination criteria. These features set them apart from the patterns studied in traditional problems.
1. Three features
Relevant Event Definition: a set of predicates defines which events are relevant. A predicate may constrain an attribute value of a single event, specify how an attribute value compares with that of the previous event, or specify how it compares with an aggregate computed over the previously selected events. [Note: relevant events are the useful events we are interested in; all others are irrelevant events.]
Event Selection Strategy: for an event stream that mixes relevant and irrelevant events, the strategy determines how the Kleene closure selects relevant events from it. Some queries select only contiguous relevant events; others must pick the relevant events out from among irrelevant ones. The latter requires skipping irrelevant events, so the selected relevant events are non-contiguous.
Termination Criteria: these specify when the Kleene closure stops computing. The input is an infinite event stream, and a query may ignore irrelevant events and keep processing as many subsequent events as possible. Termination therefore differs from the traditional setting, where the input is a finite string and no character can be ignored.
2. SASE+ Event Language
2.1 Event stream model
Input event stream: an infinite sequence of events. Each event is atomic and occurs at a single point in time. Each event carries a type name and a set of attribute values, plus a timestamp drawn from a discrete, ordered domain. We assume that the timestamp is assigned by a separate mechanism before the event enters the processing system, and that timestamps are monotonically increasing, so the events are ordered. The timestamp is an implicit attribute of an event: a query can read it but cannot modify it.
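The input model above can be sketched as a small Python data structure. This is only an illustration, not the paper's implementation: the class name Event, the field names, and the helper is_ordered are all assumptions, and "monotonically increasing" is read here as strictly increasing.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Mapping


@dataclass(frozen=True)  # frozen: fields, including the timestamp, are read-only
class Event:
    """One atomic input event (hypothetical names, for illustration only)."""
    type: str                      # event type name, e.g. "STOCK"
    timestamp: int                 # discrete value assigned before processing
    attrs: Mapping[str, Any] = field(default_factory=dict)


def is_ordered(stream: Iterable[Event]) -> bool:
    """Check the assumption that timestamps strictly increase along the stream."""
    previous = None
    for event in stream:
        if previous is not None and event.timestamp <= previous:
            return False
        previous = event.timestamp
    return True
```

Making the dataclass frozen is one simple way to mirror the rule that a query may read the timestamp but not change it.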
Output event stream: a sequence of events, each containing a finite set of attributes. The output model extends the input model by allowing attributes to carry complex data types. Data types can be classified along two dimensions:
1) Atomic versus sequence data types: a value of an atomic data type is single and indivisible; a value of a sequence data type is a sequence, and each element of that sequence can itself be atomic or a sequence.
2) Simple versus combined data types: a simple data type is not defined in terms of other data types; a combined data type is.
The output model therefore has four data types: atomic-simple, atomic-combined, sequence-simple, and sequence-combined. [Roughly speaking, a sequence data type is an array, and an atomic data type is one element of an array. A simple data type is a basic type in a programming language, such as int, char, or float; a combined data type is a struct or class.]
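As a rough illustration of the four categories, here is one value of each kind written in Python. The type PriceVolume and all the values are made up for this sketch; SASE+ itself does not define them.

```python
from typing import List, NamedTuple


class PriceVolume(NamedTuple):
    """A combined type: it is defined in terms of other types."""
    price: float
    volume: int


# One illustrative value per category (all values are invented).
atomic_simple: int = 5400                                    # e.g. a sum of volumes
atomic_combined: PriceVolume = PriceVolume(20.0, 700)        # one (price, volume) pair
sequence_simple: List[float] = [10.0, 12.5, 20.0]            # e.g. an array of prices
sequence_combined: List[PriceVolume] = [                     # array of (price, volume)
    PriceVolume(10.0, 1200),
    PriceVolume(12.5, 1100),
]
```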
2.2 Language overview
The overall structure of SASE + is as follows:
[FROM <input stream>]
PATTERN <pattern structure>
[WHERE <pattern matching condition>]
[WITHIN <sliding window>]
[HAVING <pattern filtering condition>]
RETURN <output specification>
The following is the first example.
Query1 computes the total trading volume of Google stock in the four hours after a piece of bad news appears. The PATTERN clause declares the pattern structure. It uses the SEQ construct to describe a sequence pattern with two components: one refers to an event of type NEWS, the other to a series of events of type STOCK. The latter is marked with "+", meaning one or more such events. Each component binds its events to a variable; here the two variables are a and b[]. The variable of a component that uses Kleene plus must be written with "[]" to indicate an array.
The WHERE clause is optional. It contains value-based predicates that define the relevant events, much like the WHERE clause in SQL. b[i] (i >= 1) refers to each STOCK event in turn. A predicate such as b[i].symbol = 'GOOG' is called an "independent iterator" predicate. The WITHIN clause specifies a time window over the whole pattern, restricting the matched events to occur within four hours.
The PATTERN, WHERE, and WITHIN clauses together fully define a pattern. Evaluating them over the event stream produces pattern matches. Each match consists of a unique sequence of events, stored in a and b[], that satisfies the pattern.
The RETURN clause converts each pattern match into a result event. In this example, b[] acts as an iterator over the event array: it extracts the volume attribute from each event, and the aggregate function sum() is then applied to all extracted values. The aggregate produces an attribute of the "atomic-simple" type.
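What Query1 computes can be approximated in a few lines of Python. This is a simplified sketch under several assumptions: events are plain dicts, the keys "type", "category", "symbol", "volume", and "ts" are invented names, timestamps are in seconds, and only the longest match per NEWS event is reported rather than every match the language semantics would allow.

```python
from typing import Any, Dict, List

FOUR_HOURS = 4 * 60 * 60  # window length in seconds (the time unit is an assumption)


def total_volume_after_bad_news(events: List[Dict[str, Any]]) -> List[int]:
    """For each bad NEWS event a, sum the volume of the GOOG STOCK events b[]
    that follow it within the window."""
    results = []
    for i, a in enumerate(events):
        if a["type"] != "NEWS" or a.get("category") != "bad":
            continue
        b = [
            e for e in events[i + 1:]
            if e["type"] == "STOCK"
            and e.get("symbol") == "GOOG"          # independent iterator predicate
            and e["ts"] - a["ts"] <= FOUR_HOURS    # WITHIN 4 hours
        ]
        if b:  # "+" requires at least one STOCK event
            results.append(sum(e["volume"] for e in b))  # RETURN sum(b[].volume)
    return results
```

The result of sum() is a single number, which corresponds to the "atomic-simple" attribute described above.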
Example 2:
Query2 finds stocks with the following behavior: within one hour the price rises from 10 to 20 while the trading volume stays stable. Skip till next match is an event selection strategy and is discussed later. We first look at the predicates inside {}.
The first predicate, [symbol], requires all relevant STOCK events to have the same value of the symbol attribute; this is called an "equivalence test". It is equivalent to splitting the event stream into partitions by that attribute and then matching the pattern within each partition. [Events with the same attribute value belong to the same partition, and the events of one partition may be interleaved with other events in the stream. "Partition" is equivalent to "group".]
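The partitioning effect of an equivalence test can be sketched as a group-by over the stream. The function name partition_by and the dict-based event representation are assumptions made for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def partition_by(events: List[Dict[str, Any]], attribute: str) -> Dict[Any, List[Dict[str, Any]]]:
    """Split a stream into partitions keyed by one attribute value.
    Stream order is preserved inside each partition."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event[attribute]].append(event)
    return dict(partitions)
```

Pattern matching would then run separately on each partition, for example on partition_by(stream, "symbol")["GOOG"].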
Three more predicates follow. The predicate on a[1] specifies the starting condition of the sequence. The next predicate specifies the relationship between each pair of adjacent events in the sequence; because it compares each event with the previously selected event, it is called a "related iterator" predicate. The last predicate, on a[a.LEN], defines the ending condition of the sequence. [The sequence consists of the selected events, that is, the relevant events.]
After PATTERN, WHERE, and WITHIN produce a pattern match, the HAVING clause filters the matches further using its predicates. Matches that satisfy the HAVING clause are kept and output. The distinction between WHERE and HAVING in SASE+ resembles that in SQL. The one difference is that HAVING in SASE+ is applied to each pattern match, whereas HAVING in SQL is applied to each group created by GROUP BY.
The RETURN clause extracts a[1].symbol and a[].price as the two attributes of the result event. Note that a[].price denotes an array of price values, so it has the "sequence-simple" data type.
Example 3:
Query3 captures a more complex trend for each stock: within the last hour, the trading volume starts high, but after a period in which the price rises or stays relatively stable, the volume drops sharply. The query structure resembles Query2, with several differences. The PATTERN structure has two components of the same event type STOCK, but one is the event array a[] and the other is a single event b. In the WHERE clause, the predicate on a[1] defines the starting condition using the volume attribute. The related iterator predicate on a[i] requires the price of each event to exceed the average price of the previously selected events. An aggregate used this way inside an iterator predicate is called a "running aggregate". In this query no explicit condition is placed on a[a.LEN]. The last predicate compares the volume attribute of b with that of a[a.LEN].
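The running aggregate in the iterator predicate can be maintained incrementally instead of being recomputed for every event. The sketch below assumes a single-stock stream of dicts with "price" and "volume" keys, uses the thresholds mentioned in the text (a[1].volume > 1000 and a[i].price > avg(a[..i-1].price)), and skips events that fail the predicate. The function name is hypothetical.

```python
from typing import Any, Dict, List


def select_with_running_avg(stock_events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Select a[] so that a[1].volume > 1000 and each later a[i] has a price
    above the average price of the events selected before it."""
    selected: List[Dict[str, Any]] = []
    price_sum = 0.0  # running aggregate state: sum of the selected prices
    for event in stock_events:
        if not selected:
            if event["volume"] > 1000:          # starting condition on a[1]
                selected.append(event)
                price_sum = event["price"]
        elif event["price"] > price_sum / len(selected):  # avg(a[..i-1].price)
            selected.append(event)
            price_sum += event["price"]
        # otherwise the event is irrelevant and is skipped
    return selected
```

Keeping only the sum and the count means each new event is checked in constant time.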
For each pattern match composed of a[] and b, RETURN converts it into a result event with three attributes. a[].(price, volume) means that for each event in a[], the price and volume attribute values are selected and turned into a value of a combined data type; the parentheses () mark it as combined. Once all events have been processed, the final a[].(price, volume) has the "sequence-combined" data type. b.(price, volume) has the "atomic-combined" data type.
The above describes the basic structure of SASE+. The predicates in the WHERE clause define the events relevant to the closure. These predicates form the first dimension of the closure's definition: relevant event definition.
The harder part is the second dimension: the event selection strategy.
2.2.1 Event selection strategy
The event selection strategy specifies how to select relevant events from an event stream in which relevant and irrelevant events are mixed.
Look at Query2 again. It uses the equivalence test [symbol], whose full form is a[i].symbol = a[i-1].symbol. It also uses the iterator predicate a[i].price > a[i-1].price. Both define a relationship between the values of two events. What is left unspecified is how the two relevant events a[i] and a[i-1] are positioned relative to each other in the input stream. The possible positional relationships are:
Strict contiguity: the two relevant events must be adjacent in the input stream, that is, no other event may occur between them.
Partition contiguity: the two relevant events need not be adjacent in the input stream. However, once the stream is divided into partitions, the next relevant event must be adjacent to the previous one within the same partition. For example, in Query2 the equivalence test forms one partition per stock symbol, and the query then captures a monotonically increasing price trend within each partition. [A partition is a group formed by a certain attribute value!] The condition used to form partitions can be any combination of predicates, but it cannot use aggregate functions. This is richer than an equivalence test (which is equivalent to GROUP BY in SQL or PARTITION BY in CQL).
Skip till next match: under this strategy, the two relevant events need not be contiguous even within a partition. When selecting the next event, irrelevant events are ignored, and each incoming event is compared only with the events already selected. The closure can thus continue until the termination criteria are met. Under this strategy Query2 takes on a different meaning: it ignores the fluctuating values in between when capturing the price trend from 10 to 20.
Skip till any match: this strategy relaxes event selection further. For each incoming event it makes a nondeterministic decision: either add the event to the Kleene closure or ignore it. This is useful when some relevant events must be ignored so that the closure can extend further and yield a longer event sequence. For example, given the stock price sequence "1, 2, 7, 3, 4, 5, 6", the longest sequence of increasing prices is "1, 2, 3, 4, 5, 6". Obtaining it requires ignoring 7, even though 7 is also a relevant event.
In SASE+, the default event selection strategy is skip till next match, so a query such as Query1 does not need to declare it explicitly. When partition contiguity is used, the equivalence test in the query serves as the default partitioning condition; other predicates can also be used to define partitions.
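To make the difference between the strategies concrete, the sketch below applies three of them to a list of prices from a single stock, with "price higher than the previously selected price" as the only relevance test. It is a toy model, not the SASE+ evaluation algorithm: the function names are invented, every run is forced to start at the first price, and partition contiguity is omitted because it amounts to strict contiguity applied inside each partition.

```python
from typing import List


def strict_contiguity(prices: List[int]) -> List[int]:
    """Extend the run only with adjacent events; stop at the first that fails."""
    run = prices[:1]
    for p in prices[1:]:
        if p > run[-1]:
            run.append(p)
        else:
            break
    return run


def skip_till_next_match(prices: List[int]) -> List[int]:
    """Skip events that fail the test; always take the next one that passes."""
    run = prices[:1]
    for p in prices[1:]:
        if p > run[-1]:
            run.append(p)
    return run


def skip_till_any_match(prices: List[int]) -> List[List[int]]:
    """Nondeterministic choice: each event that can extend a run forks it,
    one copy taking the event and the original skipping it."""
    runs = [prices[:1]]
    for p in prices[1:]:
        runs += [run + [p] for run in runs if p > run[-1]]
    return runs
```

For the sequence 1, 2, 7, 3, 4, 5, 6 from the text, skip_till_next_match returns [1, 2, 7] because nothing after 7 is larger, while the longest run produced by skip_till_any_match is [1, 2, 3, 4, 5, 6]. The number of runs kept by skip_till_any_match can grow exponentially with the input, which is why this sketch is only suitable for tiny inputs.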
2.2.2 Termination criteria
Termination criteria are the third dimension in the definition of the Kleene closure.
The condition on the last selected event: Query2 illustrates this type of termination. The predicate on a[a.LEN] in the WHERE clause specifies a condition on the last selected event. By itself, this predicate does not say what to do when such an event is seen: terminate, or continue and produce more matches? Consider the stock price changes in figure (a): a pattern match is produced from the sequence running from point 1 to point 2. However, later points may also have a price equal to 20, such as point 3, so the closure can go on to produce more matches. Queries that do not care about such extra data (point 3) can turn the a[a.LEN] predicate into a termination criterion by adding "!".
The next component in the pattern: Query3 illustrates this type of termination. The query does not place a condition on a[a.LEN] as Query2 does; instead it uses b as a following component. Consider figures (a) and (b) together: (a) shows the price changes and (b) the volume changes. The Kleene closure starts at point 1 (where volume > 1000) and then proceeds to point 2 (by checking a[i].price > avg(a[..i-1].price)). The event at point 3 satisfies the predicate on b (b.volume < 80% * a[a.LEN].volume), so a pattern match is produced. However, the event at point 3 can also be used to continue the Kleene closure, because it also satisfies a[i].price > avg(a[..i-1].price); this leads to another pattern match that ends later. In SASE+, when the next pattern component matches, adding "!" forces the closure to terminate, while omitting "!" lets the closure continue and produce further matches.
Window constraint: a time window can also be used to terminate the Kleene closure. This is the simplest case.
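The effect of the "!" marker on the first termination type can be sketched as a flag on a small matcher. This is an illustration under assumptions, not SASE+ semantics in full: the function closure_matches and its parameters are invented, the predicates are passed in as plain Python functions, and events are selected with skip till next match.

```python
from typing import Any, Callable, List


def closure_matches(
    events: List[Any],
    begin: Callable[[Any], bool],
    step: Callable[[Any, Any], bool],
    end: Callable[[Any], bool],
    stop_at_first_end: bool,
) -> List[List[Any]]:
    """Emit a match each time the last selected event satisfies `end`.
    With stop_at_first_end=True (the role played by "!"), stop after the first."""
    selected: List[Any] = []
    matches: List[List[Any]] = []
    for event in events:
        if not selected:
            if not begin(event):
                continue
            selected.append(event)
        elif step(selected[-1], event):
            selected.append(event)
        else:
            continue  # irrelevant event: skip it
        if end(selected[-1]):
            matches.append(list(selected))
            if stop_at_first_end:
                break
    return matches
```

With prices [10, 15, 20, 18, 20], a starting condition of price 10, a non-decreasing step, and an ending condition of price 20, the flag set to True yields the single match [10, 15, 20]; set to False, the closure continues and also yields [10, 15, 20, 20].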
2.2.3 Negation in the Kleene closure
Query4 provides an example. Items are shipped from New York to Amherst and are scanned at each point along the route. The query monitors for anomalies during shipment. Specifically, an item is scanned at the origin in New York and again at the destination in Amherst, but it counts as an exception if it was not scanned continuously along the way or if there are more than three scan points. The AS construct is used in the RETURN clause to rename the result. The "in stream" construct names the output stream so that later queries can refer to it.
3. Formal semantic model
The semantic model is based on a nondeterministic finite automaton (NFA) combined with a match buffer. This combination of automaton and buffer is more powerful than a standard NFA.
To be continued ......