Streaming SQL in calcite

Source: Internet
Author: User
Tags benchmark

The general design of flow-type SQL in calcite flow sqlcalcite

The overall syntax should be compatible with SQL, which is consistent with the current trend of stream processing SQL.
If some of the features are not included in standard SQL, try to use the industry benchmark (Oracle). For example, the function of pattern matching, the current stream processing has not yet reached a consensus on the syntax, then in the design, the use of Oracle Data Warehouse match recognize way. There are also sliding window functions.
If there is no current industry benchmark, then the function of the way to expand, tumbling windows and bounce windows, these two windows are not included in standard SQL, so the use of ceil,tumble,hop functions such as the way to achieve the function.
The general idea is to do as little as possible on the basis of compatible standard SQL, to ensure that semantics are consistent with standard SQL and to minimize the syntax of privatization .

Calcite StreamSQL Description
    1. Explicitly defined in the DDL schema is a stream or table, TODO: No examples on official website, to be supplemented
      For example, there are three schemas:
      Orders (Rowtime, ProductId, orderId, Units)-both the table and the stream
      Products (Rowtime, productId, name)-table
      Shipments (Rowtime, orderId)-Flow
    2. If the query contains the Stream keyword, it is a streaming query, if not included, is a table query. A table query can return results immediately and end, and streaming queries will only output results but not end.
      For example, here is a sample stream query:
SELECT STREAM *FROM Orders;  rowtime | productId | orderId | units----------+-----------+---------+------- 10:17:00 |        30 |       5 |     4 10:17:05 |        10 |       6 |     1 10:18:05 |        20 |       7 |     2 10:18:07 |        30 |       8 |    20 11:02:00 |        10 |       9 |     6 11:04:00 |        10 |      10 |     1 11:09:30 |        40 |      11 |    12 11:24:11 |        10 |      12 |     4

Table Query Example:

SELECT *FROM Orders;  rowtime | productId | orderId | units----------+-----------+---------+------- 08:30:00 |        10 |       1 |     3 08:45:10 |        20 |       2 |     1 09:12:21 |        10 |       3 |    10 09:27:44 |        30 |       4 |     24 records returned.

Stream and table queries cannot be mixed, or they will be an error

SELECT * FROM Shipments;ERROR: Cannot convert stream ‘SHIPMENTS‘ to a tableSELECT STREAM * FROM Products;ERROR: Cannot convert table ‘PRODUCTS‘ to a stream
    1. Other filtering, sorting, having, and so on, are consistent with standard SQL, no longer an example. The
    2. subquery only needs to write the Stream keyword in the outer statement, and the inner layer writes invalid.
      such as:
select  STREAM rowtime, productId from  (select  tumble_end (rowtime, interval   ' 1 '  hour ) as  Rowtime, ProductId, count  (*) as  C, sum  (Units) as  su from  Orders group  by  tumble (rowtime, interval   ' 1 '  hour  ), productId) where  C > 2  or  su > 10 ; Rowtime | ProductId----------+-----------10:00:00 |        30 11:00:00 |        10 11:00:00 | 40
Window function Description Tumbling window (Tumbling window)

Data does not overlap between two windows.

SELECT  STREAM Ceil (rowtime to  hour ) as  rowtime, productId, count  (*) as  C, sum  (Units) as  Unitsfrom  ordersgroup  by  Ceil (rowtime to  hour ), productId;  Rowtime | ProductId | C | Units----------+-----------+---------+------- 11:00:00 | 30 | 2 | 24 11:00:00 | 10 | 1 | 1 11:00:00 | 20 | 1 | 7 12:00:00 | 10 | 3 | 11 12:00:00 | 40 | 1 | 

The example outputs the statistical results of this hour at the end of each hour, 11 o'clock, and outputs 1 points. Event-driven, internal timer not included.
The following example is equivalent to the above example

 SELECT STREAM tumble_end (rowtime, INTERVAL ' 1 ' HOUR)  as Rowtime, ProductId, COUNT(*) as C, SUM(Units)  as units from  OrdersGROUP by Tumble (Rowtime, INTERVAL ' 1 ' HOUR), productId;  Rowtime |       ProductId | C | Units----------+-----------+---------+-------11:00:00 |       30 |    2 |        24 11:00:00 |       10 |     1 |        1 11:00:00 |       20 |     1 |        7 12:00:00 |       10 |    3 |        11 12:00:00 |       40 |    1 | 12

Another example, it takes less than half an hour to output a result, to 12 minutes for the alignment time,

 SELECT STREAM tumble_end (rowtime, INTERVAL ' MINUTE, time  ' 0:12 ')  as Rowtime, ProductId, COUNT(*)  as C, SUM(Units)  as units< C12>from OrdersGROUP  by Tumble (Rowtime, INTERVAL ' MINUTE ,  time ' 0:12 '), productId;Rowtime |       ProductId | C | Units----------+-----------+---------+-------10:42:00 |       30 |    2 |        24 10:42:00 |       10 |     1 |        1 10:42:00 |       20 |     1 |        7 11:12:00 |       10 |     2 |        7 11:12:00 |       40 |    1 |        12 11:42:00 |       10 |     1 | 4
Bounce window (HOP window)

There is a certain overlap between the data in the two windows.
Bounce windows are generalized tumbling windows that allow data to be stored in a window for longer periods of time.
For example, the data output time is 11:00, but it also contains data from 08:00 to 11:00, and 09:00 to 12:00, and an input row for three output lines.

 SELECT STREAM hop_end (rowtime, INTERVAL ' 1 ' HOUR, INTERVAL ' 3 '  HOUR)  as Rowtime, COUNT(*)  as C, SUM(Units)  as units< C13>from OrdersGROUP  by HOP (Rowtime, INTERVAL ' 1 ' HOUR, INTERVAL ' 3 ' HOUR);Rowtime | C | Units----------+----------+-------11:00:00 |    4 |        27 12:00:00 |    8 | 50
Sliding window (sliding window)

The sliding window in calcite uses the standard over mode, which directly applies the analytic functions in standard SQL.

SELECT STREAM rowtime,  productId,  units,  SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL ‘1‘ HOUR PRECEDING) unitsLastHourFROM Orders;

The following example shows the average order size for the past 10 minutes and the comparison data for last week's average order

select  STREAM *FROM  (select  STREAM rowtime, productId, units, avg  (Units) over product (RANGE interval   ' 10 '  minute  preceding) as  M10, avg  (Units) over product (RANGE interval   ' 7 '  day  preceding) as  D7 Span class= "Hljs-keyword" >from  Orders WINDOW product as  (order  by  rowtime PARTITION by  productId)) where  M10 > D7; 

In this example, the window clause is used to define the partial windows, and then the refinement is done in each of the over clauses. For the first time, you can also define all the windows in the window clause.
In this way, two tables, 10-minute and 7-day window data are maintained in the background. You can access these tables directly and do not need to make the displayed queries.

This syntax can also implement some other features:
* Row Group window
* References to rows that have not yet arrived, the stream waits until they arrive.
* Can support other rank and other statistical analysis functions

Cascade Window (Overlay window)

The following query shows a scenario that returns the statistical results for each record, but the result is reset at a fixed time.

SELECT STREAM rowtime,  productId,  units,  SUM(units) OVER (PARTITION BY FLOOR(rowtime TO HOUR)) AS unitsSinceTopOfHourFROM Orders;

This is similar to a sliding window query, but the monotone expression occurs in the partition by clause. As time goes from 10:59:59 to 11:00:00,floor from 10:00:00 to 11:00:00, a new grouping starts to occur. The consolidated result of sum begins to reset.
Calcite knows that the old grouping will never be used again, so all the statistical results of that grouping are removed from the internal store.

Row Group window

Using the window syntax and over mode can be done.

Monotonous and quasi-monotonous

This is the concept put forward by the author in the StreamSQL of calcite.
If a column or expression is incremented or decremented, it becomes monotonous.
If a column or expression is out of order and there is a mechanism (such as a punctuation mark or watermark) to generate a particular value that will never be seen, then the column or expression is quasi-monotonic.
The concept is very south, but in fact it is required that the flow of data is globally ordered. This can be the order of events, or the order of event IDs. In general, we automatically make time for events.
With this order, we can easily implement the watermark function.

Description of the Stream and table
create  VIEW  Hourlyordertotals (Rowtime, ProductId, C, su) as  select  tumble_end (rowtime, interval   ' 1 '  hour ), productId, count  (*), sum  (Units) from  Orders group  Span class= "Hljs-keyword" >by  tumble (rowtime, interval   ' 1 '  hour ), productId;  SELECT STREAM rowtime, productId from hourlyordertotalsWHERE C > 2 OR su > ten;Rowtime | ProductId----------+-----------10:00:00 |        30 11:00:00 |        10 11:00:00 | 40

Look at the view above, is this a table or a stream?
Because it doesn't use the stream keyword, it's bound to be a relationship, a table, but it's a table that can be converted to a stream . You can use it in queries for flows and relationships.
And it's equivalent to the query:

With Hourlyordertotals (Rowtime, ProductId, C, Su) as ( SELECT tumble_end (rowtime, INTERVAL ' 1 ' HOUR), productId, COUNT(*), SUM  (units)  from Orders GROUP  by Tumble (Rowtime, INTERVAL ' 1 ' HOUR), productId) SELECT STREAM rowtime, ProductId from hourlyordertotalsWHERE C > 2 OR su > C18>10;Rowtime | ProductId----------+-----------10:00:00 |        30 11:00:00 |        10 11:00:00 | 40

This approach is not limited to subqueries and views, and each query in streaming SQL is defined as a relational query and is converted to a stream using the Stream keyword in the topmost select clause .

The join on the stream is divided into two types, the join of the stream and the table, and the join of the stream and the stream
A join on a stream is actually a join of a window and a window, or a join between a window and a table, essentially a join between tables, because a window is a table .
For example, the following join between a stream and a table

  SELECT STREAM o.rowtime, O.productid, O.orderid, O.units, P.name, P.unitprice from Orders as
       oJOIN products  as P  on o.productid = P.productid;Rowtime | ProductId | OrderId | Units | name | UnitPrice----------+-----------+---------+-------+ -------+-----------10:17:00 |       30 |     5 | 4 |        Cheese |        17 10:17:05 |       10 |     6 | 1 |      Beer |        0.25 10:18:05 |       20 |     7 | 2 |         Wine |        6 10:18:07 |       30 |    8 | 20 |        Cheese |        17 11:02:00 |       10 |     9 | 6 |      Beer |        0.25 11:04:00 |      10 |     10 | 1 |      Beer |        0.25 11:09:30 |      40 |    11 | 12 |       Bread |        100 11:24:11 |      10 |     12 | 4 |      Beer | 0.25

Order is a stream, and products are tables. After two joins the result is definitely a stream, and then, because there is no window, by default it should be a window that only holds the current data for a length of 1, and the current order data and products do join.
The join of streams and streams is as follows:

SELECT  STREAM o.rowtime, O.productid, O.orderid, s.rowtime as  shiptimefrom  Orders as  ojoin  Shipments as  s on  o.orderid = S.orderid and  s.rowtime between O.rowtime and  o.rowtime + interval   ' 1 '  hour   ; Rowtime | ProductId | OrderId |        Shiptime----------+-----------+---------+---------- 10:17:00 |       30 | 5 |        10:55:00-10:17:05 |       10 | 6 |        10:20:00-11:02:00 |       10 | 9 |        11:58:00-11:24:11 |      10 | 12 | 11:44:00  

There is no explicit definition window in this query, but the data range is actually locked by the where condition. In other words, the data is automatically saved in a window.

DML statements

You can use the CREATE VIEW statement for creating views, which you can already see, and you can use the Insert as Select method to import data from the stream into other streams.
Like what:

CREATE VIEW LargeOrders ASSELECT STREAM * FROM Orders WHERE units > 1000;INSERT INTO LargeOrdersSELECT STREAM * FROM Orders WHERE units > 1000;

You can also maintain window data through Upsert statements.

UPSERT INTO OrdersSummarySELECT STREAM productId,  COUNT(*) OVER lastHour AS cFROM OrdersWINDOW lastHour AS (  PARTITION BY productId  ORDER BY rowtime  RANGE INTERVAL ‘1‘ HOUR PRECEDING)
The development plan has been completed
    • SELECT on stream, WHERE, GROUP by, have, UNION all, ORDER by
    • Floor and Ceil functions
    • Monotonicity (monotonicity)
    • Disabling streaming result Sets
      Like what:
    SELECT STREAM * FROM (VALUES (1, ‘abc‘));    ERROR: Cannot stream VALUES
Not completed
    • Joins of streams and streams
    • Joins of streams and tables
    • View-based streaming
    • The union All (stream merge) on the stream that contains the order by
    • A relational query on a stream
    • Window aggregation on a stream (sliding window and Cascade window)
    • Ignore stream keywords in views and subqueries
    • The order by on the stream cannot contain offset and limit
    • Run time check if there is enough history data to query
    • Quasi-monotonic – a mechanism is required to declare that once the data has been calculated, it is no longer updated, or after the calculation has been completed, if the latest results are refreshed, then updated. such as the watermark function.
    • HOP and tumble functions, as well as auxiliary hop_start, Hop_end, Tumble_start, tumble_end functions

Reference:
Http://calcite.apache.org/docs/stream.html

Streaming SQL in calcite

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.