The general design of flow-type SQL in calcite flow sqlcalcite
The overall syntax should be compatible with SQL, which is consistent with the current trend of stream processing SQL.
If some of the features are not included in standard SQL, try to use the industry benchmark (Oracle). For example, the function of pattern matching, the current stream processing has not yet reached a consensus on the syntax, then in the design, the use of Oracle Data Warehouse match recognize way. There are also sliding window functions.
If there is no current industry benchmark, then the function of the way to expand, tumbling windows and bounce windows, these two windows are not included in standard SQL, so the use of ceil,tumble,hop functions such as the way to achieve the function.
The general idea is to do as little as possible on the basis of compatible standard SQL, to ensure that semantics are consistent with standard SQL and to minimize the syntax of privatization .
Calcite StreamSQL Description
- Explicitly defined in the DDL schema is a stream or table, TODO: No examples on official website, to be supplemented
For example, there are three schemas:
Orders (Rowtime, ProductId, orderId, Units)-both the table and the stream
Products (Rowtime, productId, name)-table
Shipments (Rowtime, orderId)-Flow
- If the query contains the Stream keyword, it is a streaming query, if not included, is a table query. A table query can return results immediately and end, and streaming queries will only output results but not end.
For example, here is a sample stream query:
SELECT STREAM *FROM Orders; rowtime | productId | orderId | units----------+-----------+---------+------- 10:17:00 | 30 | 5 | 4 10:17:05 | 10 | 6 | 1 10:18:05 | 20 | 7 | 2 10:18:07 | 30 | 8 | 20 11:02:00 | 10 | 9 | 6 11:04:00 | 10 | 10 | 1 11:09:30 | 40 | 11 | 12 11:24:11 | 10 | 12 | 4
Table Query Example:
SELECT *FROM Orders; rowtime | productId | orderId | units----------+-----------+---------+------- 08:30:00 | 10 | 1 | 3 08:45:10 | 20 | 2 | 1 09:12:21 | 10 | 3 | 10 09:27:44 | 30 | 4 | 24 records returned.
Stream and table queries cannot be mixed, or they will be an error
SELECT * FROM Shipments;ERROR: Cannot convert stream ‘SHIPMENTS‘ to a tableSELECT STREAM * FROM Products;ERROR: Cannot convert table ‘PRODUCTS‘ to a stream
- Other filtering, sorting, having, and so on, are consistent with standard SQL, no longer an example. The
- subquery only needs to write the Stream keyword in the outer statement, and the inner layer writes invalid.
such as:
select STREAM rowtime, productId from (select tumble_end (rowtime, interval ' 1 ' hour ) as Rowtime, ProductId, count (*) as C, sum (Units) as su from Orders group by tumble (rowtime, interval ' 1 ' hour ), productId) where C > 2 or su > 10 ; Rowtime | ProductId----------+-----------10:00:00 | 30 11:00:00 | 10 11:00:00 | 40
Window function Description Tumbling window (Tumbling window)
Data does not overlap between two windows.
SELECT STREAM Ceil (rowtime to hour ) as rowtime, productId, count (*) as C, sum (Units) as Unitsfrom ordersgroup by Ceil (rowtime to hour ), productId; Rowtime | ProductId | C | Units----------+-----------+---------+------- 11:00:00 | 30 | 2 | 24 11:00:00 | 10 | 1 | 1 11:00:00 | 20 | 1 | 7 12:00:00 | 10 | 3 | 11 12:00:00 | 40 | 1 |
The example outputs the statistical results of this hour at the end of each hour, 11 o'clock, and outputs 1 points. Event-driven, internal timer not included.
The following example is equivalent to the above example
SELECT STREAM tumble_end (rowtime, INTERVAL ' 1 ' HOUR) as Rowtime, ProductId, COUNT(*) as C, SUM(Units) as units from OrdersGROUP by Tumble (Rowtime, INTERVAL ' 1 ' HOUR), productId; Rowtime | ProductId | C | Units----------+-----------+---------+-------11:00:00 | 30 | 2 | 24 11:00:00 | 10 | 1 | 1 11:00:00 | 20 | 1 | 7 12:00:00 | 10 | 3 | 11 12:00:00 | 40 | 1 | 12
Another example, it takes less than half an hour to output a result, to 12 minutes for the alignment time,
SELECT STREAM tumble_end (rowtime, INTERVAL ' MINUTE, time ' 0:12 ') as Rowtime, ProductId, COUNT(*) as C, SUM(Units) as units< C12>from OrdersGROUP by Tumble (Rowtime, INTERVAL ' MINUTE , time ' 0:12 '), productId;Rowtime | ProductId | C | Units----------+-----------+---------+-------10:42:00 | 30 | 2 | 24 10:42:00 | 10 | 1 | 1 10:42:00 | 20 | 1 | 7 11:12:00 | 10 | 2 | 7 11:12:00 | 40 | 1 | 12 11:42:00 | 10 | 1 | 4
Bounce window (HOP window)
There is a certain overlap between the data in the two windows.
Bounce windows are generalized tumbling windows that allow data to be stored in a window for longer periods of time.
For example, the data output time is 11:00, but it also contains data from 08:00 to 11:00, and 09:00 to 12:00, and an input row for three output lines.
SELECT STREAM hop_end (rowtime, INTERVAL ' 1 ' HOUR, INTERVAL ' 3 ' HOUR) as Rowtime, COUNT(*) as C, SUM(Units) as units< C13>from OrdersGROUP by HOP (Rowtime, INTERVAL ' 1 ' HOUR, INTERVAL ' 3 ' HOUR);Rowtime | C | Units----------+----------+-------11:00:00 | 4 | 27 12:00:00 | 8 | 50
Sliding window (sliding window)
The sliding window in calcite uses the standard over mode, which directly applies the analytic functions in standard SQL.
SELECT STREAM rowtime, productId, units, SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL ‘1‘ HOUR PRECEDING) unitsLastHourFROM Orders;
The following example shows the average order size for the past 10 minutes and the comparison data for last week's average order
select STREAM *FROM (select STREAM rowtime, productId, units, avg (Units) over product (RANGE interval ' 10 ' minute preceding) as M10, avg (Units) over product (RANGE interval ' 7 ' day preceding) as D7 Span class= "Hljs-keyword" >from Orders WINDOW product as (order by rowtime PARTITION by productId)) where M10 > D7;
In this example, the window clause is used to define the partial windows, and then the refinement is done in each of the over clauses. For the first time, you can also define all the windows in the window clause.
In this way, two tables, 10-minute and 7-day window data are maintained in the background. You can access these tables directly and do not need to make the displayed queries.
This syntax can also implement some other features:
* Row Group window
* References to rows that have not yet arrived, the stream waits until they arrive.
* Can support other rank and other statistical analysis functions
Cascade Window (Overlay window)
The following query shows a scenario that returns the statistical results for each record, but the result is reset at a fixed time.
SELECT STREAM rowtime, productId, units, SUM(units) OVER (PARTITION BY FLOOR(rowtime TO HOUR)) AS unitsSinceTopOfHourFROM Orders;
This is similar to a sliding window query, but the monotone expression occurs in the partition by clause. As time goes from 10:59:59 to 11:00:00,floor from 10:00:00 to 11:00:00, a new grouping starts to occur. The consolidated result of sum begins to reset.
Calcite knows that the old grouping will never be used again, so all the statistical results of that grouping are removed from the internal store.
Row Group window
Using the window syntax and over mode can be done.
Monotonous and quasi-monotonous
This is the concept put forward by the author in the StreamSQL of calcite.
If a column or expression is incremented or decremented, it becomes monotonous.
If a column or expression is out of order and there is a mechanism (such as a punctuation mark or watermark) to generate a particular value that will never be seen, then the column or expression is quasi-monotonic.
The concept is very south, but in fact it is required that the flow of data is globally ordered. This can be the order of events, or the order of event IDs. In general, we automatically make time for events.
With this order, we can easily implement the watermark function.
Description of the Stream and table
create VIEW Hourlyordertotals (Rowtime, ProductId, C, su) as select tumble_end (rowtime, interval ' 1 ' hour ), productId, count (*), sum (Units) from Orders group Span class= "Hljs-keyword" >by tumble (rowtime, interval ' 1 ' hour ), productId; SELECT STREAM rowtime, productId from hourlyordertotalsWHERE C > 2 OR su > ten;Rowtime | ProductId----------+-----------10:00:00 | 30 11:00:00 | 10 11:00:00 | 40
Look at the view above, is this a table or a stream?
Because it doesn't use the stream keyword, it's bound to be a relationship, a table, but it's a table that can be converted to a stream . You can use it in queries for flows and relationships.
And it's equivalent to the query:
With Hourlyordertotals (Rowtime, ProductId, C, Su) as ( SELECT tumble_end (rowtime, INTERVAL ' 1 ' HOUR), productId, COUNT(*), SUM (units) from Orders GROUP by Tumble (Rowtime, INTERVAL ' 1 ' HOUR), productId) SELECT STREAM rowtime, ProductId from hourlyordertotalsWHERE C > 2 OR su > C18>10;Rowtime | ProductId----------+-----------10:00:00 | 30 11:00:00 | 10 11:00:00 | 40
This approach is not limited to subqueries and views, and each query in streaming SQL is defined as a relational query and is converted to a stream using the Stream keyword in the topmost select clause .
The join on the stream is divided into two types, the join of the stream and the table, and the join of the stream and the stream
A join on a stream is actually a join of a window and a window, or a join between a window and a table, essentially a join between tables, because a window is a table .
For example, the following join between a stream and a table
SELECT STREAM o.rowtime, O.productid, O.orderid, O.units, P.name, P.unitprice from Orders as
oJOIN products as P on o.productid = P.productid;Rowtime | ProductId | OrderId | Units | name | UnitPrice----------+-----------+---------+-------+ -------+-----------10:17:00 | 30 | 5 | 4 | Cheese | 17 10:17:05 | 10 | 6 | 1 | Beer | 0.25 10:18:05 | 20 | 7 | 2 | Wine | 6 10:18:07 | 30 | 8 | 20 | Cheese | 17 11:02:00 | 10 | 9 | 6 | Beer | 0.25 11:04:00 | 10 | 10 | 1 | Beer | 0.25 11:09:30 | 40 | 11 | 12 | Bread | 100 11:24:11 | 10 | 12 | 4 | Beer | 0.25
Order is a stream, and products are tables. After two joins the result is definitely a stream, and then, because there is no window, by default it should be a window that only holds the current data for a length of 1, and the current order data and products do join.
The join of streams and streams is as follows:
SELECT STREAM o.rowtime, O.productid, O.orderid, s.rowtime as shiptimefrom Orders as ojoin Shipments as s on o.orderid = S.orderid and s.rowtime between O.rowtime and o.rowtime + interval ' 1 ' hour ; Rowtime | ProductId | OrderId | Shiptime----------+-----------+---------+---------- 10:17:00 | 30 | 5 | 10:55:00-10:17:05 | 10 | 6 | 10:20:00-11:02:00 | 10 | 9 | 11:58:00-11:24:11 | 10 | 12 | 11:44:00
There is no explicit definition window in this query, but the data range is actually locked by the where condition. In other words, the data is automatically saved in a window.
DML statements
You can use the CREATE VIEW statement for creating views, which you can already see, and you can use the Insert as Select method to import data from the stream into other streams.
Like what:
CREATE VIEW LargeOrders ASSELECT STREAM * FROM Orders WHERE units > 1000;INSERT INTO LargeOrdersSELECT STREAM * FROM Orders WHERE units > 1000;
You can also maintain window data through Upsert statements.
UPSERT INTO OrdersSummarySELECT STREAM productId, COUNT(*) OVER lastHour AS cFROM OrdersWINDOW lastHour AS ( PARTITION BY productId ORDER BY rowtime RANGE INTERVAL ‘1‘ HOUR PRECEDING)
The development plan has been completed
- SELECT on stream, WHERE, GROUP by, have, UNION all, ORDER by
- Floor and Ceil functions
- Monotonicity (monotonicity)
- Disabling streaming result Sets
Like what:
SELECT STREAM * FROM (VALUES (1, ‘abc‘)); ERROR: Cannot stream VALUES
Not completed
- Joins of streams and streams
- Joins of streams and tables
- View-based streaming
- The union All (stream merge) on the stream that contains the order by
- A relational query on a stream
- Window aggregation on a stream (sliding window and Cascade window)
- Ignore stream keywords in views and subqueries
- The order by on the stream cannot contain offset and limit
- Run time check if there is enough history data to query
- Quasi-monotonic – a mechanism is required to declare that once the data has been calculated, it is no longer updated, or after the calculation has been completed, if the latest results are refreshed, then updated. such as the watermark function.
- HOP and tumble functions, as well as auxiliary hop_start, Hop_end, Tumble_start, tumble_end functions
Reference:
Http://calcite.apache.org/docs/stream.html
Streaming SQL in calcite