Spark Structured Streaming: Getting Started Programming Guide


Source: http://www.cnblogs.com/cutd/p/6590354.html

Overview

Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL execution engine. You can express a streaming computation the same way you would express a batch computation on static data. As streaming data continues to arrive, the Spark SQL engine runs the computation incrementally and keeps updating the result in the final table. You can use the Dataset/DataFrame API on the Spark SQL engine to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. Finally, the system is fast and fault-tolerant, with end-to-end exactly-once guarantees.

Quick Example

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
Programming Model

The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended to.

Basic Concepts

Consider the input data stream as an "input table". Every data item arriving on the stream is treated as a new row appended to that input table.

"Output" is written to external storage in a different mode: Complete mode: Writes the entire Update table to the external storage, and the way the entire table is written is determined by the storage connector. Append mode: Only new rows appended to the result table since the last trigger are written to the external memory. This applies only to queries that do not change existing rows in the results table. Update mode: Only rows that have been updated in the result table since the last trigger are written to external storage (not yet available in Spark 2.0). Note that this differs from the full mode because this mode does not output unchanged rows. handling event time and latency data

Event time is the time embedded in the data itself. For many applications, you may want to operate on this event time. For example, if you want to get the number of events generated per minute by IoT devices, you probably want to use the time the data was generated (that is, the event time in the data), not the time Spark received it. Event time is very natural in this model: each event from a device is a row in the table, and the event time is a column value in that row. This allows window-based aggregations (such as the number of events per minute) to be just a special kind of grouping and aggregation on the event-time column: each time window is a group, and each row can belong to more than one window/group. As a result, such event-time-window aggregation queries can be defined consistently on both a static dataset (for example, a collected device event log) and on a data stream, which makes life much easier for users.

Furthermore, this model naturally handles data that arrives later than expected based on its event time. Because Spark is in charge of updating the result table, it has full control over updating old aggregations when there is late data, as well as clearing old aggregations to limit the size of intermediate state data. Since Spark 2.1, we support watermarking, which allows users to specify a threshold for late data and allows the engine to clean up old state accordingly. This is explained in more detail later, in the Window Operations section.

Fault Tolerance Semantics

Delivering end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming. To achieve this, we designed the Structured Streaming sources, sinks, and execution engine to reliably track the exact progress of processing, so that any kind of failure can be handled by restarting and/or reprocessing. Every streaming source is assumed to have an offset (similar to Kafka offsets or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.

Using the DataFrame and Dataset APIs

Starting with Spark 2.0, DataFrames and Datasets can represent both static, bounded data and streaming, unbounded data. Similar to static Datasets/DataFrames, you can create streaming DataFrames/Datasets from a streaming source using the common entry point SparkSession (Scala/Java/Python docs), and apply the same operations on them as on static DataFrames/Datasets. If you are unfamiliar with Datasets/DataFrames, it is highly recommended that you familiarize yourself with the DataFrame/Dataset Programming Guide.

Creating Streaming DataFrames and Streaming Datasets

Streaming DataFrames can be created through the DataStreamReader interface (Scala/Java/Python docs) returned by SparkSession.readStream(). Similar to the read interface for creating static DataFrames, you can specify the details of the source: the data format, schema, options, and so on.

Data Sources

In Spark 2.0, there are a few built-in data sources:

File source: Reads files written to a directory as a stream of data. Supported file formats are text, CSV, JSON, and Parquet. See the documentation of the DataStreamReader interface for an up-to-date list, and for the options supported by each file format. Note that files must be placed atomically into the given directory, which in most file systems can be achieved with a file move operation.

Kafka source: Pulls data from Kafka; Kafka broker versions 0.10.0 or higher are supported. See the Kafka Integration Guide for more information.

Socket source (for testing): Reads UTF-8 text data from a socket connection. The listening server socket runs in the driver. Note that this should only be used for testing, as it does not provide end-to-end fault-tolerance guarantees.
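For illustration, here is a minimal sketch of creating streaming DataFrames from the socket source and from a directory of CSV files; the host, port, separator, column names, and directory path are placeholder assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark: SparkSession = ...  // an existing SparkSession

// Read UTF-8 text lines from a socket (for testing only; no fault-tolerance guarantee)
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder host
  .option("port", 9999)          // placeholder port
  .load()

socketDF.isStreaming    // returns true for DataFrames that have streaming sources

socketDF.printSchema()

// Read all CSV files atomically placed into a directory, with an explicit schema
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream
  .option("sep", ";")            // placeholder separator
  .schema(userSchema)            // file sources require a user-specified schema
  .csv("/path/to/directory")     // placeholder path; equivalent to format("csv").load(...)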

These examples generate untyped streaming DataFrames, meaning that the schema of the DataFrame is not checked at compile time, only at runtime when the query is submitted. Some operations, such as map and flatMap, need the type to be known at compile time. To use them, you can convert these untyped streaming DataFrames into typed streaming Datasets using the same methods as for static DataFrames. See the SQL Programming Guide for details. Additionally, more details on the supported streaming sources are discussed later in the document.

Schema Inference and Partitioning of Streaming DataFrames/Datasets

By default, Structured Streaming from file-based sources requires you to specify the schema, rather than relying on Spark to infer it automatically. This restriction ensures that a consistent schema is used for the streaming query, even in the case of failures. For ad-hoc use cases, you can re-enable schema inference by setting spark.sql.streaming.schemaInference to true.

Partition discovery does occur when subdirectories named /key=value/ are present, and listing automatically recurses into those directories. If these columns appear in the user-provided schema, they will be filled in by Spark based on the path of the file being read. The directories that make up the partitioning scheme must exist when the query starts and must remain static. For example, it is okay to add /data/year=2016/ when /data/year=2015/ was present, but it is invalid to change the partitioning column (that is, by creating the directory /data/date=2016-04-17/).
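As a minimal sketch of partition discovery (the /data path, the year partition column, and the value columns are assumptions for illustration), the partition column can be declared in the user-provided schema and is filled in by Spark from the directory names:

import org.apache.spark.sql.types.StructType

// Assumed layout: /data/year=2015/..., /data/year=2016/...
val deviceSchema = new StructType()
  .add("device", "string")
  .add("signal", "double")
  .add("year", "integer")        // partition column, populated from the /year=.../ directories

val partitionedDF = spark.readStream
  .schema(deviceSchema)
  .parquet("/data")              // listing recurses into the year=... subdirectories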

Operations on Streaming DataFrames/Datasets

You can apply all kinds of operations on streaming DataFrames/Datasets, from untyped, SQL-like operations (such as select, where, groupBy) to typed RDD-like operations (such as map, filter, flatMap). See the SQL Programming Guide for details. Let's take a look at a few example operations you can use.

Basic Operations: Selection, Projection, Aggregation

case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData]    // streaming Dataset with IoT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10")      // using untyped APIs
ds.filter(_.signal > 10).map(_.device)        // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("deviceType").count()              // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal))    // using typed API
Window Operations on Event Time

Aggregations over a sliding event-time window are straightforward with Structured Streaming. The key idea of window-based aggregations is very similar to grouped aggregations. In a grouped aggregation, aggregate values (such as counts) are maintained for each unique value in the user-specified grouping column. In window-based aggregations, aggregate values are maintained for each window that a row's event time falls into. Let's understand this with an illustration.

Imagine our quick example is modified so that the stream now contains lines along with the time each line was generated. Instead of running a plain word count, we want to count words within 10-minute windows, updating every 5 minutes. That is, count the words received between 10-minute windows 12:00-12:10, 12:05-12:15, 12:10-12:20, and so on. Note that 12:00-12:10 means data that arrived after 12:00 but before 12:10. Now consider a word received at 12:07. That word should increment the counts for two windows, 12:00-12:10 and 12:05-12:15. So the counts are indexed by both the grouping key (that is, the word) and the window (which can be computed from the event time).
The result table would then contain a running count for each word and window.

Since this windowing is similar to grouping, in code you can use the groupBy() and window() operations to express windowed aggregations, as in the sketch below. You can view the complete code for this example in Scala/Java/Python.
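A minimal sketch of the sliding-window word count described above (the words DataFrame, with a timestamp column and a word column, is assumed to come from an earlier streaming source):

import org.apache.spark.sql.functions.window
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by a 10-minute window sliding every 5 minutes, and by word,
// then count each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()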

Handling Late Data and Watermarking

Now consider what happens if one of the events arrives late to the application. For example, a word generated at 12:04 (that is, its event time) could be received by the application at 12:11. The application should use the time 12:04 rather than 12:11 to update the older count for the window 12:00-12:10. This happens naturally in our window-based grouping: Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time, so that late data can correctly update the aggregates of old windows, as illustrated below.

However, to run this query for days, the system must bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state, because the application will no longer receive late data for that aggregate. To enable this, in Spark 2.1 we introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You define the watermark of a query by specifying the event-time column and a threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine maintains state and allows late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold is aggregated, but data later than the threshold is dropped. Let's understand this with an example. We can easily define watermarking on the previous example using withWatermark(), as shown below.

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()

In this example, we define the watermark of the query on the column "timestamp", and also define "10 minutes" as the threshold of how late the data is allowed to be. If this query is run in Append output mode (discussed later in the Output Modes section), the engine will track the current event time from the column "timestamp" and wait for an additional "10 minutes" in event time before finalizing the windowed counts and adding them to the result table. Here is an illustration.

As shown in the illustration, the maximum event time tracked by the engine is the blue dashed line, and the watermark set as (max event time - '10 minutes') at the beginning of every trigger is the red line. For example, when the engine observes the data (12:14, dog), it sets the watermark for the next trigger to 12:04. For the window 12:00-12:10, the partial counts are maintained as internal state while the system waits for late data. After the system finds data (that is, (12:21, owl)) such that the watermark exceeds 12:10, the partial counts are finalized and appended to the table. These counts will not change any further, because all "too late" data older than 12:10 will be ignored.

Note that in Append output mode, the system has to wait for the late threshold before it can output the aggregation for a window. This may not be ideal if data can be very late (say, 1 day) and you would like partial counts without waiting for a day. In the future, we will add an Update output mode, which would let every updated aggregate be written to the sink at every trigger.

Conditions for watermarking to clean aggregation state: it is important to note that a watermark should satisfy the following conditions for it to clean the state of an aggregation query (as of Spark 2.1; this is subject to change in the future).

The output mode must be Append. Complete mode requires all aggregated data to be preserved, and hence cannot use watermarking to drop intermediate state. See the Output Modes section for a detailed explanation of the semantics of each output mode.

The aggregation must have either the event-time column, or a window on the event-time column.

withWatermark must be called on the same column as the timestamp column used in the aggregation. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, because the watermark is defined on a column different from the aggregation column.

withWatermark must be called before the aggregation for the watermark details to be used. For example, df.groupBy("time").count().withWatermark("time", "1 min") is invalid in Append output mode.

Join Operations

Streaming DataFrames can be joined with static DataFrames to create new streaming DataFrames. Here are a few examples.

val staticDf = spark.read. ...
val streamingDf = spark.readStream. ...

streamingDf.join(staticDf, "type")                 // inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_join")   // right outer join with a static DF
Unsupported Operations

However, note that not all operations applicable to static DataFrames/Datasets are supported on streaming DataFrames/Datasets. While some of these unsupported operations will be supported in future releases of Spark, there are others that are fundamentally hard to implement efficiently on streaming data. For example, sorting is not supported on the input stream, as it would require keeping track of all the data received in the stream; it is therefore fundamentally hard to implement efficiently. As of Spark 2.0, the unsupported operations are as follows:

Multiple streaming aggregations (that is, a chain of aggregations on a streaming DataFrame) are not yet supported on streaming Datasets.

Limit and take of the first N rows are not supported on streaming Datasets.

Distinct operations on streaming Datasets are not supported.

Sorting operations are supported on streaming Datasets only after an aggregation and in Complete output mode.

Outer joins between a streaming and a static Dataset are conditionally supported: a full outer join with a streaming Dataset is not supported; a left outer join with a streaming Dataset on the right is not supported; a right outer join with a streaming Dataset on the left is not supported. Any kind of join between two streaming Datasets is not yet supported.

In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that immediately run the query and return results, which does not make sense on a streaming Dataset. Instead, those functionalities can be achieved by explicitly starting a streaming query (see the next section); a minimal sketch of these alternatives follows.

count(): Cannot return a single count from a streaming Dataset. Instead, use ds.groupBy().count(), which returns a streaming Dataset containing a running count.

foreach(): Instead, use ds.writeStream.foreach(...) (see the next section).

show(): Instead, use the console sink (see the next section).
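A minimal sketch of these alternatives, assuming a streaming Dataset of lines named streamingDs (the name and grouping column are placeholders):

// Running count as a streaming aggregation, instead of a direct count()
val counts = streamingDs.groupBy("value").count()

// Console sink instead of show(), for debugging
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()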

If you try any of these operations, you will see an AnalysisException like "operation XYZ is not supported with streaming DataFrames/Datasets".

Starting Streaming Queries

Once the final result DataFrame/Dataset is defined, all that is left is to start the streaming computation. To do that, you use the DataStreamWriter (Scala/Java/Python docs) returned by Dataset.writeStream(). You will have to specify one or more of the following in this interface.

Details of the output sink: the data format, location, and so on.

Output mode: specifies what gets written to the output sink.

Query name: optionally, specify a unique name for the query for identification.

Trigger interval: optionally, specify the trigger interval. If it is not specified, the system checks for the availability of new data as soon as the previous processing has completed. If a trigger time is missed because the previous processing has not completed, the system attempts to trigger at the next trigger point, not immediately after the processing completes.

Checkpoint location: for output sinks where end-to-end fault tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible, fault-tolerant file system. The semantics of checkpointing are discussed in more detail in the next section. A minimal sketch of these options follows.
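The sketch below wires these options together on the DataStreamWriter; the query name, trigger interval, and paths are placeholder assumptions, and wordCounts and words refer to the earlier word-count example.

import org.apache.spark.sql.streaming.ProcessingTime   // Trigger.ProcessingTime in Spark 2.2+

// Aggregated stream written to the console sink, for debugging
val consoleQuery = wordCounts.writeStream
  .queryName("word_counts_query")            // optional unique name for the query
  .outputMode("complete")
  .format("console")
  .trigger(ProcessingTime("10 seconds"))     // optional trigger interval
  .start()

// Un-aggregated stream written to a fault-tolerant file sink with a checkpoint location
val fileQuery = words.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/path/to/output")                     // placeholder output directory
  .option("checkpointLocation", "/path/to/checkpoints")  // HDFS-compatible directory
  .start()

consoleQuery.awaitTermination()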

Output Modes

There are a few types of output modes:

Append mode (default): This is the default mode, where only the new rows added to the result table since the last trigger are output to the sink. It is supported only for queries where rows added to the result table are never going to change, so this mode guarantees that each row is output only once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, and so on support Append mode.

Complete mode: The whole result table is output to the sink after every trigger. This is supported for aggregation queries.

Update mode (not available in Spark 2.1): Only the rows in the result table that were updated since the last trigger are output to the sink. More information will be added in future releases.

Different types of stream queries support different output modes. Here is the compatibility matrix:

Query type: Queries without aggregation
Supported output modes: Append
Notes: Complete mode is not supported, as it is infeasible to keep all the data in the result table.

Query type: Queries with aggregation - aggregation on event time with a watermark
Supported output modes: Append, Complete
Notes: Append mode uses the watermark to drop old aggregation state. However, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), since by the semantics of this mode a row can be added to the result table only once, after it is finalized (that is, after the watermark is crossed). See the Handling Late Data section for more details. Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the result table.

Query type: Queries with aggregation - other aggregations
Supported output modes: Complete
Notes: Append mode is not supported, as aggregates can update and thus violate the semantics of this mode. Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the result table.
Output Sinks

There are a few types of built-in output sinks:

File sink: Stores the output in a directory.

Foreach sink: Runs arbitrary computation on the records in the output. See the following section for more details.

Console sink (for debugging): Prints the output to the console/stdout every time there is a trigger. This should be used for debugging purposes on low data volumes, as the entire output is collected and stored in the driver's memory after every trigger.

Memory sink (for debugging): The output is stored in memory as an in-memory table. Append and Complete output modes are supported. This should be used for debugging purposes on low data volumes, as the entire output is collected and stored in the driver's memory after every trigger.

The full table of all sinks and their corresponding settings can be found in the official documentation; a minimal sketch of the two debugging sinks follows.
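In this sketch the query/table name is a placeholder, and wordCounts refers to the earlier word-count example.

import org.apache.spark.sql.{ForeachWriter, Row}

// Memory sink (for debugging): results kept in an in-memory table named after the query
val memoryQuery = wordCounts.writeStream
  .queryName("word_counts")       // placeholder in-memory table name
  .outputMode("complete")
  .format("memory")
  .start()

spark.sql("SELECT * FROM word_counts").show()

// Foreach sink: run arbitrary computation on each output record
val foreachQuery = wordCounts.writeStream
  .outputMode("complete")
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true   // open a connection if needed
    def process(record: Row): Unit = println(record)             // handle one output record
    def close(errorOrNull: Throwable): Unit = ()                 // clean up
  })
  .start()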
