Spark Streaming basic concepts

First, linking

Like Spark, Spark Streaming is available through Maven repositories. To write your own Spark Streaming program, you need to add the following dependency to your SBT or Maven project:

groupId: org.apache.spark
artifactId: spark-streaming_2.10
version: 1.2.0

To ingest data from sources that are not provided in the Spark core API, such as Kafka, Flume and Kinesis, you need to add the corresponding module spark-streaming-xyz_2.10 to the dependencies. Some of the popular modules are:

Kafka: spark-streaming-kafka_2.10
Flume: spark-streaming-flume_2.10
Kinesis: spark-streaming-kinesis-asl_2.10
Twitter: spark-streaming-twitter_2.10
ZeroMQ: spark-streaming-zeromq_2.10
MQTT: spark-streaming-mqtt_2.10
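For SBT users, the same dependency can be declared in build.sbt. This is a minimal sketch assuming Spark 1.2.0 built for Scala 2.10; adjust both versions to match your cluster:

// build.sbt (sketch)
scalaVersion := "2.10.4"

// Core Spark Streaming dependency; %% appends the Scala version (_2.10) to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.2.0"

// Add a source-specific module only if you use that source, e.g. Kafka:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0"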

Second, initialize StreamingContext

To initialize a Spark Streaming program, a StreamingContext object must be created; it is the main entry point of all Spark Streaming functionality. A StreamingContext object can be created from a SparkConf object:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

Here appName is the name your application shows on the cluster UI, and master is a Spark, Mesos or YARN cluster URL, or the special string "local[*]", which runs the program in local mode. When the program runs on a cluster, you normally do not want to hard-code master in the program; instead, launch the application with spark-submit and let it supply the value of master. For local testing or unit tests, you can pass "local[*]" to run Spark Streaming in the same process. Note that a StreamingContext internally creates a SparkContext, which you can access as ssc.sparkContext.

Once a context is defined, you have to do the following:
1. Define the input sources.
2. Define the streaming computations (transformations and output operations).
3. Start receiving and processing data with streamingContext.start().
4. Processing continues until streamingContext.stop() is called; use streamingContext.awaitTermination() to wait for it to finish.

Points to note:
1. Once a context has been started, no new streaming computations can be defined or added to it.
2. Once a context has been stopped, it cannot be restarted.
3. Only one StreamingContext can be active in a JVM at the same time.
4. Calling stop() on the StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional argument of stop() to false.
5. A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next one is created.
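Putting these steps together, the following is a minimal, self-contained sketch of the lifecycle: define an input source, declare the computation, start the context and wait for termination. The netcat server on localhost:9999, the batch interval and the word count are illustrative choices, not requirements:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLifecycle {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the receiver, at least one for processing.
    val conf = new SparkConf().setAppName("StreamingLifecycle").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

    // 1. Define the input source (assumes a netcat server: nc -lk 9999).
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Define the streaming computation and an output operation.
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    // 3. Start receiving and processing data.
    ssc.start()

    // 4. Processing continues until stop() is called; block until then.
    ssc.awaitTermination()
  }
}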

Third, discretized streams (DStreams)

Discretized Streams, or DStreams, are the basic abstraction provided by Spark Streaming, representing a continuous stream of data. A DStream is either an input stream obtained from a source or a processed stream generated by applying transformation operators to another DStream. Internally, a DStream is represented by a series of consecutive RDDs; each RDD in a DStream contains the data of one time interval. Any operation on a DStream translates into operations on the underlying RDDs.
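To make the "sequence of RDDs" model concrete, the sketch below uses queueStream (described later as a testing utility): every RDD pushed into the queue becomes the RDD of one batch, and the map on the DStream is applied to each of those RDDs in turn. The queue contents and timings are purely illustrative:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRDDs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamAsRDDs").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD placed in the queue becomes the RDD of one batch interval.
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val inputStream = ssc.queueStream(rddQueue)

    // A DStream transformation is applied to every underlying RDD in turn.
    inputStream.map(_ * 2).print()

    ssc.start()
    for (_ <- 1 to 3) {
      rddQueue += ssc.sparkContext.parallelize(1 to 5)
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}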

Fourth, input DStreams and receivers

An input DStream is a DStream that represents the stream of input data received from a source. In the Spark Streaming quick example, lines is an input DStream representing the data stream received from a netcat server. Every input DStream is associated with a Receiver object, which receives the data from the source and stores it in memory for processing.

Input DStreams represent raw data streams received from sources. Spark Streaming provides two categories of sources:
Basic sources: sources directly available in the StreamingContext API, such as file systems, socket connections and Akka actors.
Advanced sources: sources such as Kafka, Flume, Kinesis and Twitter, which are used through extra utility classes and require additional dependencies.

Note that if you want to receive multiple data streams in parallel in a streaming application, you can create multiple input DStreams. This creates multiple receivers that receive multiple data streams simultaneously. However, a receiver runs inside a Spark worker/executor as a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. It is therefore important to allocate enough cores (or threads, if running locally) for the application to both process the received data and run the receivers.

Points to note:
1. If the number of cores allocated to the application is less than or equal to the number of input DStreams/receivers, the system can only receive data but cannot process it.
2. When running locally, if the master URL is set to "local", there is only a single core to run tasks. That is not enough, because the receiver of the input DStream will occupy this core, leaving no core to process the data.

Basic sources:
File streams: read data from any file system compatible with the HDFS API. A DStream can be created by pointing the StreamingContext at a directory (see the sketch after this section). Points to note:
1. All files must have the same data format.
2. Files must be created in the dataDirectory by atomically moving or renaming them into that directory.
3. Once moved, the files must not be modified. If a file is continuously appended to, the new data will not be read.
Streams based on custom actors: DStreams can be created from data streams received through Akka actors using the streamingContext.actorStream(actorProps, actor-name) method.
RDD queue as a stream: to test a Spark Streaming application with test data, you can call streamingContext.queueStream(queueOfRDDs) to create a DStream based on a queue of RDDs. Each RDD pushed into the queue is treated as a batch of data in the DStream and processed like a stream.

Advanced sources: these sources require interfacing with external, non-Spark libraries, and some of them come with complex dependencies (such as Kafka and Flume).
Custom sources: as of Spark 1.2, these are not supported by the Python API. An input DStream can also be created from a custom source. All you have to do is implement a user-defined receiver that can receive data from the custom source and push it into Spark.

Receiver reliability: there are two types of data sources based on reliability. Reliable sources (like Kafka and Flume) allow the transferred data to be acknowledged. If the system receiving data from such a reliable source acknowledges the received data correctly, it can be ensured that no data is lost due to any kind of failure.
There are therefore two types of receivers:
Reliable receiver: a reliable receiver correctly acknowledges to a reliable source that the data has been received and stored in Spark with replication.
Unreliable receiver: these receivers do not support acknowledgement. Even for a reliable source, a developer may implement an unreliable receiver that does not acknowledge the data.
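As a concrete illustration of a basic source, the sketch below watches a directory for new text files, the simplest form of a file stream. The directory path is hypothetical; any HDFS-compatible path works, and files must be atomically moved into it as described above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Basic source: monitor a directory for newly created text files.
    // "/tmp/streaming-input" is only an illustrative path.
    val lines = ssc.textFileStream("/tmp/streaming-input")

    // Count the lines that arrive in each batch and print the result.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}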

Fifth, transformations on DStreams

Like RDDs, transformations allow the data of an input DStream to be modified. DStreams support many of the transformation operators available on RDDs. Some of the commonly used ones are:

map(func): returns a new DStream by passing each element of the source DStream through the function func.
filter(func): returns a new DStream containing only the elements of the source DStream for which func returns true.
repartition(numPartitions): changes the level of parallelism of this DStream by creating more or fewer partitions.
union(otherStream): returns a new DStream containing the union of the elements of the source DStream and otherStream.
count(): returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): returns a new DStream of single-element RDDs by aggregating the elements of each RDD of the source DStream with the function func. The function should be associative so that the computation can be parallelized.
countByValue(): when applied to a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the value of each key is its frequency in each RDD of the source DStream.
join(otherStream, [numTasks]): when applied to two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
transform(func): returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on the DStream.
updateStateByKey(func): returns a new "state" DStream, where the state of each key is updated by applying the given function to the previous state and the new values of the key.

Two of these operators deserve special attention.

updateStateByKey: this operation lets you maintain arbitrary state while continuously updating it with new information. Using it takes two steps (see the sketch after the example below):
1. Define the state; the state can be of any data type.
2. Define the state update function; it specifies how to update the state from the previous state and the new values coming from the input stream.

Example (it uses mapWithState, a newer variant of stateful processing):

val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint(".")

// Initial state RDD for the mapWithState operation
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

// Create a ReceiverInputDStream on target ip:port and count the
// words in the input stream of \n delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))

// Update the cumulative count using mapWithState
// This will give a DStream made of state (which is the cumulative count of the words)
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  val output = (word, sum)
  state.update(sum)
  output
}

val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc).initialState(initialRDD))
stateDstream.print()
ssc.start()
ssc.awaitTermination()
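The example above actually uses mapWithState, a later refinement of stateful processing. The two-step recipe described for updateStateByKey itself looks roughly like the sketch below, which reuses ssc and wordDstream from the example and keeps a running count per word as its state (checkpointing must be enabled, as it is above):

// Step 1: the state is the running count of a word (an Int).
// Step 2: the update function combines the new values of a key with its previous state.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}

// Apply it to the (word, 1) pairs; returns a DStream of (word, cumulative count).
val runningCounts = wordDstream.updateStateByKey[Int](updateFunction _)
runningCounts.print()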
Transform operation: the transform operation (along with its variants such as transformWith) allows arbitrary RDD-to-RDD functions to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch of a data stream with another dataset is not directly exposed in the DStream API; however, you can easily do this with transform. If you want to clean the incoming data in real time by joining the input stream with precomputed spam information and filtering on it, you can do so as follows:

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information

val cleanedDStream = wordCounts.transform(rdd => {
  rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
  ...
})

Sixth, output operations on DStreams

Output operations allow the data of a DStream to be pushed to external systems such as databases and file systems. Because output operations are what actually let external systems consume the transformed data, they are what trigger the execution of the DStream transformations. Currently the following output operations are defined:

print(): prints the first 10 elements of every batch of data in the DStream. This is useful for development and debugging. In the Python API it is called pprint().
saveAsObjectFiles(prefix, [suffix]): saves the contents of the DStream as files of serialized objects. The file name of each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsTextFiles(prefix, [suffix]): saves the contents of the DStream as text files. The file name of each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): saves the contents of the DStream as Hadoop files. The file name of each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Not available in the Python API.
foreachRDD(func): the most general output operation; it applies the function func to each RDD generated from the stream. The function should push the data of each RDD to an external system, such as saving the RDD to files or writing it over the network to a database. Note that func is executed in the driver process, and it usually contains RDD actions that force the computation of the streaming RDDs.

Design patterns for using foreachRDD: dstream.foreachRDD is a powerful primitive for sending data to external systems. However, it is important to understand how to use it correctly and efficiently. Some common mistakes to avoid are as follows. Writing data to an external system usually requires creating a connection object (for example, a TCP connection to a remote server) and using it to send data to the remote system. For this purpose, a developer may inadvertently create the connection object in the Spark driver and then try to use it in a Spark worker to save the records of the RDD, as in:

dstream.foreachRDD(rdd => {
  val connection = createNewConnection() // executed at the driver
  rdd.foreach(record => {
    connection.send(record) // executed at the worker
  })
})

This is incorrect, because it requires the connection object to be serialized and sent from the driver to the worker. Such connection objects can rarely be transferred between machines. The mistake may show up as a serialization error (the connection object cannot be serialized) or an initialization error (the connection object needs to be initialized at the worker), and so on. The correct solution is to create the connection object at the worker.

However, this can lead to another common mistake: creating a new connection for every record. For example:

dstream.foreachRDD(rdd => {
  rdd.foreach(record => {
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  })
})

Creating a connection object usually has time and resource overhead. Therefore, creating and destroying a connection object for every record can incur significant overhead and noticeably reduce the overall throughput of the system. A better solution is to use rdd.foreachPartition: create a single connection object per RDD partition and use it to send all the records in that partition.

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  })
})

This amortizes the connection creation overhead over all the records in a partition.
Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. A developer can maintain a static pool of connection objects and reuse objects from the pool as multiple batches of RDDs are pushed to the external system, further reducing the overhead.

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  })
})

Note that the connections in the pool should be created lazily, on demand, and should time out automatically after being idle for a while. This achieves the most efficient way of sending data to external systems.

Other points to note:
DStreams are executed lazily by the output operations, just as RDDs are executed lazily by RDD actions. Specifically, it is the RDD actions inside the DStream output operations that force the processing of the received data. Therefore, if your application has no output operation, or has an output operation such as dstream.foreachRDD() without any RDD action inside it, nothing will be executed; the system will simply receive the data and discard it.
By default, output operations are executed one at a time, in the order in which they are defined in the application.
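ConnectionPool in the snippet above is not a Spark class; it stands in for whatever pooling utility your external system offers. Purely as a hypothetical illustration, a static, lazily initialized pool over plain TCP sockets could look like the sketch below (the endpoint is made up, and idle-timeout handling is omitted for brevity):

import java.io.PrintWriter
import java.net.Socket
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical helper, not part of Spark: a connection exposing the send(record)
// method used in the snippets above, backed here by a plain TCP socket.
class Connection(host: String, port: Int) {
  private val socket = new Socket(host, port)
  private val out = new PrintWriter(socket.getOutputStream, true)
  def send(record: String): Unit = out.println(record)
  def isOpen: Boolean = !socket.isClosed
  def close(): Unit = socket.close()
}

// A static, lazily initialized pool of connections.
object ConnectionPool {
  private val host = "external-system.example.com" // illustrative endpoint
  private val port = 9999
  private lazy val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val existing = pool.poll()
    if (existing != null && existing.isOpen) existing
    else new Connection(host, port) // created lazily, on demand
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}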