Lecture 92: Transformations and state management in Spark Streaming

Source: Internet
Author: User
Tags: shuffle

Contents of this issue:

1. Transformations in Spark Streaming

2. State management in Spark Streaming

One. A DStream is an abstraction on top of the RDD. A DStream combined with time continually triggers the creation of RDD instances. In other words, the operations we define on a DStream are really a template for operations on RDDs: at every batch interval the template is activated to generate concrete RDD instances and concrete jobs.
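The template idea can be sketched in plain Python. This is only a toy model under stated assumptions, not the Spark API: the class name ToyDStream and its generate method are hypothetical, and ordinary lists stand in for RDDs.

```python
from typing import Callable, List


class ToyDStream:
    """Toy stand-in for a DStream: a recorded chain of per-batch functions."""

    def __init__(self, steps=None):
        self.steps: List[Callable[[list], list]] = list(steps or [])

    def map(self, f):
        # Nothing is computed here; the template is only extended.
        return ToyDStream(self.steps + [lambda batch: [f(x) for x in batch]])

    def filter(self, pred):
        return ToyDStream(self.steps + [lambda batch: [x for x in batch if pred(x)]])

    def generate(self, batch):
        # Called once per batch interval: instantiate the template on real data.
        for step in self.steps:
            batch = step(batch)
        return batch


# Define the operations once ...
template = ToyDStream().map(lambda x: x * 2).filter(lambda x: x > 4)

# ... and each batch interval produces a concrete result from the same template.
results = [template.generate(batch) for batch in ([1, 2, 3], [4, 5])]
print(results)
```

Defining map and filter does no work by itself; work happens only when a batch arrives, which mirrors how the batch interval turns the DStream definition into concrete RDDs and jobs.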

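Two. Repartitioning is encouraged mainly for merging many partitions into fewer, which consolidates the small fragments of a stream. Turning fewer partitions into more is discouraged because it involves a shuffle. Note that at the RDD level it is coalesce that can reduce the partition count without a shuffle; repartition itself always shuffles.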

Three. A DStream is a discretized stream, and a discretized stream has no state of its own. Besides running a job for each batch interval, we often need results over the last 10 minutes or half an hour, so state has to be maintained across batches. Spark provides the function updateStateByKey(func) specifically to maintain this state. State is kept per key, so many states can be maintained at once. Each batch interval can be treated as one state step: for example, with a one-second interval, we can keep statistics for the past 10 minutes or half an hour. How the value is updated is defined by the func that is passed in.

Four. transform

transform(func)

Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

The programming logic acts on the RDD.

The transform operation allows arbitrary RDD-to-RDD functions to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch of a data stream with another dataset is not directly exposed in the DStream API, but it is easy to do with transform. This is very useful: for example, real-time data can be cleaned by joining the incoming data stream with precomputed spam information and then filtering on the result.
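The spam-cleaning example can be sketched as a plain-Python model. This is an illustration only, not Spark code: the helper transform, the function drop_spam, and the spam_senders table are hypothetical names, and each inner list stands in for one batch RDD.

```python
def transform(batches, func):
    """Toy model of DStream.transform: apply func to every batch ("RDD")."""
    return [func(batch) for batch in batches]


# Hypothetical precomputed spam information, keyed by sender.
spam_senders = {"bot@example.com": True}


def drop_spam(batch):
    # "Join" each (sender, message) record with the spam table, then filter.
    joined = [(sender, msg, spam_senders.get(sender, False)) for sender, msg in batch]
    return [(sender, msg) for sender, msg, is_spam in joined if not is_spam]


batches = [
    [("alice@example.com", "hi"), ("bot@example.com", "buy now")],
    [("bot@example.com", "click"), ("bob@example.com", "hello")],
]
cleaned = transform(batches, drop_spam)
print(cleaned)
```

The function passed to transform sees a whole batch at a time, which is why a join against another dataset fits naturally there.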

Five. updateStateByKey

updateStateByKey(func)

Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use it, you must do two things:

1. Define the state: the state can be any data type.

2. Define the state update function: specify how to update the state using the previous state and the new values from the input stream.
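The two steps above can be sketched as a running word count in plain Python. This is a toy model, not the Spark API: update_state_by_key is a hypothetical helper, and a dict stands in for the state. The argument order of update_count (new values first, previous state second) follows the order Spark uses for the update function.

```python
from collections import defaultdict


def update_count(new_values, prev_state):
    # Step 2: the state update function. prev_state is None for a new key.
    return sum(new_values) + (prev_state or 0)


def update_state_by_key(state, batch, update_func):
    """Toy model: group the batch by key, then call update_func for every key."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    keys = set(state) | set(grouped)
    return {k: update_func(grouped.get(k, []), state.get(k)) for k in keys}


state = {}  # Step 1: the state here is an int count per word
for batch in ([("a", 1), ("b", 1), ("a", 1)], [("b", 1)], [("c", 1)]):
    state = update_state_by_key(state, batch, update_count)

print(sorted(state.items()))
```

Note that the update function is also called for keys with no new values in the current batch (with an empty list), which is what lets a key's state carry forward from one batch to the next.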

Six. foreachRDD

foreachRDD(func)

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

mapWithState improves streaming state management performance by more than 10 times

The function func in foreachRDD(func) acts on the last RDD, that is, the result RDD at the end of the DStream's chain of operations. If that RDD contains no data, there is nothing to do. foreachRDD can write the data to Redis, HBase, a database, or specific files. foreachRDD itself is executed in the driver program, and func is where the action is triggered.
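The skip-empty-batches pattern can be sketched in plain Python. This is a toy model, not the Spark API: foreach_rdd and save are hypothetical helpers, and the list named sink is a stand-in for an external system such as Redis, HBase, a database, or a file.

```python
def foreach_rdd(batches, func):
    """Toy model of DStream.foreachRDD: call func once per generated batch."""
    for batch in batches:
        func(batch)


sink = []  # hypothetical stand-in for an external store


def save(batch):
    if not batch:
        # Empty batch: skip the write entirely.
        return
    # The "action": push this batch's data out as one write.
    sink.append(list(batch))


foreach_rdd([[("a", 2)], [], [("b", 1), ("c", 3)]], save)
print(sink)
```

Three batches were generated but only two writes reach the sink, because the empty batch is skipped.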

Seven. updateStateByKey

    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)

cogroup is the performance bottleneck. In every batch, all of the old state data takes part in the cogroup: even if the new pair RDD contains only one record, every old record is still cogrouped, which is very time-consuming. In theory, only the keys that appear in the new data, together with their historical records, need to be updated. Instead, everything is updated, so as much as 99% of the time is wasted and performance is very low. A shuffle may also be produced. mapWithState, described below, updates only what has to be updated, so it greatly improves performance.

mapWithState only needs to update the records that must be updated; there is no need to update all records. According to the official promotion of this API, it improves streaming state management performance by more than 10 times.
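The difference between the two approaches can be sketched by counting how many keys each one visits per batch. This is a plain-Python model under stated assumptions, not Spark code: update_all_keys and update_batch_keys are hypothetical functions, dicts stand in for the state, and the batch is assumed to be already aggregated per key.

```python
def update_all_keys(state, batch):
    """updateStateByKey-style: every key in the state is visited each batch."""
    touched = 0
    new_state = {}
    for key in set(state) | set(batch):
        new_state[key] = state.get(key, 0) + batch.get(key, 0)
        touched += 1
    return new_state, touched


def update_batch_keys(state, batch):
    """mapWithState-style: only keys present in the batch are visited."""
    touched = 0
    for key, value in batch.items():
        state[key] = state.get(key, 0) + value
        touched += 1
    return state, touched


old_state = {f"user{i}": 1 for i in range(1000)}
batch = {"user7": 5}  # a single new record

full_state, full_touched = update_all_keys(dict(old_state), batch)
incr_state, incr_touched = update_batch_keys(dict(old_state), batch)
print(full_touched, incr_touched)
```

Both functions end with the same state, but the first visits all 1000 keys while the second visits only the one key in the batch. The model shows where the saving comes from; it does not measure Spark's actual speedup.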
