In the previous post, the stock-forecast examples were explained briefly; some of the key technical points are summarized below.
1, updateStateByKey
Since Spark 1.6 provides an alternative to this function, mapWithState, which is said to be more efficient, I took the opportunity to study how that function is used.
```scala
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType]
```
Above is the prototype of the function. It takes a StateSpec object, which is essentially a wrapper around the parameters that updateStateByKey used to take directly. StateSpec has four type parameters: the key type, the value type, the state type, and the mapped (result) type. Understanding these four types is the key point, and it is where mapWithState differs a little from updateStateByKey. K and V need no further explanation. The state can be of any type: Float, (Float, Int), an object, and so on. MappedType is the type of the mapping result, meaning the function can return values of any type, which is slightly different from updateStateByKey. Here is an example:
```scala
/**
 * The function passed to mapWithState maps each input (key, value) pair using
 * the state kept for that key. Each input here is a (stockName, stockPrice)
 * pair, and the state is the last price seen for that stockName. The incoming
 * stockPrice replaces the last price in the state (state.update). The mapping
 * result is (stockName, (stockPrice - lastPrice, 1)); of course it could be
 * something else, e.g. (stockName, direction of the last price change).
 */
val updatePriceTrend = (key: String, newPrice: Option[Float], state: State[Float]) => {
  val lstPrice: Float = state.getOption().getOrElse(newPrice.getOrElse(0.0f))
  state.update(newPrice.getOrElse(0.0f))
  // println(new SimpleDateFormat("HH:mm:ss").format(new Date()) + " - " +
  //   newPrice.getOrElse(0.0f) + "," + lstPrice)
  (key, (newPrice.getOrElse(0.0f) - lstPrice, 1))
}
```
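For completeness, here is a minimal sketch of how this function might be wired into a stream. The socket source, host/port, and checkpoint path are assumptions for illustration; only mapWithState and StateSpec.function come from the Spark Streaming API itself.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("StockTrend").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/stock-checkpoint") // mapWithState requires checkpointing

// Hypothetical source: lines of "stockName,price" arriving on a local socket.
val stockPrice = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(arr => (arr(0), arr(1).toFloat))

// StateSpec.function wraps the (key, value, state) => result function above.
val priceTrend = stockPrice.mapWithState(StateSpec.function(updatePriceTrend))
priceTrend.print()

ssc.start()
ssc.awaitTermination()
```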
2, reduceByKeyAndWindow
In the previous post this function was already used, but since that code was written by following the official example, the specific behavior of the function was not well understood. Below is the optimized code.
```scala
val reduceFunc = (reduced: (Float, Int), newPair: (Float, Int)) => {
  if (newPair._1 > 0)
    (reduced._1 + newPair._1, reduced._2 + newPair._2)
  else
    (reduced._1 + newPair._1, reduced._2 - newPair._2)
}

// invReduceFunc must undo reduceFunc, so the old pair's contribution
// is subtracted back out of the accumulated result.
val invReduceFunc = (reduced: (Float, Int), oldPair: (Float, Int)) => {
  if (oldPair._1 > 0)
    (reduced._1 - oldPair._1, reduced._2 - oldPair._2)
  else
    (reduced._1 - oldPair._1, reduced._2 + oldPair._2)
}

/**
 * Every slideLen batch intervals, compute an RDD over the data of the past
 * windowLen batch intervals (including the current batch).
 */
val windowedPriceChanges = stockPrice.reduceByKeyAndWindow(
  reduceFunc,
  invReduceFunc,
  Seconds(3), // windowLen
  Seconds(1)  // slideLen
)
```
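One caveat the code above doesn't show: the inverse-function variant of reduceByKeyAndWindow keeps running state across batches, so Spark Streaming requires checkpointing to be enabled, and the API also accepts an optional filter function for evicting keys from the window state. A sketch follows; the checkpoint path and the eviction predicate are illustrative assumptions, not part of the original example.

```scala
ssc.checkpoint("/tmp/stock-checkpoint") // required by the inverse-function variant

// Same computation as above, plus a hypothetical eviction predicate:
// keys whose accumulated pair has returned to all-zero are dropped from state.
val windowedWithEviction = stockPrice.reduceByKeyAndWindow(
  reduceFunc,
  invReduceFunc,
  Seconds(3),
  Seconds(1),
  filterFunc = (kv: (String, (Float, Int))) => kv._2 != (0.0f, 0)
)
```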
Two of these functions are critical: reduceFunc and invReduceFunc. reduceFunc processes the data entering the window, while invReduceFunc processes the data leaving the window. So how should "entering" and "leaving" the window be understood? We first need the basic meaning of a window function; the concept of a sliding window is shown in the figure.
As shown, all the RDDs that fall within one sliding window (the window length) are merged to produce the RDD of the windowed DStream. Every window operation has two parameters:
- Window length: the duration of the window (3 in the figure), i.e. the span of past time covered by each window operation (in the figure the window covers 3 batch intervals, with the batch interval as the unit of time).
- Sliding interval: the interval at which the window operation is performed (2 in the figure), i.e. how often the window computation is triggered.

Both of these parameters must be multiples of the batch interval of the source DStream (1 in the figure). That is, the window length and the sliding interval are integer multiples of the batch interval, which is the value passed in when the StreamingContext is constructed.
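As a side note, reduceByKeyAndWindow also has a simpler variant that takes no inverse function and recomputes the whole window on every slide. A minimal sketch, reusing reduceFunc and stockPrice from above:

```scala
// Without an inverse function, Spark recomputes the full window on each slide:
// simpler to reason about, but more work per slide for long windows.
val recomputedChanges = stockPrice.reduceByKeyAndWindow(
  reduceFunc,
  Seconds(3), // window length
  Seconds(1)  // sliding interval
)
```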
Then, at time 5, the data processed by reduceFunc is that of times 4 and 5, and the data processed by invReduceFunc is that of times 1 and 2. One subtlety needs attention here: "the window at time 5" should be understood as the last moment of time 5. If the time unit here is one second, then time 5 actually means the last moment of the 5th second, i.e. the beginning of the 6th second. This will be explained in detail in a later post.
With that, the key points are almost covered. reduceFunc is easy to understand: its first parameter, reduced, can be taken as the final result already computed for time 3, and its second parameter is in fact the data of times 4 and 5 (so the function is called multiple times). How, then, are the two batches of time 4 and time 5 aggregated? Again by calling reduceFunc: the records of the two batches are folded in one by one, in chronological order, which is effectively a left reduce. invReduceFunc works the same way on the records leaving the window.
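To make the mechanics concrete, here is a rough pure-Scala sketch of the idea as the window slides from covering times 1-3 to covering times 3-5; the numeric (priceChange, count) pairs are made up for illustration.

```scala
// Hypothetical per-batch values for one key, as (priceChange, count) pairs.
val time1to3 = Seq((0.5f, 1), (-0.2f, 1), (0.1f, 1)) // old window: times 1, 2, 3
val entering = Seq((0.3f, 1), (-0.4f, 1))            // times 4 and 5
val leaving  = Seq((0.5f, 1), (-0.2f, 1))            // times 1 and 2

// The old window result is a left fold with reduceFunc, record by record.
val oldWindow = time1to3.reduceLeft(reduceFunc)

// Sliding: fold in the entering batches, then undo the leaving ones.
val newWindow =
  leaving.foldLeft(entering.foldLeft(oldWindow)(reduceFunc))(invReduceFunc)

// Recomputing from scratch over times 3..5 gives the same value
// (up to floating-point rounding).
val recomputed = (Seq((0.1f, 1)) ++ entering).reduceLeft(reduceFunc)
println(s"incremental = $newWindow, recomputed = $recomputed")
```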
OK, the two key functions have now been explained. If anything is still unclear, feel free to leave a comment. Finally, the source code is at: http://git.oschina.net/gabry_wu/BigDataPractice