Spark (10) -- Spark Streaming API Programming


The Spark version tested in this article is 1.3.1.

Spark Streaming programming model:

Step One:
Create a StreamingContext object, which is the entry point for Spark Streaming operations. Building a StreamingContext requires two parameters:
1. A SparkConf object: holds the configuration of the Spark program, such as the cluster master, the application name, and other settings.
2. A Seconds object: sets how often the StreamingContext reads the data stream (the batch interval).

Step Two:
After constructing the entry-point object, call its input methods to read data streams arriving in various ways, such as over a socket or from HDFS, and convert the data into DStream objects so it can be operated on uniformly.

Step Three:
A DStream is itself a sequence of RDDs: after the data stream is received, Spark Streaming slices it into batches, each slice is an RDD, and the RDDs are wrapped in a DStream object for uniform handling. This is the step where the business processing of the data happens.

Step Four:
Call start and awaitTermination on the entry-point object to begin reading the data stream. A minimal skeleton combining the four steps is sketched below.
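
Putting the four steps together, a rough skeleton of the model could look like the following; the local master, port, and 10-second batch interval are placeholder choices for illustration, not taken from the article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // Step 1: build the entry point from a SparkConf and a batch interval
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSkeleton")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Step 2: read a data stream and get back a DStream
    val lines = ssc.socketTextStream("localhost", 9999)

    // Step 3: business processing on the DStream
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    // Step 4: start the computation and wait for it to finish
    ssc.start()
    ssc.awaitTermination()
  }
}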

Below, the same WordCount word count is implemented using different Spark Streaming input methods.

HDFS file test

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWordCount {
  def main(args: Array[String]) {
    // Parameter check
    if (args.length < 2) {
      System.err.println("Usage: HdfsWordCount <spark master> <hdfs directory>")
      System.exit(1)
    }
    // Step 1: create the StreamingContext entry point
    val sparkConf = new SparkConf().setMaster(args(0)).setAppName("HdfsWordCount")
    val streaming = new StreamingContext(sparkConf, Seconds(10))
    // Step 2: call textFileStream to read files from the specified path
    val data = streaming.textFileStream(args(1))
    // Step 3: business processing of the data
    // flatMap splits each line into words, then map/reduceByKey count them
    val words = data.flatMap(_.split(" "))
    val wordCount = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCount.print()
    // Step 4: start the stream and wait
    streaming.start()
    streaming.awaitTermination()
  }
}

Socket data stream test

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: NetworkWordCount <spark master> <hostname> <port>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setMaster(args(0)).setAppName("NetworkWordCount")
    val streaming = new StreamingContext(sparkConf, Seconds(10))
    // Parameters: 1. host name; 2. port number; 3. storage level
    val lines = streaming.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCount = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCount.print()
    streaming.start()
    streaming.awaitTermination()
  }
}

As you can see, the business-processing logic is the same: different data sources only require calling different methods to receive the data, and once the data is converted to a DStream the subsequent processing steps are identical.
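
To illustrate the point (this example is not from the original article), here is a sketch that swaps in queueStream, which turns a queue of RDDs into a DStream and is handy for local testing; the object name and sample data are made up, and only the ingestion line differs from the HDFS and socket versions:

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only this ingestion step differs from the other examples:
    // queueStream builds a DStream from a queue of RDDs
    val rddQueue = new Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)

    // The processing logic is identical to the other examples
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Feed a couple of test batches into the queue
    rddQueue += ssc.sparkContext.makeRDD(Seq("a b a", "b c"))
    rddQueue += ssc.sparkContext.makeRDD(Seq("c c a"))
    ssc.awaitTermination()
  }
}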

The following code goes with the socket test above. Run the jar with the java command and pass in two parameters: 1. the port number; 2. the interval (in milliseconds) at which data is generated.
It then produces data on the specified port for Spark Streaming to receive.

package streaming

import java.net.ServerSocket
import java.io.PrintWriter

object Logger {
  def generateContent(index: Int): String = {
    import scala.collection.mutable.ListBuffer
    // Fill the buffer with the uppercase letters 'A' through 'Z'
    val charList = ListBuffer[Char]()
    for (i <- 65 to 90) {
      charList += i.toChar
    }
    val charArray = charList.toArray
    charArray(index).toString()
  }

  def index = {
    import java.util.Random
    val ran = new Random
    ran.nextInt(7)
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <port> <millisecond>")
      System.exit(1)
    }
    val listener = new ServerSocket(args(0).toInt)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream(), true)
          while (true) {
            Thread.sleep(args(1).toLong)
            val content = generateContent(index)
            println(content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}

In the examples above, Seconds(10) is used, which means data is processed every 10 seconds,
and the result of one 10-second batch does not affect the next.
But sometimes we need a running total: how do we carry over the data from the previous 10-second batches?

This is what the updateStateByKey method is for: it saves the state from the previous computation so it can be used in the next one.
The code:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: StatefulWordCount <spark master> <hostname> <port>")
      System.exit(1)
    }
    // Define an anonymous function and assign it to updateFunc.
    // It is the parameter passed to updateStateByKey, and its shape must be
    // (values: Seq[Int], state: Option[Int]), where values holds the counts
    // from the current batch and state holds the total from previous batches.
    // The function adds each 10-second batch result onto the running total.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val now = values.foldLeft(0)(_ + _)
      val old = state.getOrElse(0)
      Some(now + old)
    }
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster(args(0))
    val streaming = new StreamingContext(conf, Seconds(10))
    // checkpoint writes state data to the given path; this is required to
    // protect the data -- without it updateStateByKey throws an exception
    streaming.checkpoint(".")
    val lines = streaming.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))
    // Pass updateFunc to updateStateByKey
    val stateDstream = wordDstream.updateStateByKey(updateFunc)
    stateDstream.print()
    streaming.start()
    streaming.awaitTermination()
  }
}
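
To see how the update function accumulates state across batches, here is a small standalone check of the same logic on made-up inputs; it exercises only the function itself, without any DStream machinery:

object UpdateFuncDemo {
  // The same shape of function as updateFunc in the example above
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val now = values.foldLeft(0)(_ + _)
    val old = state.getOrElse(0)
    Some(now + old)
  }

  def main(args: Array[String]): Unit = {
    // First batch: the word appeared 3 times, no previous state yet
    println(updateFunc(Seq(1, 1, 1), None))   // Some(3)
    // Second batch: 2 more occurrences on top of the stored total of 3
    println(updateFunc(Seq(1, 1), Some(3)))   // Some(5)
    // A batch with no new occurrences keeps the old total
    println(updateFunc(Seq(), Some(5)))       // Some(5)
  }
}

Each call mirrors one batch: the counts from the current batch are summed and added to whatever total the previous batches produced.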

Spark Streaming also has the concept of a window, i.e. a sliding window.

The official documentation illustrates this with a window diagram.

A sliding window is configured with two parameters:
1. Window length
2. Sliding interval
For example, a window length of 5 and a sliding interval of 2 means that every 2 seconds the data from the last 5 seconds is processed.
This kind of processing fits use cases such as Weibo's hottest-search-terms statistics:
counting the hottest search terms of the last 5 seconds, every 2 seconds.

Example code for counting the hottest search terms:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]) {
    if (args.length < 6) {
      System.err.println("Usage: WindowWordCount <spark master> <hostname> <port> <batch seconds> <window seconds> <slide seconds>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("WindowWordCount").setMaster(args(0))
    val streaming = new StreamingContext(conf, Seconds(args(3).toInt))
    // checkpoint writes data to the given path; this is required to protect
    // the data -- without it the windowed computation throws an exception
    streaming.checkpoint(".")
    val lines = streaming.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_ONLY)
    val words = lines.flatMap(_.split(" "))
    // After the map operation the data has the format (a,1), (b,1), ..., (n,1)
    // Call reduceByKeyAndWindow instead of the normal reduceByKey;
    // the last two parameters are the window length and the sliding interval
    val wordCount = words.map(x => (x, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(args(4).toInt), Seconds(args(5).toInt))
    // Sort the results in descending order.
    // DStream itself lacks some RDD operations; transform lets RDD operations
    // (such as sortByKey) run on each batch and returns a DStream again
    val sorted = wordCount.map { case (char, count) => (count, char) }
      .transform(_.sortByKey(false))
      .map { case (count, char) => (char, count) }
    sorted.print()
    streaming.start()
    streaming.awaitTermination()
  }
}
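
The descending sort above relies on transform, which exposes each batch as an RDD so that RDD-only operations can be applied. As a variation (not from the original article), the key/value swapping can be avoided by calling the RDD's sortBy inside transform; the local setup and names below are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformSortDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformSortDemo")
    val ssc = new StreamingContext(conf, Seconds(2))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    // transform hands each batch over as an RDD, so sortBy (an RDD-only
    // operation) can be used; the result is wrapped back into a DStream
    val sortedCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
      .transform(rdd => rdd.sortBy(_._2, ascending = false))
    sortedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}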

There are two ways to use reduceByKeyAndWindow:
1. reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
2. reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))

The difference between the two:

The first is simple and crude: it directly re-accumulates everything in the window.
The second way is more elegant and efficient.
For example, suppose we want the cumulative data for the window ending at t+4.
The first way adds everything from t through t+4 again.
The second way takes the already computed window ending at t+3, adds the data for t+4, and then subtracts the data for t-1 that has slid out of the window. This gives the same result as the first way, but the overlapping slices (t, t+1, t+2, t+3) are reused instead of recomputed.
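
As a sketch of the two call shapes on a hypothetical local socket source (explicitly typed lambdas are used to keep the overloads unambiguous), note that the incremental variant needs a checkpoint directory for its window state:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowVariants {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowVariants")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Required by the incremental (inverse-function) variant below
    ssc.checkpoint(".")
    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Variant 1: recompute the whole 5-second window every second
    val recomputed = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(5), Seconds(1))

    // Variant 2: add the counts that entered the window and subtract the
    // counts that left it, reusing the previous window's result
    val incremental = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(5), Seconds(1))

    recomputed.print()
    incremental.print()
    ssc.start()
    ssc.awaitTermination()
  }
}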

The above covers the basic use of the Spark Streaming API.

