Contents of this issue:
- A Spark Streaming + Spark SQL case demonstration
- A walkthrough of the Spark Streaming runtime source code based on the case
First, the case code explained:
Dynamically compute the ranking of the hottest products in each category of an e-commerce site, for example the three hottest phones in the mobile phone category, the three hottest TVs in the TV category, and so on.
1. The case code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineTheTop3ItemForEachCategory2DB {
  def main(args: Array[String]) {
    /**
     * Step 1: Create a SparkConf object and set the runtime configuration for the Spark application.
     */
    val conf = new SparkConf() // create the SparkConf object
    // set the application name, which is visible in the monitoring UI while the program runs
    conf.setAppName("OnlineTheTop3ItemForEachCategory2DB")
    // conf.setMaster("spark://Master:7077") // run the program on a Spark cluster
    conf.setMaster("local[6]") // run locally with 6 threads

    // Set the batchDuration to control how often jobs are generated,
    // and create the StreamingContext, the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)
    val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
    //   (v1: Int, v2: Int) => v1 + v2, (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
    val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20)) // 20s slide; the window length did not survive the original formatting, 60s is assumed here

    categoryUserClickLogsDStream.foreachRDD { rdd => {
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        val categoryItemRow = rdd.map(reducedItem => {
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        })

        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))

        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
        categoryItemDF.registerTempTable("categoryItemTable")

        val resultDataFrame = hiveContext.sql("SELECT category, item, click_count FROM" +
          " (SELECT category, item, click_count, row_number()" +
          " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery" +
          " WHERE rank <= 3")
        resultDataFrame.show()

        val resultRowRDD = resultDataFrame.rdd
        resultRowRDD.foreachPartition { partitionOfRecords => {
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null, but partition is null")
          } else {
            // ConnectionPool is a static, lazily initialized pool of connections
            // (a user-supplied helper class, not part of Spark)
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach(record => {
              val sql = "INSERT INTO categorytop3 (category, item, client_count) VALUES ('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
            })
            ConnectionPool.returnConnection(connection) // return to the pool for future reuse
          }
        }}
      }
    }}

    ssc.start()
    ssc.awaitTermination()
  }
}
2. Case process framework diagram:
Second, the source code analysis based on the case:
1. Build the SparkConf object and set the runtime configuration for the Spark application:
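The conf.setAppName / conf.setMaster calls above just record string key/value settings. As a rough, paraphrased sketch of what SparkConf boils down to in the Spark 1.x line this case targets (simplified, not the verbatim source):

// Paraphrased sketch of SparkConf (Spark 1.x), trimmed to the essentials.
class SparkConf(loadDefaults: Boolean) extends Cloneable {
  // every setting ends up as a plain string key/value pair
  private val settings = new java.util.concurrent.ConcurrentHashMap[String, String]()

  def set(key: String, value: String): SparkConf = {
    settings.put(key, value)
    this // returns this, so calls can be chained
  }

  // setAppName / setMaster are thin wrappers around set()
  def setAppName(name: String): SparkConf = set("spark.app.name", name)
  def setMaster(master: String): SparkConf = set("spark.master", master)
}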
2. When the StreamingContext is built with a SparkConf, it internally creates a SparkContext from that conf:
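A paraphrased, simplified sketch of the relevant constructors in the Spark 1.x StreamingContext source (not verbatim):

class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration) {

  // this is the constructor the case code uses: new StreamingContext(conf, Seconds(5))
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }
}

object StreamingContext {
  private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
    new SparkContext(conf) // a StreamingContext always sits on top of a SparkContext
  }
}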
3. The StreamingContext is created; this also shows that Spark Streaming is an application running on top of Spark Core:
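Inside StreamingContext, the streaming-specific state is little more than a DStreamGraph plus a JobScheduler layered over the underlying SparkContext. A simplified, paraphrased view of the members it sets up (Spark 1.x):

private[streaming] val sc: SparkContext = sc_ // the plain Spark Core context

private[streaming] val graph: DStreamGraph = {
  val newGraph = new DStreamGraph()
  newGraph.setBatchDuration(batchDur_) // remembers the batch interval, Seconds(5) in this case
  newGraph
}

private[streaming] val scheduler = new JobScheduler(this) // drives job generation and execution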
4. Checkpoint persistence:
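ssc.checkpoint(directory) mainly creates the directory on an HDFS-compatible file system, records it, and also sets it as the SparkContext's checkpoint directory. A simplified, paraphrased sketch from the Spark 1.x source:

def checkpoint(directory: String) {
  if (directory != null) {
    val path = new org.apache.hadoop.fs.Path(directory)
    val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
    fs.mkdirs(path) // make sure the checkpoint directory exists
    val fullPath = fs.getFileStatus(path).getPath().toString
    sc.setCheckpointDir(fullPath) // RDD checkpoints go to the same place
    checkpointDir = fullPath
  } else {
    checkpointDir = null
  }
}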
5. Build the socketTextStream input source (the whole chain, sub-steps 01 to 07, is sketched in code after this sub-list):
01. socketTextStream creates the socket input stream (a SocketInputDStream).
02. SocketInputDStream inherits from ReceiverInputDStream; it receives data by building a Receiver.
03. The Receiver it builds is a SocketReceiver.
04. The receiver fetches the data over the network.
05. The received data is output, i.e. handed over via store().
06. A Job is generated for each batch.
07. An RDD is generated for each batch interval to hold the stored data.
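Putting sub-steps 01 to 07 together, here is a paraphrased and heavily simplified sketch of the Spark 1.x source chain behind socketTextStream (method bodies are trimmed to the essentials; this is not the verbatim source):

// 01/02: StreamingContext.socketTextStream builds a SocketInputDStream,
// which is a ReceiverInputDStream, i.e. a DStream fed by a Receiver.
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

// 03: the DStream's main responsibility is simply to hand back a SocketReceiver.
private[streaming] class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

// 04/05: the receiver opens a plain java.net.Socket, decodes the byte stream,
// and calls store() for every object it reads.
private[streaming] class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) {

  def onStart() {
    // receive() runs in its own thread so that onStart() returns immediately
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  def receive() {
    val socket = new java.net.Socket(host, port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next()) // hands the data to the ReceiverSupervisor
    }
    restart("Trying to connect again")
  }
}

// 06: DStream.generateJob wraps "run an action over the RDD of this batch time"
// into a streaming Job object.
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}

// 07: for every batch time, ReceiverInputDStream.compute builds a BlockRDD out of
// the block ids that the ReceiverTracker recorded for that interval.
override def compute(validTime: Time): Option[RDD[T]] = {
  val receiverTracker = ssc.scheduler.receiverTracker
  val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
  val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
  Some(new BlockRDD[T](ssc.sc, blockIds))
}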
6. Streaming start:
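ssc.start() itself does very little: it checks the context's state and then starts the JobScheduler on a separate thread. A paraphrased, simplified sketch from the Spark 1.x source (the Spark 1.5/1.6 era this kind of code targets):

def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      // start the streaming scheduler in a new thread, so that thread-local
      // properties (call site, job group) of the user thread are not affected
      ThreadUtils.runInNewThread("streaming-start") {
        sparkContext.setCallSite(startSite.get)
        scheduler.start() // JobScheduler.start(), see the process summary below
      }
      state = StreamingContextState.ACTIVE
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}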
7. Process summary (a condensed source sketch follows this list):
01. Calling start() on the StreamingContext actually invokes the JobScheduler's start method and kicks off its message loop.
02. Inside JobScheduler.start, a JobGenerator and a ReceiverTracker are constructed, and the start methods of both are called:
- After starting, the JobGenerator keeps generating jobs according to the batchDuration;
- The ReceiverTracker first starts the receivers in the Spark cluster (in fact it starts ReceiverSupervisors on the executors);
03. After a Receiver receives data, the data is stored on the executor through the ReceiverSupervisor, and the data's metadata is sent to the ReceiverTracker on the driver.
04. Inside the ReceiverTracker, the received metadata is managed through a ReceivedBlockTracker.
05. Each batchInterval produces a specific Job. The Job here is not a job in the Spark Core sense; it is merely the DAG of RDDs generated from the DStreamGraph.
06. To run, a Job has to be submitted to the JobScheduler, which uses a thread pool to submit each Job to the cluster on a separate thread (the RDD action executed in that thread is what triggers the real Spark job).
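A paraphrased, condensed sketch of the driver-side pieces just described, stitched together from the Spark 1.x JobScheduler / JobGenerator source (not verbatim; member names follow the real source):

// JobScheduler.start(): starts the message loop, then the ReceiverTracker and the JobGenerator.
def start(): Unit = synchronized {
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()                          // 01: the message loop

  receiverTracker = new ReceiverTracker(ssc) // 02: construct the ReceiverTracker ...
  receiverTracker.start()                    //     ... and launch receivers on the executors
  jobGenerator.start()                       // 02: start the JobGenerator
}

// JobGenerator: a RecurringTimer fires once per batchDuration and posts a GenerateJobs(time)
// event; handling that event builds the jobs for the batch from the DStreamGraph.
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

private def generateJobs(time: Time) {
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // 03/04: use the tracked block metadata
    graph.generateJobs(time)                                 // 05: the jobs are just a DAG of RDDs
  } match {
    case Success(jobs) =>
      jobScheduler.submitJobSet(JobSet(time, jobs))          // 06: hand the jobs to the JobScheduler
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
}

// JobScheduler.submitJobSet: each streaming Job runs on a thread-pool thread (JobHandler);
// the RDD action inside it triggers the real Spark Core job.
def submitJobSet(jobSet: JobSet) {
  jobSets.put(jobSet.time, jobSet)
  jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
}

In the JobHandler, job.run() executes the closure built by DStream.generateJob, and the RDD action inside that closure is what submits the actual Spark Core job to the cluster.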
Note:
- Source of material: Liaoliang (Spark release version customization)
- Sina Weibo: http://www.weibo.com/ilovepains