Running through the source code of the Spark Streaming flow computing framework based on a case


Contents of this issue:

    • Spark Streaming + Spark SQL case demonstration
    • Running through the Spark Streaming source code based on the case

First, the case code explained:

  Dynamically compute the hottest product rankings within each e-commerce category, for example the three hottest phones in the mobile phone category, the three hottest TVs in the TV category, and so on.

  1. Case Run code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineTheTop3ItemForEachCategory2DB {
  def main(args: Array[String]): Unit = {
    /**
     * Step 1: Create a SparkConf object and set the runtime configuration for the Spark program.
     */
    val conf = new SparkConf() // Create a SparkConf object
    conf.setAppName("OnlineTheTop3ItemForEachCategory2DB") // Application name, visible in the monitoring UI while the program runs
    // conf.setMaster("spark://Master:7077") // At this point, the program runs on a Spark cluster
    conf.setMaster("local[6]") // Run locally with 6 threads

    // Set the batchDuration to control how frequently jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)

    // Key each click log by "category_item"
    val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
    //   (v1: Int, v2: Int) => v1 + v2, (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
    val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20)) // 60-second window, sliding every 20 seconds

    categoryUserClickLogsDStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        val categoryItemRow = rdd.map { reducedItem =>
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        }

        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)))

        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
        categoryItemDF.registerTempTable("categoryItemTable")

        val resultDataFrame = hiveContext.sql("SELECT category, item, click_count FROM" +
          " (SELECT category, item, click_count, row_number()" +
          " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery" +
          " WHERE rank <= 3")
        resultDataFrame.show()

        val resultRowRDD = resultDataFrame.rdd
        resultRowRDD.foreachPartition { partitionOfRecords =>
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null, but the partition is null")
          } else {
            // ConnectionPool is a static, lazily initialized pool of connections
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach { record =>
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
            }
            ConnectionPool.returnConnection(connection) // return to the pool for future reuse
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
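A note on running the case (inferred from the code above rather than stated in the original): the map function takes field 2 of each space-separated line as the category and field 1 as the item, so every input line needs at least three space-separated fields, for example something like "user1 iphone7 mobilephone". A simple way to feed such lines into port 9999 on the host named "Master" is netcat (nc -lk 9999), typing or piping one click log per line while the application is running. The ConnectionPool class used above is a helper from the original course code (a static, lazily initialized JDBC connection pool) and is not shown here.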

2. Case process framework diagram:

  

Second, the source code analysis based on the case:

  1. Build the Spark configuration object SparkConf and set the runtime configuration information for the Spark program:

  

  2. When building the StreamingContext and passing in a SparkConf parameter, a SparkContext is created internally:
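A minimal sketch of what this looks like, abridged and paraphrased from the Spark 1.x StreamingContext source (details vary by version; this excerpt is not compilable on its own). The point is simply that the convenience constructor taking a SparkConf builds a new SparkContext internally:

// Abridged from org.apache.spark.streaming.StreamingContext (Spark 1.x era, paraphrased)
class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {

  // The constructor used in the case code: it builds a SparkContext from the SparkConf
  def this(conf: SparkConf, batchDuration: Duration) =
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

object StreamingContext {
  private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext =
    new SparkContext(conf)
}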

  

  

 3. Creating the StreamingContext this way also shows that Spark Streaming is an application built on top of Spark Core.

  

  4. Checkpoint Persistence

5. Build socketTextStream to get the input source

  

    01. Create a socket to get the input stream
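Roughly what these two methods look like in the Spark 1.x StreamingContext source (abridged and paraphrased, not compilable on its own): socketTextStream delegates to socketStream with a bytes-to-lines converter, which in turn builds a SocketInputDStream:

// Abridged from StreamingContext (Spark 1.x era, paraphrased)
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}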

    

    02. SocketInputDStream inherits from ReceiverInputDStream and receives data by building a Receiver
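An abridged paraphrase of SocketInputDStream from the Spark 1.x source (not compilable on its own): it extends ReceiverInputDStream, and its getReceiver method builds the SocketReceiver that the next step refers to:

// Abridged from org.apache.spark.streaming.dstream.SocketInputDStream (Spark 1.x era, paraphrased)
private[streaming] class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  // Called by the ReceiverTracker/ReceiverSupervisor machinery to obtain the receiver instance
  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}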

    

    

    

    03. Create the SocketReceiver

    

    04. The Receiver obtains the relevant data over the network
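The core of SocketReceiver, abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): a daemon thread opens a plain java.net.Socket, converts the input stream into an iterator of records, and calls store() for each record, which hands the data to the ReceiverSupervisor:

// Abridged from SocketReceiver (Spark 1.x era, paraphrased)
def onStart() {
  // Start a daemon thread that receives data over the connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

/** Create a socket connection and receive data until the receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    socket = new Socket(host, port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next) // hand each record to the ReceiverSupervisor for storage
    }
    if (!isStopped()) restart("Socket data stream had no more data")
  } catch {
    case e: java.net.ConnectException => restart("Error connecting to " + host + ":" + port, e)
  } finally {
    if (socket != null) socket.close()
  }
}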

    

    05. Data output

    

    06. Generate the Job
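How a job is generated for each batch, abridged and paraphrased from JobGenerator in the Spark 1.x source (a sketch, not compilable on its own): for each batch time, the received blocks are allocated to the batch, the DStreamGraph generates jobs, and the resulting JobSet is handed to the JobScheduler:

// Abridged from org.apache.spark.streaming.scheduler.JobGenerator (Spark 1.x era, paraphrased)
private def generateJobs(time: Time) {
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to this batch
    graph.generateJobs(time)                                 // generate jobs using the allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}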

    

    07. Generate an RDD for each time interval and store the data
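The per-batch RDD comes from ReceiverInputDStream.compute. Abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): it asks the ReceiverTracker for the blocks allocated to this batch and wraps them in a BlockRDD:

// Abridged from org.apache.spark.streaming.dstream.ReceiverInputDStream (Spark 1.x era, paraphrased)
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // Before the context's start time (e.g. while recovering), return an empty BlockRDD
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Ask the ReceiverTracker for all blocks allocated to this stream for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}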

    

    

 6. Streaming Start:
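StreamingContext.start, abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): it starts the JobScheduler in a separate thread and then marks the context ACTIVE:

// Abridged from StreamingContext.start (Spark 1.x era, paraphrased)
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      validate()
      // Start the streaming scheduler in a new thread so that thread-local properties
      // (call sites, job groups) can be reset without affecting the current thread
      ThreadUtils.runInNewThread("streaming-start") {
        sparkContext.setCallSite(startSite.get)
        scheduler.start() // starts the JobScheduler (see the process summary below)
      }
      state = StreamingContextState.ACTIVE
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}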

    

 7. Process summary:

    01. Inside the start method of StreamingContext, what actually gets started is the start method of JobScheduler, which runs a message loop.

    02. Inside JobScheduler's start method, JobGenerator and ReceiverTracker are constructed, and the start methods of JobGenerator and ReceiverTracker are called:

        • After starting, JobGenerator keeps generating jobs according to the batchDuration;
        • ReceiverTracker first starts the Receivers in the Spark cluster (in fact, it starts the ReceiverSupervisor on the Executors);

    03. After a Receiver receives data, the data is stored on the Executor via the ReceiverSupervisor, and the metadata of the data is sent to the ReceiverTracker on the Driver.

    04. Inside the ReceiverTracker, the received metadata is managed through a ReceivedBlockTracker.

    05. Each batchInterval produces a specific Job. In fact, the Job here is not a Job in the Spark Core sense; it is only the DAG of RDDs generated based on the DStreamGraph.

    06. To run, the Job must be submitted to the JobScheduler; inside the JobScheduler, a separate thread from a thread pool submits the Job to the cluster to run (the RDD action inside that thread triggers the real job run).
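An abridged, paraphrased sketch of the two pieces of JobScheduler that the summary above refers to (Spark 1.x era source; a sketch, not compilable on its own): start() wires up the event loop, the ReceiverTracker and the JobGenerator, and submitJobSet() hands each job to a thread pool whose threads trigger the real RDD actions on the cluster:

// Abridged from org.apache.spark.streaming.scheduler.JobScheduler (Spark 1.x era, paraphrased)
def start(): Unit = synchronized {
  if (eventLoop != null) return // already started

  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start() // the message loop mentioned in item 01

  receiverTracker = new ReceiverTracker(ssc)
  receiverTracker.start() // launches receivers (ReceiverSupervisors) on the executors
  jobGenerator.start()    // generates a JobSet every batchDuration
}

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    jobSets.put(jobSet.time, jobSet)
    // Each job runs in its own thread from the pool; the RDD action inside it
    // triggers the real Spark job on the cluster
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
  }
}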

    

  Note:

      • Data from: Liaoliang (Spark release version customization)
      • Sina Weibo: http://www.weibo.com/ilovepains

