Running through the source code of the Spark Streaming flow computing framework based on a case


Contents of this issue:

    • Spark Streaming + Spark SQL case demonstration
    • Running through the Spark Streaming source code based on the case

First, the case code explained:

  Dynamically compute the hottest product rankings within each e-commerce category, for example the three hottest phones in the mobile phone category, the three hottest TVs in the TV category, and so on.

  1. Case Run code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineTheTop3ItemForEachCategory2DB {
  def main(args: Array[String]): Unit = {
    /**
     * Step 1: Create a SparkConf object and set the runtime configuration for the Spark program.
     */
    val conf = new SparkConf() // Create a SparkConf object
    conf.setAppName("OnlineTheTop3ItemForEachCategory2DB") // Application name, visible in the monitoring UI while the program runs
    // conf.setMaster("spark://Master:7077") // At this point, the program runs on a Spark cluster
    conf.setMaster("local[6]") // Run locally with 6 threads

    // Set the batchDuration to control how frequently jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)

    // Key each click log by "category_item"
    val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
    //   (v1: Int, v2: Int) => v1 + v2, (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
    val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20)) // 60-second window, sliding every 20 seconds

    categoryUserClickLogsDStream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        val categoryItemRow = rdd.map { reducedItem =>
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        }

        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)))

        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)
        categoryItemDF.registerTempTable("categoryItemTable")

        val resultDataFrame = hiveContext.sql("SELECT category, item, click_count FROM" +
          " (SELECT category, item, click_count, row_number()" +
          " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable) subquery" +
          " WHERE rank <= 3")
        resultDataFrame.show()

        val resultRowRDD = resultDataFrame.rdd
        resultRowRDD.foreachPartition { partitionOfRecords =>
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null, but the partition is null")
          } else {
            // ConnectionPool is a static, lazily initialized pool of connections
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach { record =>
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
            }
            ConnectionPool.returnConnection(connection) // return to the pool for future reuse
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
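A note on running the case (inferred from the code above rather than stated in the original): the map function takes field 2 of each space-separated line as the category and field 1 as the item, so every input line needs at least three space-separated fields, for example something like "user1 iphone7 mobilephone". A simple way to feed such lines into port 9999 on the host named "Master" is netcat (nc -lk 9999), typing or piping one click log per line while the application is running. The ConnectionPool class used above is a helper from the original course code (a static, lazily initialized JDBC connection pool) and is not shown here.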

2. Case process framework diagram:

  

Second, the source code analysis based on the case:

  1. Build the Spark configuration object SparkConf and set the runtime configuration information for the Spark program:

  

  2. When building the StreamingContext and passing in a SparkConf parameter, a SparkContext is created internally:
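A minimal sketch of what this looks like, abridged and paraphrased from the Spark 1.x StreamingContext source (details vary by version; this excerpt is not compilable on its own). The point is simply that the convenience constructor taking a SparkConf builds a new SparkContext internally:

// Abridged from org.apache.spark.streaming.StreamingContext (Spark 1.x era, paraphrased)
class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {

  // The constructor used in the case code: it builds a SparkContext from the SparkConf
  def this(conf: SparkConf, batchDuration: Duration) =
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

object StreamingContext {
  private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext =
    new SparkContext(conf)
}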

  

  

 3. Creating the StreamingContext this way also shows that Spark Streaming is an application built on top of Spark Core.

  

  4. Checkpoint Persistence

5. Build socketTextStream to get the input source

  

    01. Create a socket to get the input stream
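Roughly what these two methods look like in the Spark 1.x StreamingContext source (abridged and paraphrased, not compilable on its own): socketTextStream delegates to socketStream with a bytes-to-lines converter, which in turn builds a SocketInputDStream:

// Abridged from StreamingContext (Spark 1.x era, paraphrased)
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}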

    

    02. SocketInputDStream inherits from ReceiverInputDStream and receives data by building a Receiver
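An abridged paraphrase of SocketInputDStream from the Spark 1.x source (not compilable on its own): it extends ReceiverInputDStream, and its getReceiver method builds the SocketReceiver that the next step refers to:

// Abridged from org.apache.spark.streaming.dstream.SocketInputDStream (Spark 1.x era, paraphrased)
private[streaming] class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  // Called by the ReceiverTracker/ReceiverSupervisor machinery to obtain the receiver instance
  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}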

    

    

    

    03. Create the SocketReceiver

    

    04. The Receiver obtains the relevant data over the network
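The core of SocketReceiver, abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): a daemon thread opens a plain java.net.Socket, converts the input stream into an iterator of records, and calls store() for each record, which hands the data to the ReceiverSupervisor:

// Abridged from SocketReceiver (Spark 1.x era, paraphrased)
def onStart() {
  // Start a daemon thread that receives data over the connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

/** Create a socket connection and receive data until the receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    socket = new Socket(host, port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next) // hand each record to the ReceiverSupervisor for storage
    }
    if (!isStopped()) restart("Socket data stream had no more data")
  } catch {
    case e: java.net.ConnectException => restart("Error connecting to " + host + ":" + port, e)
  } finally {
    if (socket != null) socket.close()
  }
}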

    

    05. Data output

    

    06. Generate the Job
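How a job is generated for each batch, abridged and paraphrased from JobGenerator in the Spark 1.x source (a sketch, not compilable on its own): for each batch time, the received blocks are allocated to the batch, the DStreamGraph generates jobs, and the resulting JobSet is handed to the JobScheduler:

// Abridged from org.apache.spark.streaming.scheduler.JobGenerator (Spark 1.x era, paraphrased)
private def generateJobs(time: Time) {
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to this batch
    graph.generateJobs(time)                                 // generate jobs using the allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}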

    

    07. Generate an RDD for each time interval and store the data
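The per-batch RDD comes from ReceiverInputDStream.compute. Abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): it asks the ReceiverTracker for the blocks allocated to this batch and wraps them in a BlockRDD:

// Abridged from org.apache.spark.streaming.dstream.ReceiverInputDStream (Spark 1.x era, paraphrased)
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // Before the context's start time (e.g. while recovering), return an empty BlockRDD
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Ask the ReceiverTracker for all blocks allocated to this stream for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}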

    

    

 6. Streaming Start:
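StreamingContext.start, abridged and paraphrased from the Spark 1.x source (a sketch, not compilable on its own): it starts the JobScheduler in a separate thread and then marks the context ACTIVE:

// Abridged from StreamingContext.start (Spark 1.x era, paraphrased)
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      validate()
      // Start the streaming scheduler in a new thread so that thread-local properties
      // (call sites, job groups) can be reset without affecting the current thread
      ThreadUtils.runInNewThread("streaming-start") {
        sparkContext.setCallSite(startSite.get)
        scheduler.start() // starts the JobScheduler (see the process summary below)
      }
      state = StreamingContextState.ACTIVE
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}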

    

 7. Process summary:

    01. Inside the start method of StreamingContext, what actually gets started is the start method of JobScheduler, which runs a message loop.

    02. Inside JobScheduler's start method, JobGenerator and ReceiverTracker are constructed, and the start methods of JobGenerator and ReceiverTracker are called:

        • After starting, JobGenerator keeps generating jobs according to the batchDuration;
        • ReceiverTracker first starts the Receivers in the Spark cluster (in fact, it starts the ReceiverSupervisor on the Executors);

    03. After a Receiver receives data, the data is stored on the Executor via the ReceiverSupervisor, and the metadata of the data is sent to the ReceiverTracker on the Driver.

    04. Inside the ReceiverTracker, the received metadata is managed through a ReceivedBlockTracker.

    05. Each batchInterval produces a specific Job. In fact, the Job here is not a Job in the Spark Core sense; it is only the DAG of RDDs generated based on the DStreamGraph.

    06. To run, the Job must be submitted to the JobScheduler; inside the JobScheduler, a separate thread from a thread pool submits the Job to the cluster to run (the RDD action inside that thread triggers the real job run).
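An abridged, paraphrased sketch of the two pieces of JobScheduler that the summary above refers to (Spark 1.x era source; a sketch, not compilable on its own): start() wires up the event loop, the ReceiverTracker and the JobGenerator, and submitJobSet() hands each job to a thread pool whose threads trigger the real RDD actions on the cluster:

// Abridged from org.apache.spark.streaming.scheduler.JobScheduler (Spark 1.x era, paraphrased)
def start(): Unit = synchronized {
  if (eventLoop != null) return // already started

  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start() // the message loop mentioned in item 01

  receiverTracker = new ReceiverTracker(ssc)
  receiverTracker.start() // launches receivers (ReceiverSupervisors) on the executors
  jobGenerator.start()    // generates a JobSet every batchDuration
}

def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    jobSets.put(jobSet.time, jobSet)
    // Each job runs in its own thread from the pool; the RDD action inside it
    // triggers the real Spark job on the cluster
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
  }
}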

    

  Note:

      • Data from: Liaoliang (Spark release version customization)
      • Sina Weibo: http://www.weibo.com/ilovepains

