Lesson 5: Going through the running source code of the Spark Streaming flow computing framework with one case-based lesson


Contents of this issue:

1. Review and demonstration of the case: dynamically computing the most popular products per category online

2. Going through the Spark Streaming running source code based on the case

First, the case code

Dynamically calculate the hottest product rankings in different e-commerce categories, such as the three hottest phones in the phone category and the three hottest TVs in the TV category.

package com.dt.spark.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Use Spark Streaming + Spark SQL to dynamically compute online the hottest product rankings in
 * different e-commerce categories, e.g. the three hottest phones and the three hottest TVs under
 * their categories. This example is very meaningful in a real production environment.
 *
 * @author DT Big Data DreamWorks
 * Sina Weibo: http://weibo.com/ilovepains/
 *
 * Implementation technique: Spark Streaming + Spark SQL. The reason Spark Streaming can use ML,
 * SQL, GraphX and the other Spark modules is that interfaces such as foreachRDD and transform
 * are actually based on the RDD. With the RDD as the cornerstone, you can use all the other
 * Spark functionality directly, as simply as calling an API.
 *
 * Assumed data format: "user item category", for example "Rocky Samsung android".
 */
object OnlineTheTop3ItemForEachCategory2DB {

  def main(args: Array[String]) {
    /**
     * Step 1: Create the SparkConf object and set the configuration information of the Spark
     * program, e.g. use setMaster to set the URL of the master of the Spark cluster the program
     * links to. If it is set to "local", the Spark program runs locally, which is especially
     * suitable for beginners with a very poor machine configuration (e.g. only 1G of memory).
     */
    val conf = new SparkConf() // create the SparkConf object
    // Set the application name; it can be seen in the monitoring UI while the program runs.
    conf.setAppName("OnlineTheTop3ItemForEachCategory2DB")
    // conf.setMaster("spark://Master:7077") // at this point the program runs on a Spark cluster
    conf.setMaster("local[6]")

    // Set the batchDuration time interval to control the frequency of job generation and create
    // the Spark Streaming execution entry point.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)

    val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
    //   (v1: Int, v2: Int) => v1 + v2, (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
    // Note: the window length and slide interval below are assumed to be 60s and 20s; the exact
    // values were garbled in the original text.
    val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20))

    categoryUserClickLogsDStream.foreachRDD { rdd => {
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        val categoryItemRow = rdd.map(reducedItem => {
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        })

        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))

        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)

        categoryItemDF.registerTempTable("categoryItemTable")

        val resultDataFram = hiveContext.sql("SELECT category,item,click_count FROM" +
          " (SELECT category,item,click_count,row_number()" +
          " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable)" +
          " subquery WHERE rank <= 3")
        resultDataFram.show()

        val resultRowRDD = resultDataFram.rdd

        resultRowRDD.foreachPartition { partitionOfRecords => {
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null but partition is null")
          } else {
            // ConnectionPool is a static, lazily initialized pool of connections
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach(record => {
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
            })
            ConnectionPool.returnConnection(connection) // return to the pool for future reuse
          }
        }}
      }
    }}

    /**
     * Calling the start method of StreamingContext actually starts the start method of
     * JobScheduler, i.e. the message loop. Inside JobScheduler's start, JobGenerator and
     * ReceiverTracker are constructed and their start methods are called:
     * 1. After startup, JobGenerator keeps generating a job per batchDuration.
     * 2. ReceiverTracker first starts the receivers in the Spark cluster (in fact, it first
     *    starts ReceiverSupervisor on the executors). After a receiver receives data, it is
     *    stored on the executor via ReceiverSupervisor, and the metadata of the data is sent to
     *    the ReceiverTracker in the driver; internally, ReceiverTracker manages the received
     *    metadata through ReceivedBlockTracker.
     * Each batchInterval produces a specific job, but the job here is not the job referred to in
     * Spark Core; it is merely the RDD DAG generated based on the DStreamGraph. From the Java
     * perspective it is equivalent to an instance of the Runnable interface. To run the job, it
     * has to be submitted to JobScheduler, which finds a separate thread in its thread pool to
     * submit the job to the cluster (in fact, the RDD-based action inside that thread triggers
     * the real job run). Why use a thread pool?
     * 1. Jobs are continuously generated, so a thread pool is needed to improve efficiency; this
     *    is similar to executing tasks through a thread pool in an executor.
     * 2. It is possible to set FAIR scheduling for the jobs, which also requires multi-threading
     *    support.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
Second, source code analysis based on the case
The main method passes the SparkConf object as a parameter into the StreamingContext constructor.
The StreamingContext constructor calls createNewSparkContext.
This method creates a SparkContext object, which shows that Spark Streaming is an application running on top of Spark Core.
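
A simplified sketch of the relevant Spark 1.x source (modifiers and bookkeeping omitted), showing how the constructor used in the case delegates to createNewSparkContext:

// org.apache.spark.streaming.StreamingContext (Spark 1.x), simplified sketch:
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

// In the companion object: createNewSparkContext simply builds a SparkContext,
// which is why Spark Streaming is just an application on top of Spark Core.
private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
  new SparkContext(conf)
}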

Persistence: the checkpoint operation
ssc.checkpoint("/root/Documents/SparkApps/checkpoint")
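
For reference, a simplified sketch of StreamingContext.checkpoint from the Spark 1.x source: it creates the directory on the underlying file system and also sets it as the SparkContext checkpoint directory.

// org.apache.spark.streaming.StreamingContext.checkpoint (Spark 1.x), simplified sketch:
def checkpoint(directory: String) {
  if (directory != null) {
    val path = new Path(directory)
    val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
    fs.mkdirs(path)
    val fullPath = fs.getFileStatus(path).getPath().toString
    sc.setCheckpointDir(fullPath)   // also used as the RDD checkpoint directory
    checkpointDir = fullPath
  } else {
    checkpointDir = null
  }
}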

Create a socketTextStream to get the input data source.
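
A simplified sketch of socketTextStream from the Spark 1.x source (the named-scope wrapper is omitted): it converts the raw bytes to lines and delegates to socketStream.

// org.apache.spark.streaming.StreamingContext.socketTextStream (Spark 1.x), simplified sketch:
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}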


socketTextStream calls socketStream to create the input stream.
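
socketStream, in turn, simply constructs a new SocketInputDStream (simplified sketch of the Spark 1.x source):

// org.apache.spark.streaming.StreamingContext.socketStream (Spark 1.x), simplified sketch:
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}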


SocketInputDStream extends the ReceiverInputDStream class, which has getReceiver(), start(), and stop() methods.
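
A simplified sketch of SocketInputDStream from the Spark 1.x source: it extends ReceiverInputDStream, and its getReceiver() returns a SocketReceiver; start() and stop() are inherited no-ops because the receivers are managed by ReceiverTracker.

// org.apache.spark.streaming.dstream.SocketInputDStream (Spark 1.x), simplified sketch:
private[streaming] class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}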




The SocketReceiver class has onStart(), onStop(), and receive() methods.
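
A simplified sketch of SocketReceiver's onStart() and onStop() from the Spark 1.x source: onStart() spawns a daemon thread that calls receive(), and onStop() has nothing to do because that thread stops itself once isStopped() returns true.

// org.apache.spark.streaming.dstream.SocketReceiver (Spark 1.x), simplified sketch:
private[streaming] class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) {

  def onStart() {
    // Start the thread that receives data over a socket connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do: the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  // receive() is shown below
}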


SocketReceiver's receive() method creates the socket connection and reads the data from the source.
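
A simplified sketch of receive() from the Spark 1.x source (logging omitted): it opens the socket, converts the input stream to an iterator, and stores each element until the receiver is stopped.

// org.apache.spark.streaming.dstream.SocketReceiver.receive (Spark 1.x), simplified sketch:
def receive() {
  var socket: Socket = null
  try {
    socket = new Socket(host, port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)   // hand each record to the ReceiverSupervisor for storage
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
    }
  }
}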


Data output: categoryUserClickLogsDStream.foreachRDD
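
A simplified sketch of DStream.foreachRDD from the Spark 1.x source (scope wrapper and closure cleaning simplified): it is an output operator that registers a ForEachDStream on the DStreamGraph, which is what later drives job generation.

// org.apache.spark.streaming.dstream.DStream.foreachRDD (Spark 1.x), simplified sketch:
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = {
  foreachRDD((r: RDD[T], t: Time) => foreachFunc(r), displayInnerRDDOps = true)
}

private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  // Registering the ForEachDStream adds it to the DStreamGraph as an output stream
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}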


Job generation
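
A simplified sketch of ForEachDStream.generateJob from the Spark 1.x source: for each batch time it asks the parent DStream for the RDD via getOrCompute and wraps the foreach function into a Job (essentially a Runnable), which JobScheduler later submits through its thread pool.

// org.apache.spark.streaming.dstream.ForEachDStream.generateJob (Spark 1.x), simplified sketch:
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}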


DStream's getOrCompute method looks up generatedRDDs to obtain the RDD for a given batch time.
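
A simplified sketch of DStream.getOrCompute from the Spark 1.x source (scope and local-property handling omitted): it first looks up generatedRDDs for the given time, and only if the RDD is missing does it call compute, optionally persist and checkpoint the result, and cache it back into generatedRDDs.

// org.apache.spark.streaming.dstream.DStream.getOrCompute (Spark 1.x), simplified sketch:
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)            // generate the RDD for this batch time
      rddOption.foreach { newRDD =>
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
        }
        if (checkpointDuration != null &&
            (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
        }
        generatedRDDs.put(time, newRDD)        // cache it for reuse within this batch
      }
      rddOption
    } else {
      None
    }
  }
}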



ssc.start() calls the start method of JobScheduler, which in turn calls receiverTracker.start() and jobGenerator.start(); the details are omitted here.
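
A simplified sketch of JobScheduler.start from the Spark 1.x source: it starts JobScheduler's own message loop (an EventLoop), then constructs and starts ReceiverTracker and finally starts JobGenerator, exactly as the comments in the case code describe.

// org.apache.spark.streaming.scheduler.JobScheduler.start (Spark 1.x), simplified sketch:
def start(): Unit = synchronized {
  if (eventLoop != null) return   // scheduler has already been started

  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
}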


Finally, with the kind permission of classmate Ding Liqing from Shanghai, the following flowchart is reprinted. It is really great!




