Lesson 5: Going through the running source code of the Spark Streaming flow computing framework with one case-based lesson


Contents of this issue:

1. Review and demonstration of the case: dynamically computing the most popular products per category online

2. Going through the Spark Streaming running source code based on the case

First, the case code

Dynamically calculate the hottest product rankings in different e-commerce categories, such as the three hottest phones in the phone category and the three hottest TVs in the TV category.

package com.dt.spark.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Use Spark Streaming + Spark SQL to dynamically compute online the hottest product rankings in
 * different e-commerce categories, e.g. the three hottest phones and the three hottest TVs under
 * their categories. This example is very meaningful in a real production environment.
 *
 * @author DT Big Data DreamWorks
 * Sina Weibo: http://weibo.com/ilovepains/
 *
 * Implementation technique: Spark Streaming + Spark SQL. The reason Spark Streaming can use ML,
 * SQL, GraphX and the other Spark modules is that interfaces such as foreachRDD and transform
 * are actually based on the RDD. With the RDD as the cornerstone, you can use all the other
 * Spark functionality directly, as simply as calling an API.
 *
 * Assumed data format: "user item category", for example "Rocky Samsung android".
 */
object OnlineTheTop3ItemForEachCategory2DB {

  def main(args: Array[String]) {
    /**
     * Step 1: Create the SparkConf object and set the configuration information of the Spark
     * program, e.g. use setMaster to set the URL of the master of the Spark cluster the program
     * links to. If it is set to "local", the Spark program runs locally, which is especially
     * suitable for beginners with a very poor machine configuration (e.g. only 1G of memory).
     */
    val conf = new SparkConf() // create the SparkConf object
    // Set the application name; it can be seen in the monitoring UI while the program runs.
    conf.setAppName("OnlineTheTop3ItemForEachCategory2DB")
    // conf.setMaster("spark://Master:7077") // at this point the program runs on a Spark cluster
    conf.setMaster("local[6]")

    // Set the batchDuration time interval to control the frequency of job generation and create
    // the Spark Streaming execution entry point.
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/root/Documents/SparkApps/checkpoint")

    val userClickLogsDStream = ssc.socketTextStream("Master", 9999)

    val formattedUserClickLogsDStream = userClickLogsDStream.map(clickLog =>
      (clickLog.split(" ")(2) + "_" + clickLog.split(" ")(1), 1))

    // val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
    //   (v1: Int, v2: Int) => v1 + v2, (v1: Int, v2: Int) => v1 - v2, Seconds(60), Seconds(20))
    // Note: the window length and slide interval below are assumed to be 60s and 20s; the exact
    // values were garbled in the original text.
    val categoryUserClickLogsDStream = formattedUserClickLogsDStream.reduceByKeyAndWindow(
      _ + _, _ - _, Seconds(60), Seconds(20))

    categoryUserClickLogsDStream.foreachRDD { rdd => {
      if (rdd.isEmpty()) {
        println("No data inputted!!!")
      } else {
        val categoryItemRow = rdd.map(reducedItem => {
          val category = reducedItem._1.split("_")(0)
          val item = reducedItem._1.split("_")(1)
          val click_count = reducedItem._2
          Row(category, item, click_count)
        })

        val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("item", StringType, true),
          StructField("click_count", IntegerType, true)
        ))

        val hiveContext = new HiveContext(rdd.context)
        val categoryItemDF = hiveContext.createDataFrame(categoryItemRow, structType)

        categoryItemDF.registerTempTable("categoryItemTable")

        val resultDataFram = hiveContext.sql("SELECT category,item,click_count FROM" +
          " (SELECT category,item,click_count,row_number()" +
          " OVER (PARTITION BY category ORDER BY click_count DESC) rank FROM categoryItemTable)" +
          " subquery WHERE rank <= 3")
        resultDataFram.show()

        val resultRowRDD = resultDataFram.rdd

        resultRowRDD.foreachPartition { partitionOfRecords => {
          if (partitionOfRecords.isEmpty) {
            println("This RDD is not null but partition is null")
          } else {
            // ConnectionPool is a static, lazily initialized pool of connections
            val connection = ConnectionPool.getConnection()
            partitionOfRecords.foreach(record => {
              val sql = "insert into categorytop3(category,item,client_count) values('" +
                record.getAs("category") + "','" + record.getAs("item") + "'," +
                record.getAs("click_count") + ")"
              val stmt = connection.createStatement()
              stmt.executeUpdate(sql)
            })
            ConnectionPool.returnConnection(connection) // return to the pool for future reuse
          }
        }}
      }
    }}

    /**
     * Calling the start method of StreamingContext actually starts the start method of
     * JobScheduler, i.e. the message loop. Inside JobScheduler's start, JobGenerator and
     * ReceiverTracker are constructed and their start methods are called:
     * 1. After startup, JobGenerator keeps generating a job per batchDuration.
     * 2. ReceiverTracker first starts the receivers in the Spark cluster (in fact, it first
     *    starts ReceiverSupervisor on the executors). After a receiver receives data, it is
     *    stored on the executor via ReceiverSupervisor, and the metadata of the data is sent to
     *    the ReceiverTracker in the driver; internally, ReceiverTracker manages the received
     *    metadata through ReceivedBlockTracker.
     * Each batchInterval produces a specific job, but the job here is not the job referred to in
     * Spark Core; it is merely the RDD DAG generated based on the DStreamGraph. From the Java
     * perspective it is equivalent to an instance of the Runnable interface. To run the job, it
     * has to be submitted to JobScheduler, which finds a separate thread in its thread pool to
     * submit the job to the cluster (in fact, the RDD-based action inside that thread triggers
     * the real job run). Why use a thread pool?
     * 1. Jobs are continuously generated, so a thread pool is needed to improve efficiency; this
     *    is similar to executing tasks through a thread pool in an executor.
     * 2. It is possible to set FAIR scheduling for the jobs, which also requires multi-threading
     *    support.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
Second, source code analysis based on the case
The main method passes the SparkConf object as a parameter into the StreamingContext constructor.
The StreamingContext constructor calls createNewSparkContext.
This method creates a SparkContext object, which shows that Spark Streaming is an application running on top of Spark Core.
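
A simplified sketch of the relevant Spark 1.x source (modifiers and bookkeeping omitted), showing how the constructor used in the case delegates to createNewSparkContext:

// org.apache.spark.streaming.StreamingContext (Spark 1.x), simplified sketch:
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

// In the companion object: createNewSparkContext simply builds a SparkContext,
// which is why Spark Streaming is just an application on top of Spark Core.
private[streaming] def createNewSparkContext(conf: SparkConf): SparkContext = {
  new SparkContext(conf)
}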

Persistence: the checkpoint operation
ssc.checkpoint("/root/Documents/SparkApps/checkpoint")
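
For reference, a simplified sketch of StreamingContext.checkpoint from the Spark 1.x source: it creates the directory on the underlying file system and also sets it as the SparkContext checkpoint directory.

// org.apache.spark.streaming.StreamingContext.checkpoint (Spark 1.x), simplified sketch:
def checkpoint(directory: String) {
  if (directory != null) {
    val path = new Path(directory)
    val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
    fs.mkdirs(path)
    val fullPath = fs.getFileStatus(path).getPath().toString
    sc.setCheckpointDir(fullPath)   // also used as the RDD checkpoint directory
    checkpointDir = fullPath
  } else {
    checkpointDir = null
  }
}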

Create a socketTextStream to get the input data source.
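
A simplified sketch of socketTextStream from the Spark 1.x source (the named-scope wrapper is omitted): it converts the raw bytes to lines and delegates to socketStream.

// org.apache.spark.streaming.StreamingContext.socketTextStream (Spark 1.x), simplified sketch:
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}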


socketTextStream calls socketStream to create the input stream.
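
socketStream, in turn, simply constructs a new SocketInputDStream (simplified sketch of the Spark 1.x source):

// org.apache.spark.streaming.StreamingContext.socketStream (Spark 1.x), simplified sketch:
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}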


SocketInputDStream extends the ReceiverInputDStream class, which has getReceiver(), start(), and stop() methods.
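
A simplified sketch of SocketInputDStream from the Spark 1.x source: it extends ReceiverInputDStream, and its getReceiver() returns a SocketReceiver; start() and stop() are inherited no-ops because the receivers are managed by ReceiverTracker.

// org.apache.spark.streaming.dstream.SocketInputDStream (Spark 1.x), simplified sketch:
private[streaming] class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}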




The SocketReceiver class has onStart(), onStop(), and receive() methods.
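
A simplified sketch of SocketReceiver's onStart() and onStop() from the Spark 1.x source: onStart() spawns a daemon thread that calls receive(), and onStop() has nothing to do because that thread stops itself once isStopped() returns true.

// org.apache.spark.streaming.dstream.SocketReceiver (Spark 1.x), simplified sketch:
private[streaming] class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) {

  def onStart() {
    // Start the thread that receives data over a socket connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // There is nothing much to do: the thread calling receive()
    // is designed to stop by itself if isStopped() returns false
  }

  // receive() is shown below
}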


SocketReceiver's receive() method creates the socket connection and reads the data from the source.
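
A simplified sketch of receive() from the Spark 1.x source (logging omitted): it opens the socket, converts the input stream to an iterator, and stores each element until the receiver is stopped.

// org.apache.spark.streaming.dstream.SocketReceiver.receive (Spark 1.x), simplified sketch:
def receive() {
  var socket: Socket = null
  try {
    socket = new Socket(host, port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)   // hand each record to the ReceiverSupervisor for storage
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
    }
  }
}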


Data output: categoryUserClickLogsDStream.foreachRDD
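
A simplified sketch of DStream.foreachRDD from the Spark 1.x source (scope wrapper and closure cleaning simplified): it is an output operator that registers a ForEachDStream on the DStreamGraph, which is what later drives job generation.

// org.apache.spark.streaming.dstream.DStream.foreachRDD (Spark 1.x), simplified sketch:
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = {
  foreachRDD((r: RDD[T], t: Time) => foreachFunc(r), displayInnerRDDOps = true)
}

private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  // Registering the ForEachDStream adds it to the DStreamGraph as an output stream
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}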


Job generation
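
A simplified sketch of ForEachDStream.generateJob from the Spark 1.x source: for each batch time it asks the parent DStream for the RDD via getOrCompute and wraps the foreach function into a Job (essentially a Runnable), which JobScheduler later submits through its thread pool.

// org.apache.spark.streaming.dstream.ForEachDStream.generateJob (Spark 1.x), simplified sketch:
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}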


DStream's getOrCompute method looks up generatedRDDs to obtain the RDD for a given batch time.
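
A simplified sketch of DStream.getOrCompute from the Spark 1.x source (scope and local-property handling omitted): it first looks up generatedRDDs for the given time, and only if the RDD is missing does it call compute, optionally persist and checkpoint the result, and cache it back into generatedRDDs.

// org.apache.spark.streaming.dstream.DStream.getOrCompute (Spark 1.x), simplified sketch:
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    if (isTimeValid(time)) {
      val rddOption = compute(time)            // generate the RDD for this batch time
      rddOption.foreach { newRDD =>
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
        }
        if (checkpointDuration != null &&
            (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
        }
        generatedRDDs.put(time, newRDD)        // cache it for reuse within this batch
      }
      rddOption
    } else {
      None
    }
  }
}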



ssc.start() calls the start method of JobScheduler, which in turn calls receiverTracker.start() and jobGenerator.start(); the details are omitted here.
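
A simplified sketch of JobScheduler.start from the Spark 1.x source: it starts JobScheduler's own message loop (an EventLoop), then constructs and starts ReceiverTracker and finally starts JobGenerator, exactly as the comments in the case code describe.

// org.apache.spark.streaming.scheduler.JobScheduler.start (Spark 1.x), simplified sketch:
def start(): Unit = synchronized {
  if (eventLoop != null) return   // scheduler has already been started

  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
}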


Finally, with the kind permission of classmate Ding Liqing from Shanghai, the following flowchart is reprinted. It is really great!




