The contents of this lesson:
1. Spark Streaming job architecture and execution mechanism
2. Spark Streaming job fault-tolerance architecture and mechanism
Understanding the overall architecture and execution mechanism of a Spark Streaming job is critical to mastering Spark Streaming.
First we run the following program, and then use its execution process to deepen our understanding of how a Spark Streaming job is executed. The code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineForeachRDD2DB {
  def main(args: Array[String]): Unit = {
    /*
     * Step 1: Create a SparkConf object to hold the runtime configuration of the Spark program.
     * For example, setMaster sets the URL of the master of the Spark cluster the program connects to;
     * if it is set to "local", the program runs locally, which is especially suitable for beginners
     * on machines with very limited resources (e.g. only 1 GB of memory).
     */
    val conf = new SparkConf()                // Create the SparkConf object
    conf.setAppName("OnlineForeachRDD2DB")    // Application name, visible in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077")     // Run on the Spark cluster
    conf.setMaster("local[6]")                // Run locally

    // Set the batchDuration interval to control how often jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(30))

    val lines = ssc.socketTextStream("Master", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // foreachRDD is the output action that triggers real job execution through the JobScheduler.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections.
        // It is a helper you need to write yourself (a sketch follows this code).
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach { record =>
          val sql = "INSERT INTO streaming_itemcount (item, count) VALUES ('" + record._1 + "'," + record._2 + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        }
        ConnectionPool.returnConnection(connection) // Return the connection to the pool for future reuse
      }
    }

    ssc.start()            // Internally calls the start method of JobScheduler
    ssc.awaitTermination()
  }
}
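The ConnectionPool used above is not part of Spark; as the comment notes, it is a helper you have to write yourself. A minimal sketch of such a helper follows, assuming a MySQL JDBC driver on the classpath; the driver class, JDBC URL, user and password are placeholders to adapt:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// A minimal, hand-written connection pool sketch (not a Spark API).
// The JDBC driver, URL, user and password below are placeholders.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://Master:3306/streaming", "root", "root")
    }
  }

  def returnConnection(conn: Connection): Unit = {
    pool.offer(conn) // keep the connection for reuse by later partitions
  }
}

Because foreachPartition runs on the executors, this object must be on the executor classpath; connections are created lazily the first time a partition asks for one.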
Note:
Calling start() on the StreamingContext actually starts the start method of the JobScheduler, i.e. its message loop. Inside JobScheduler.start(), a JobGenerator and a ReceiverTracker are constructed, and the start methods of both are called:
1. After it starts, JobGenerator keeps generating jobs according to the batchDuration.
2. ReceiverTracker first starts the receivers in the Spark cluster (in fact it launches ReceiverSupervisor on the executors). After a receiver receives data, the data is stored on the executor through ReceiverSupervisor, and the metadata of that data is sent to the ReceiverTracker on the driver. The ReceiverTracker manages the received metadata internally through a ReceivedBlockTracker.
3. Each batchInterval produces a concrete job. The job here is not the "job" in the Spark Core sense; it is only the DAG of RDDs generated from the DStreamGraph. From a Java point of view it is equivalent to an instance of the Runnable interface. To actually run, the job has to be submitted to the JobScheduler, which uses a thread pool so that a separate thread submits the job to the cluster (it is the RDD action inside that thread that triggers the real execution of the job). Why use a thread pool?
   1. Jobs are generated continuously, so a thread pool is needed for efficiency, similar to the way tasks are executed in an executor through a thread pool.
   2. A job may be scheduled with the FAIR scheduling mode, which also requires multi-threading support (see the configuration sketch after this list).
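For illustration, the FAIR mode mentioned in point 2 is the standard Spark scheduler setting, not something specific to streaming; a sketch of enabling it on the SparkConf of the program above would be:

// Let jobs submitted from different threads of the JobScheduler's pool
// share cluster resources fairly instead of running strictly FIFO.
conf.set("spark.scheduler.mode", "FAIR")
// Optionally point at a fair-scheduler allocation file (example path):
// conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")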
The overall workflow can be summarized in the following flow chart:
[Figure: Spark Streaming job run flowchart (http://s3.51cto.com/wyfs02/M01/7F/BB/wKiom1cquLygGZFEAACYDmcI9sY936.png)]
The fault-tolerance mechanism of Spark Streaming as a whole is based on the fault-tolerance mechanism of the RDD.
It mainly manifests itself as (a checkpoint usage sketch follows this list):
1. Checkpointing.
2. High fault tolerance based on lineage.
3. After a failure, the computation is resumed from the point of error instead of being redone from scratch, so no duplicate computation is introduced.
This is one of the subtleties of Spark Streaming's design.
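As an illustration of point 1, enabling checkpointing in the program above only requires pointing the StreamingContext at a reliable directory, typically combined with StreamingContext.getOrCreate so the context can be rebuilt from the checkpoint after a driver failure. A minimal sketch, assuming a hypothetical HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint directory; adapt it to your own cluster.
val checkpointDir = "hdfs://Master:9000/checkpoint/onlineforeachrdd2db"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("OnlineForeachRDD2DB").setMaster("local[6]")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir) // persist DStream metadata (and RDD checkpoints) here
  // ... define the same DStream operations as in the program above ...
  ssc
}

// Recover the context from the checkpoint if one exists, otherwise build a new one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()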
Reference Blog: http://my.oschina.net/corleone/blog/669520
(Version Customization) Lesson 3: Understanding Spark Streaming from the standpoint of jobs and fault tolerance