(Version Customization) Lesson 3: Understanding Spark Streaming from the Standpoint of Jobs and Fault Tolerance


The contents of this lesson:

1. Spark Streaming job architecture and operating mechanism

2. Spark Streaming job fault-tolerance architecture and operating mechanism


Understanding the entire architecture and operating mechanism of a Spark Streaming job is critical to mastering Spark Streaming.

First, we run the following program, and then walk through its execution to deepen our understanding of how Spark Streaming executes stream-processing jobs. The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineForeachRDD2DB {

  def main(args: Array[String]): Unit = {

    /*
     * Step 1: Create a SparkConf object to hold the runtime configuration of the Spark program.
     * For example, setMaster sets the URL of the master of the Spark cluster the program
     * connects to; if set to "local", the program runs locally, which is particularly
     * suitable for machines with very limited resources (e.g. only 1 GB of memory)
     * and for beginners.
     */
    val conf = new SparkConf()              // Create the SparkConf object
    conf.setAppName("OnlineForeachRDD2DB")  // Set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077")   // Run the program on a Spark cluster
    conf.setMaster("local[6]")              // Run locally (this second call overrides the cluster setting)

    // Set the batchDuration interval to control how frequently jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(30))

    val lines = ssc.socketTextStream("Master", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // foreachRDD is the output action that makes the JobScheduler trigger real job execution.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections;
        // it is a helper for creating connections that you need to write yourself.
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach { record =>
          val sql = "INSERT INTO streaming_itemcount (item, count) VALUES ('" + record._1 + "', " + record._2 + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        }
        ConnectionPool.returnConnection(connection)  // Return the connection to the pool for future reuse
      }
    }

    ssc.start()             // Internally calls the start method of the JobScheduler
    ssc.awaitTermination()
  }
}
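ConnectionPool above is not a Spark API but a helper you write yourself, as the comment in the code notes. A minimal sketch of such a helper, assuming a MySQL JDBC driver on the classpath; the JDBC URL, user name, and password are placeholders:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal hand-rolled connection pool: reuse an idle connection if one is
// available, otherwise open a new one; returned connections go back to the queue.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null && !conn.isClosed) conn
    else DriverManager.getConnection(
      "jdbc:mysql://Master:3306/streaming", "user", "password")  // placeholders
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}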


Note:

Calling StreamingContext.start() actually starts the start method of the JobScheduler and its message loop. Inside JobScheduler.start(), a JobGenerator and a ReceiverTracker are constructed, and their respective start methods are called:

1. Once started, the JobGenerator keeps generating jobs according to the batchDuration.

2. The ReceiverTracker first starts the receivers in the Spark cluster (in fact, it starts the ReceiverSupervisors on the executors). After a receiver receives data, the data is stored on the executor via the ReceiverSupervisor, and the metadata of the data is sent to the ReceiverTracker on the driver. Internally, the ReceiverTracker manages the received metadata through a ReceivedBlockTracker.

3. Each batchInterval produces a specific job. The job here is not a job in the Spark Core sense; it is just the DAG of RDDs generated from the DStream graph. From a Java point of view, it is equivalent to an instance of the Runnable interface. To run, the job must be submitted to the JobScheduler, which uses a thread pool to find a separate thread that submits the job to the cluster (in fact, the RDD-based action inside that thread is what triggers the real execution of the job). Why use a thread pool?

1. Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to how executors execute tasks through a thread pool.

2. Jobs may be configured to use the FAIR scheduling mode, which also requires multi-threading support (a conceptual sketch follows).
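As a conceptual illustration only, not Spark's actual source code, the pattern of handing each generated job to a thread pool can be sketched as follows. FAIR scheduling is enabled with the standard property spark.scheduler.mode; the pool size corresponds to the spark.streaming.concurrentJobs property (a real but undocumented setting, default 1):

import java.util.concurrent.Executors

// Conceptual sketch, loosely mirroring how the JobScheduler submits jobs:
// each generated "job" is just a unit of work whose body contains the RDD
// action, and a worker thread runs that body to trigger the real Spark job.
object JobSchedulerSketch {
  private val jobExecutor = Executors.newFixedThreadPool(1)  // size ~ spark.streaming.concurrentJobs

  def submitJob(jobBody: () => Unit): Unit =
    jobExecutor.execute(new Runnable {
      override def run(): Unit = jobBody()  // the RDD action inside jobBody starts the actual computation
    })
}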


The overall workflow can be summarized in the following chart:

[Figure: Spark Streaming job run flowchart — http://s3.51cto.com/wyfs02/M01/7F/BB/wKiom1cquLygGZFEAACYDmcI9sY936.png]


The fault-tolerance mechanism of Spark Streaming as a whole is based on the fault-tolerance mechanism of the RDD.

It mainly manifests in the following ways:

1. Checkpointing

2. Fault tolerance based on lineage

3. After a failure, computation resumes from the point of failure instead of being repeated from scratch

This is one of the subtleties of Spark Streaming's design.
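For example, checkpointing is enabled through the StreamingContext, and recovery uses the standard StreamingContext.getOrCreate API. A minimal sketch, with a placeholder HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  val checkpointDir = "hdfs://Master:9000/checkpoint"  // placeholder path

  // All DStream setup must happen inside this function so that the pipeline
  // can be re-created from the checkpoint after a failure.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("OnlineForeachRDD2DB").setMaster("local[6]")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // ... define the same socketTextStream / wordCounts pipeline here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if it exists; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}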


Reference Blog: http://my.oschina.net/corleone/blog/669520


Note:

Source: DT Big Data DreamWorks (IMF legendary action secret course)

For more exclusive content, please follow the WeChat public account: DT_Spark

If you are interested in big data and Spark, you can listen to teacher Liao Liang's free Spark public class every night at 20:00, in YY room number 68917580.

Life is short, you need Spark.


This article is from the "DT_Spark Big Data DreamWorks" blog; please be sure to keep this source: http://18610086859.blog.51cto.com/11484530/1770334

