(Version Customization) Lesson 3: Understanding Spark Streaming from the Standpoint of Jobs and Fault Tolerance


The contents of this lesson:

1. Spark Streaming job architecture and operating mechanism

2. Spark Streaming job fault-tolerance architecture and operating mechanism


Understanding the entire architecture and operating mechanism of a Spark Streaming job is critical to mastering Spark Streaming.

First, we run the following program, and then walk through its execution to deepen our understanding of how Spark Streaming executes stream-processing jobs. The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineForeachRDD2DB {

  def main(args: Array[String]): Unit = {

    /*
     * Step 1: Create a SparkConf object to hold the runtime configuration of the Spark program.
     * For example, setMaster sets the URL of the master of the Spark cluster the program
     * connects to; if set to "local", the program runs locally, which is particularly
     * suitable for machines with very limited resources (e.g. only 1 GB of memory)
     * and for beginners.
     */
    val conf = new SparkConf()              // Create the SparkConf object
    conf.setAppName("OnlineForeachRDD2DB")  // Set the application name, shown in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077")   // Run the program on a Spark cluster
    conf.setMaster("local[6]")              // Run locally (this second call overrides the cluster setting)

    // Set the batchDuration interval to control how frequently jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(30))

    val lines = ssc.socketTextStream("Master", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // foreachRDD is the output action that makes the JobScheduler trigger real job execution.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections;
        // it is a helper for creating connections that you need to write yourself.
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach { record =>
          val sql = "INSERT INTO streaming_itemcount (item, count) VALUES ('" + record._1 + "', " + record._2 + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        }
        ConnectionPool.returnConnection(connection)  // Return the connection to the pool for future reuse
      }
    }

    ssc.start()             // Internally calls the start method of the JobScheduler
    ssc.awaitTermination()
  }
}
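ConnectionPool above is not a Spark API but a helper you write yourself, as the comment in the code notes. A minimal sketch of such a helper, assuming a MySQL JDBC driver on the classpath; the JDBC URL, user name, and password are placeholders:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal hand-rolled connection pool: reuse an idle connection if one is
// available, otherwise open a new one; returned connections go back to the queue.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null && !conn.isClosed) conn
    else DriverManager.getConnection(
      "jdbc:mysql://Master:3306/streaming", "user", "password")  // placeholders
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}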


Note:

Calling StreamingContext.start() actually starts the start method of the JobScheduler and its message loop. Inside JobScheduler.start(), a JobGenerator and a ReceiverTracker are constructed, and their respective start methods are called:

1. Once started, the JobGenerator keeps generating jobs according to the batchDuration.

2. The ReceiverTracker first starts the receivers in the Spark cluster (in fact, it starts the ReceiverSupervisors on the executors). After a receiver receives data, the data is stored on the executor via the ReceiverSupervisor, and the metadata of the data is sent to the ReceiverTracker on the driver. Internally, the ReceiverTracker manages the received metadata through a ReceivedBlockTracker.

3. Each batchInterval produces a specific job. The job here is not a job in the Spark Core sense; it is just the DAG of RDDs generated from the DStream graph. From a Java point of view, it is equivalent to an instance of the Runnable interface. To run, the job must be submitted to the JobScheduler, which uses a thread pool to find a separate thread that submits the job to the cluster (in fact, the RDD-based action inside that thread is what triggers the real execution of the job). Why use a thread pool?

1. Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to how executors execute tasks through a thread pool.

2. Jobs may be configured to use the FAIR scheduling mode, which also requires multi-threading support (a conceptual sketch follows).
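As a conceptual illustration only, not Spark's actual source code, the pattern of handing each generated job to a thread pool can be sketched as follows. FAIR scheduling is enabled with the standard property spark.scheduler.mode; the pool size corresponds to the spark.streaming.concurrentJobs property (a real but undocumented setting, default 1):

import java.util.concurrent.Executors

// Conceptual sketch, loosely mirroring how the JobScheduler submits jobs:
// each generated "job" is just a unit of work whose body contains the RDD
// action, and a worker thread runs that body to trigger the real Spark job.
object JobSchedulerSketch {
  private val jobExecutor = Executors.newFixedThreadPool(1)  // size ~ spark.streaming.concurrentJobs

  def submitJob(jobBody: () => Unit): Unit =
    jobExecutor.execute(new Runnable {
      override def run(): Unit = jobBody()  // the RDD action inside jobBody starts the actual computation
    })
}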


The overall workflow can be summarized in the following chart:

[Figure: Spark Streaming job run flowchart — http://s3.51cto.com/wyfs02/M01/7F/BB/wKiom1cquLygGZFEAACYDmcI9sY936.png]


The fault-tolerance mechanism of Spark Streaming as a whole is based on the fault-tolerance mechanism of the RDD.

It mainly manifests in the following ways:

1. Checkpointing

2. Fault tolerance based on lineage

3. After a failure, computation resumes from the point of failure instead of being repeated from scratch

This is one of the subtleties of Spark Streaming's design.
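For example, checkpointing is enabled through the StreamingContext, and recovery uses the standard StreamingContext.getOrCreate API. A minimal sketch, with a placeholder HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  val checkpointDir = "hdfs://Master:9000/checkpoint"  // placeholder path

  // All DStream setup must happen inside this function so that the pipeline
  // can be re-created from the checkpoint after a failure.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("OnlineForeachRDD2DB").setMaster("local[6]")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // ... define the same socketTextStream / wordCounts pipeline here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if it exists; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}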


Reference Blog: http://my.oschina.net/corleone/blog/669520


Note:

Source: DT Big Data DreamWorks (IMF legendary action secret course)

For more exclusive content, please follow the WeChat public account: DT_Spark

If you are interested in big data and Spark, you can listen to teacher Liao Liang's free Spark public class every night at 20:00, in YY room number 68917580.

Life is short, you need Spark.


This article is from the "DT_Spark Big Data DreamWorks" blog; please be sure to keep this source: http://18610086859.blog.51cto.com/11484530/1770334

