Lesson 3: A thorough understanding of Spark Streaming through cases: decrypting the Spark Streaming runtime mechanism and architecture, advanced topics on jobs and fault tolerance


Understanding the overall architecture and runtime mechanism of a Spark Streaming job is critical to mastering Spark Streaming.

First we run the following program; by walking through its execution we can deepen our understanding of how a Spark Streaming job is processed. The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OnlineForeachRDD2DB {

  def main(args: Array[String]): Unit = {

    /*
     * Step 1: Create the Spark configuration object SparkConf and set the runtime configuration
     * of the Spark program. For example, use setMaster to set the URL of the master of the Spark
     * cluster the program connects to; if it is set to "local", the Spark program runs locally,
     * which is particularly suitable for machines with very limited resources (e.g. only 1 GB of
     * memory) and for beginners.
     */
    val conf = new SparkConf() // Create the SparkConf object
    conf.setAppName("OnlineForeachRDD") // Set the application name; it is shown in the monitoring UI while the program runs
    conf.setMaster("spark://Master:7077") // The program runs on the Spark cluster
    conf.setMaster("local[6]")

    // Set the batchDuration interval to control the frequency of job generation, and create the entry point of Spark Streaming execution
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("Master", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // ConnectionPool is a static, lazily initialized pool of connections
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach { record =>
          val sql = "INSERT INTO streaming_itemcount (item, count) VALUES ('" + record._1 + "'," + record._2 + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        }
        ConnectionPool.returnConnection(connection) // Return to the pool for future reuse
      }
    }

    /**
     * Calling start on StreamingContext actually starts JobScheduler's start method and its message loop. Inside
     * JobScheduler.start, JobGenerator and ReceiverTracker are constructed and their start methods are called:
     * 1. After starting, JobGenerator keeps generating a job every batchDuration.
     * 2. ReceiverTracker first starts the receivers in the Spark cluster (in fact, it starts ReceiverSupervisor on the
     *    executors). Data received by a receiver is stored on the executor via ReceiverSupervisor, and the metadata of
     *    that data is sent to ReceiverTracker on the driver; internally, ReceiverTracker manages the received metadata
     *    via ReceivedBlockTracker.
     * Each batchInterval produces a specific job. The job here is not the job referred to in Spark Core; it is only the
     * DAG of RDDs generated from the DStreamGraph and, from a Java perspective, is equivalent to an instance of the
     * Runnable interface. To run it, the job has to be submitted to JobScheduler, which uses a thread pool to pick a
     * separate thread that submits the job to the cluster (in fact, the RDD action inside that thread triggers the real
     * job run). Why use a thread pool?
     * 1. Jobs are generated continuously, so a thread pool is needed for efficiency; this is similar to executing tasks
     *    in an executor through a thread pool.
     * 2. Jobs may be scheduled with the FAIR scheduling mode, which also requires multi-threading support.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
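The program above assumes a ConnectionPool helper with getConnection and returnConnection methods, which is not shown in the original text. Below is a minimal, hypothetical sketch of what such a helper could look like; the JDBC driver class, URL, and credentials are placeholders, not values from the original program.

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// A hypothetical sketch of the ConnectionPool used above: a static, lazily initialized
// pool of JDBC connections shared by all tasks running inside one executor JVM.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else {
      // Placeholder JDBC settings; replace with the real driver, URL, and credentials.
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://Master:3306/streaming", "user", "password")
    }
  }

  def returnConnection(connection: Connection): Unit = pool.offer(connection)
}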

II: Spark Streaming from the perspective of its fault-tolerant architecture


We know that the relationship between DStream and RDD is that RDDs are constantly created as time passes, and an operation on a DStream is, at fixed time intervals, an operation on the underlying RDDs. So, in a sense, Spark Streaming's DStream-based fault tolerance is really fault tolerance applied to each individual RDD, which is part of the ingenuity of Spark Streaming's design.
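As a minimal illustration of this relationship (reusing the lines DStream from the program above), a DStream-level operation and an explicit per-batch RDD operation written with transform express the same computation:

// A DStream operation is simply applied to the RDD of every batch; these two lines are equivalent.
val upperCased1 = lines.map(_.toUpperCase)                         // DStream-level API
val upperCased2 = lines.transform(rdd => rdd.map(_.toUpperCase))   // explicit operation on each batch RDD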

As a resilient distributed dataset, the elasticity of an RDD is mainly reflected in:

1. Automatic switching between memory and disk storage, with memory used preferentially.

2. Lineage-based fault tolerance.

3. Failed tasks are retried a specified number of times (the limit is configurable; see the sketch after this list).

4. A failed stage is automatically retried.

5. Reuse of data through checkpoint and persist.

6. Elasticity of data scheduling: the DAG and tasks are decoupled from resource management.

7. High elasticity of data partitioning (partitions can be repartitioned or coalesced as needed).
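As a small sketch of item 3, the task-retry limit can be tuned through Spark configuration; spark.task.maxFailures is a standard Spark property with a default of 4, shown here only as an illustration:

import org.apache.spark.SparkConf

// Number of times a task may fail before the stage (and hence the job) is considered failed.
val retryConf = new SparkConf()
  .setAppName("ResilienceConfig")
  .set("spark.task.maxFailures", "4") // default is 4; raise it for flaky environments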


Based on these characteristics of RDDs, the fault-tolerance mechanism takes two main forms: one is checkpointing, the other is lineage-based fault tolerance. In general, Spark relies on lineage, because checkpointing large datasets is expensive. But in some cases, when the lineage chain becomes too complex and too long, it is better to make a checkpoint.
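A minimal sketch of this trade-off, assuming a SparkContext named sc and an HDFS checkpoint directory as placeholders: a long lineage chain can be truncated by checkpointing an intermediate RDD, so that recovery does not have to recompute the whole chain.

// Truncating a long lineage with a checkpoint (sc and the HDFS path are placeholders).
sc.setCheckpointDir("hdfs://Master:9000/checkpoint")

var data = sc.parallelize(1 to 1000000)
for (i <- 1 to 100) {          // builds a long, purely lineage-based chain of transformations
  data = data.map(_ + 1)
}

data.cache()        // persist first so the checkpoint does not trigger a full recomputation
data.checkpoint()   // lineage up to this point is cut; recovery reads the checkpointed data
data.count()        // the action that materializes both the cache and the checkpoint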


Considering RDD dependencies: within a stage the dependencies are narrow, so lineage-based fault tolerance is generally used there, which is convenient and efficient. Between stages there are wide dependencies, which produce shuffle operations, and there checkpointing works better. In summary, use lineage within a stage and checkpoint between stages.
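A small sketch of this point (again treating sc as a placeholder SparkContext): toDebugString prints the lineage with its shuffle boundaries, making the narrow dependencies inside a stage and the wide dependency introduced by reduceByKey visible.

// flatMap/map are narrow dependencies and stay inside one stage; reduceByKey introduces a
// wide dependency (shuffle), which is where checkpointing pays off the most.
val counts = sc.textFile("hdfs://Master:9000/input")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString) // prints the lineage, indented at each shuffle/stage boundary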


