Spark Custom Version Course, Part 3: A Thorough Understanding of Spark Streaming Through a Case Study

Source: Internet
Author: User

The content of this lecture:

A. The Spark Streaming job architecture and operating mechanism
B. The Spark Streaming job fault-tolerant architecture and operating mechanism

Note: This lecture is based on Spark 1.6.1 (the latest version of Spark as of May 2016).

Review of the previous section:

In the last lesson we saw that Spark Streaming programming is based on DStream. The DStream is the logical level and the RDD is the physical level: a DStream is a collection that encapsulates RDDs flowing in time, and every operation on a DStream is ultimately an operation on its RDDs.

If you place Spark Streaming in a coordinate system, with the y-axis representing operations on RDDs (the RDD dependency chain forms the logic of the entire job) and the x-axis representing time, then as time passes a job instance is generated at each fixed interval (the batch interval) and run in the cluster.

We also summarized the five core features of Spark Streaming: (1) logic management, (2) time management, (3) streaming input and output, (4) high fault tolerance, and (5) transaction processing. Finally, we analyzed these further against the Spark Streaming source code.

**Lecture**

As we saw in the previous session, a job instance is generated at each fixed time interval (the batch interval). So, in Spark Streaming, which consists of a time dimension and a spatial dimension, what are the job's architecture and operating mechanism, and what are its fault-tolerant architecture and operating mechanism?

Let us start from Einstein's theory of relativity:

A, time and space are closely related as a unity, also known as the space-time continuum;
B, time and space are relative: different observers may measure different times, lengths, and masses;
C, for two causally unconnected events there is no absolute ordering, but causality does determine the order of connected events. For example, a job instance is generated and then run in the cluster; the generation of the job instance must therefore occur before its run in the cluster.

That is to say, the generation of jobs is not tied to a single one-way flow of time; time here is just an illusion.

How can we understand this sentence better? We will answer by stepping through the following aspects.

What are the Spark Streaming job architecture and operating mechanism?

For a typical Spark application, an action on an RDD triggers a job. So how are jobs triggered in Spark Streaming? When we write a Spark Streaming program, we set the batchDuration, and a job is automatically triggered every batchDuration. That is, the Spark Streaming framework provides a timer: as soon as the interval elapses, the program is submitted to Spark and run as a Spark job.

Examining the job architecture and running mechanism through a case

The case code is as follows:
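A minimal sketch of a typical case of this kind — a socket word count, assuming input from a netcat server on port 9999 (the course's exact listing may differ):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Batch interval of 5 seconds: a new job instance is generated every 5s
    val ssc = new StreamingContext(conf, Seconds(5))

    // Socket source; run e.g. `nc -lk 9999` on the host to feed it
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print() // output operation: triggers job generation each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```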


Package the above code into a jar and upload it to the cluster to run.
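A typical submission might look like the following (master URL, class name, and jar name are placeholders):

```
spark-submit --class StreamingWordCount \
  --master spark://master:7077 \
  streaming-wordcount.jar
```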



The results of running in the cluster are as follows:

The general diagram of the running process is as follows:

Detailed case analysis

A, first, the start method of StreamingContext is called, which internally invokes the start method of JobScheduler, launching the message loop;

(StreamingContext.scala, line 610)

(JobScheduler.scala, line 83)
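That message loop is an EventLoop of JobScheduler events: a daemon thread draining an event queue. A minimal self-contained analogue of its shape (not the verbatim Spark source):

```scala
import java.util.concurrent.LinkedBlockingQueue

// Simplified analogue of the EventLoop started in JobScheduler.start:
// a daemon thread that takes events off a queue and handles them.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit = while (true) onReceive(queue.take())
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()          // begin processing events
  def post(event: E): Unit = queue.put(event) // enqueue an event for the loop
  protected def onReceive(event: E): Unit     // handler supplied by subclass
}
```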

B, JobGenerator and ReceiverTracker are constructed inside JobScheduler's start method;

(JobScheduler.scala, lines 82-83)

C, then the start methods of JobGenerator and ReceiverTracker are called, which do the following:

(JobScheduler.scala, lines 79 and 98)

(ReceiverTracker.scala, lines 149 and 157)

    1. After JobGenerator starts, it continuously generates jobs according to the batchDuration;

(JobScheduler.scala, line 208)
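Under the hood this is a recurring timer: once per batch interval it posts a "generate jobs" event carrying the batch time. A minimal self-contained analogue of that pattern (hypothetical names, not the Spark source itself):

```scala
import java.util.{Timer, TimerTask}

// Analogue of JobGenerator's recurring timer: every batchIntervalMs it
// invokes the callback with the current time, like posting a GenerateJobs event.
class BatchTimer(batchIntervalMs: Long, onBatch: Long => Unit) {
  private val timer = new Timer("batch-timer", true) // daemon timer thread
  def start(): Unit = timer.scheduleAtFixedRate(new TimerTask {
    override def run(): Unit = onBatch(System.currentTimeMillis())
  }, batchIntervalMs, batchIntervalMs)
  def stop(): Unit = timer.cancel()
}
```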

    2. ReceiverTracker has two main functions:

1. It manages the running of receivers: when ReceiverTracker starts, its launchReceivers() method is called, which in turn uses RPC communication to start the receivers on the executors (in the actual code there is a wrapping layer around each receiver, the ReceiverSupervisor, for high availability);

(ReceiverTracker.scala, line 423)

2. It manages the receivers' metadata so that jobs can index the data; the core structure for this metadata is the ReceivedBlockTracker.

(ReceiverTracker.scala, lines 106-112)
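Conceptually, ReceivedBlockTracker keeps the metadata of blocks that have been received but not yet assigned to a batch, and allocates them to a batch when that batch's job is generated. A minimal, hypothetical sketch of that bookkeeping (simplified types, not the Spark source):

```scala
import scala.collection.mutable

// Hypothetical, simplified analogue of ReceivedBlockTracker's bookkeeping.
case class BlockMeta(streamId: Int, blockId: String, numRecords: Long)

class BlockTrackerSketch {
  private val unallocated = mutable.Queue[BlockMeta]()
  private val allocatedByBatch = mutable.Map[Long, Seq[BlockMeta]]()

  // A receiver reported a stored block's metadata to the driver.
  def addBlock(meta: BlockMeta): Unit = synchronized { unallocated.enqueue(meta) }

  // At each batch time, assign all pending blocks to that batch's job.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    allocatedByBatch(batchTime) = unallocated.dequeueAll(_ => true).toList
  }

  // The generated job uses this to index the data it should process.
  def blocksOf(batchTime: Long): Seq[BlockMeta] = synchronized {
    allocatedByBatch.getOrElse(batchTime, Nil)
  }
}
```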

D, after a receiver receives data, it stores the data through the ReceiverSupervisor into the executor's BlockManager;

E, at the same time, the metadata of the received data is sent to the ReceiverTracker on the driver, which manages it through the ReceivedBlockTracker;

This involves two different concepts of a "job":

Each batch interval produces a specific job. The job here is actually not the job referred to in Spark Core; it is only the DAG of RDDs generated from the DStreamGraph, which from a Java perspective is equivalent to an instance of the Runnable interface. To actually run, the job must be submitted to JobScheduler, which uses a thread pool to find a separate thread and submit the job to the cluster (in fact, it is the RDD action executed inside that thread that triggers the real Spark job).
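In code terms, then, the streaming "job" is just a Runnable whose run method triggers the RDD action, handed to a fixed thread pool. A minimal analogue of that submission path (hypothetical names; in Spark the pool size comes from spark.streaming.concurrentJobs, default 1):

```scala
import java.util.concurrent.Executors

object JobSubmissionSketch {
  // Analogue of JobScheduler's job executor pool (default size 1).
  private val jobExecutor = Executors.newFixedThreadPool(1)

  def submit(job: Runnable): Unit = jobExecutor.execute(job)

  def main(args: Array[String]): Unit = {
    submit(new Runnable {
      // In Spark Streaming, run() would invoke the RDD action for this batch.
      override def run(): Unit = println("running the batch's RDD action")
    })
    jobExecutor.shutdown()
  }
}
```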

Why use a thread pool?

A, jobs are constantly being generated, so for efficiency we need a thread pool; this is similar to executing tasks through a thread pool inside an executor;
B, it is possible to set the FAIR scheduling mode for jobs, which also requires multi-threading support (see the configuration sketch below);
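Point B can be enabled through configuration; a minimal sketch (the property names are real Spark settings, the values are illustrative):

```scala
import org.apache.spark.SparkConf

// FAIR scheduling lets jobs submitted from different threads share resources;
// spark.streaming.concurrentJobs (default 1) sizes the streaming job pool.
val conf = new SparkConf()
  .setAppName("FairStreamingApp") // hypothetical name
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.streaming.concurrentJobs", "2") // assumption: tune with care
```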

The Spark Streaming job fault-tolerant architecture and operating mechanism

Spark Streaming's fault-tolerance mechanism is based on the DStream. A DStream creates RDDs as time flows, which means the DStream operates on RDDs at fixed times, and fault tolerance is therefore applied to each of the RDDs formed at those times.

The fault tolerance of Spark Streaming includes both executor-side and driver-side mechanisms:

A, executor fault tolerance:

1. Data reception: replicated (multi-copy) storage, plus WAL mode, which writes a log before saving data on the executor (see the configuration sketch after this list);

2. Task execution safety: jobs rely on RDD fault tolerance;

B, driver fault tolerance: checkpointing (also shown in the sketch below).
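A minimal sketch combining both sides, assuming an HDFS checkpoint directory (the path and app name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///streaming/checkpoint" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("FaultTolerantStreaming")
    // Executor side: receivers write a WAL entry before storing the data.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir) // driver side: persist metadata for recovery
  // ... define the DStream operations here ...
  ssc
}

// On (re)start, recover the driver's state from the checkpoint if it exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```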

Based on the characteristics of the RDD, its fault-tolerance mechanisms are mainly of two kinds:

1. Checkpoint-based fault tolerance:

Between stages there are wide dependencies, which produce shuffle operations; when the lineage chain becomes too complex and lengthy, a checkpoint is needed.

2. Lineage-based fault tolerance:

In general, Spark prefers lineage-based fault tolerance, because checkpointing large datasets is expensive.

Considering RDD dependencies, each stage is internally narrow-dependent, so fault tolerance within a stage is generally based on lineage, which is convenient and efficient.

Summary: use lineage within a stage and checkpoint between stages.
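A minimal usage sketch of both mechanisms on the RDD side (the paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))
sc.setCheckpointDir("hdfs:///rdd/checkpoint") // hypothetical path

val words = sc.textFile("hdfs:///input/logs") // hypothetical input
  .flatMap(_.split(" "))                      // narrow dependency: lineage recovery
val counts = words.map((_, 1)).reduceByKey(_ + _) // wide dependency: shuffle boundary
counts.checkpoint() // truncate the long lineage at the stage boundary
counts.count()      // action: materializes counts and writes the checkpoint
```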
