Spark Streaming Lesson 2: A Thorough Understanding of Spark Streaming Through a Case Study (Part II)


This lesson demystifies Spark Streaming from the following two angles:

First, the Spark Streaming runtime mechanism

Second, the Spark Streaming architecture

At runtime, Spark Streaming behaves like an application built on top of Spark Core. When it starts, Spark Streaming launches many jobs: the job for each batch interval, jobs for windowed operations, and jobs the framework itself starts. For example, a receiver is started by its own job, which then serves other jobs, so a complex Spark program often has multiple jobs cooperating with each other. Spark Streaming is the most complex kind of Spark application; once you have it at your fingertips, other Spark applications pose no problem. Looking at the bigger picture, the Spark SQL, Spark Streaming, Spark ML, and Spark GraphX subframeworks are all developed on top of Spark Core, so if we want to understand Spark Core, Spark Streaming is the best entry point.

The Spark website shows the relationship between Spark Core and the other subframeworks:

After Spark Streaming starts, data flows in continuously through the input stream and is divided by time into different jobs, that is, batches of input data, where each job has its own sequence of RDD dependencies. Since each RDD depends on a batch of input data, different batches correspond to different RDD dependency chains, and hence to different jobs, which are then run by the Spark engine. The DStream is the logical level and the RDD is the physical level: a DStream is a collection that encapsulates the RDDs flowing through it over time, and every operation on a DStream ultimately turns into an operation on the RDDs inside it.
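As a minimal sketch of this idea (assuming a local run and a socket source on port 9999; the names here are illustrative, not from the original lesson), each DStream transformation below is replayed, batch by batch, as the corresponding RDD transformation:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Each DStream transformation is re-executed on every batch's RDD.
  val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamToRDD")
  val ssc = new StreamingContext(conf, Seconds(5))

  val lines  = ssc.socketTextStream("localhost", 9999)
  val words  = lines.flatMap(_.split(" "))   // per batch: rdd.flatMap
  val pairs  = words.map(word => (word, 1))  // per batch: rdd.map
  val counts = pairs.reduceByKey(_ + _)      // per batch: rdd.reduceByKey

  counts.foreachRDD { rdd =>
    // The logical DStream hands us the physical RDD of the current batch.
    rdd.take(10).foreach(println)
  }

  ssc.start()
  ssc.awaitTermination()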

Programming with Spark Core means programming against RDDs, which have dependencies among one another, as in the dependency graph on the right of the figure. At runtime, Spark Streaming keeps operating continuously with time as its dimension: the DAG of RDD dependencies is the spatial dimension, and the DStream adds a time dimension on top of the RDD, and together these constitute Spark Streaming's space-time dimensions.

Because Spark Streaming adds a time dimension on top of the RDD, at runtime you can clearly see JobScheduler, MapPartitionsRDD, ShuffledRDD, BlockManager, and so on, which are Spark Core content, alongside DStream, JobGenerator, SocketInputDStream, and so on, which are Spark Streaming content. This can be seen very clearly in the running process:

Let's now elaborate on the Spark Streaming runtime mechanism through these space-time dimensions.

Time dimension: Job objects are continuously generated at fixed time intervals and run on the cluster:

This includes the batch interval, window length, window slide interval, and so on.
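For illustration, a hedged sketch of how these three parameters appear in the API, reusing the pairs stream from the sketch above (the values are arbitrary; window length and slide interval must both be multiples of the batch interval):

  // Batch interval: fixed once, on the StreamingContext (5 seconds above).
  // Window length and slide interval: passed per windowed operation.
  val windowedCounts = pairs.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,  // reduce function
    Seconds(30),                // window length: the last 30s of data
    Seconds(10)                 // slide interval: recompute every 10s
  )
  windowedCounts.print()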

Spatial dimension: the steps of concrete processing logic described by the RDD dependencies, represented by the DStream:

1. Templates for generating the RDDs and their DAG

2. A job controller based on the timeline

3. InputStream and OutputStream, representing data input and output

4. The concrete jobs run on the Spark cluster; at this point system fault tolerance is very important (see the checkpoint sketch after this list)

5. Transaction processing, guaranteeing exactly-once semantics even in the event of a crash
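For points 4 and 5, checkpointing is the basic mechanism Spark Streaming offers; below is a minimal sketch, assuming a local run and a hypothetical checkpoint path (in production this would normally be an HDFS path):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Build a fresh context and register a checkpoint directory, so that
  // the DStream graph and generated metadata survive a driver failure.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Recoverable")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint")  // hypothetical path
    // ... define the DStream graph here ...
    ssc
  }

  // On restart, rebuild the context from checkpoint data if it exists,
  // otherwise create it from scratch.
  val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createContext _)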

As time passes, RDD graphs, i.e. DAGs, are continuously generated from the DStream graph and submitted to the Spark cluster by the job scheduler's thread pool.

The relationship between the RDD and the DStream is as follows:

1. The RDD is the physical level, while the DStream is the logical level;

2. The DStream is a template class wrapping the RDD, a further abstraction over the RDD;

3. The DStream relies on the RDD to carry out the concrete data computation, as the transform sketch below shows.
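To make point 3 concrete, here is a small sketch using transform (reusing the lines stream from the first sketch), which exposes the underlying RDD of each batch directly:

  // transform lets you write arbitrary RDD code; the DStream itself does
  // no computation, it only schedules this function against the RDD
  // generated for each batch interval.
  val cleaned = lines.transform { rdd =>
    rdd.filter(_.nonEmpty).map(_.toLowerCase)
  }
  cleaned.print()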

Spark Streaming Source Code Analysis

1. The start method of StreamingContext calls JobScheduler's start method:

  val ssc = new StreamingContext(conf, Seconds(5))
  val lines = ssc.socketTextStream("Master", 9999)
  // ... business processing code omitted
  ssc.start()
  ssc.awaitTermination()

Let's proceed to analyze the internals of JobScheduler's start method:

1. JobScheduler receives various messages through its onReceive method and stores them in the EventLoop message loop body (see the toy sketch after this list).

2. A RateController throttles the data flowing into Spark Streaming to enforce rate limits.

3. JobGenerator and ReceiverTracker are constructed inside JobScheduler's start method, and their respective start methods are then called.
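To illustrate the message loop pattern in point 1, here is a toy model of an event loop (my own simplification for illustration, not Spark's actual EventLoop class): a daemon thread drains a queue and dispatches each message to onReceive, the way JobScheduler processes its scheduler events:

  import java.util.concurrent.LinkedBlockingQueue

  // Toy event loop: post() enqueues a message; a dedicated thread takes
  // messages off the queue and hands them to onReceive, one at a time.
  abstract class ToyEventLoop[E](name: String) {
    private val queue = new LinkedBlockingQueue[E]()
    private val thread = new Thread(name) {
      override def run(): Unit =
        try { while (true) onReceive(queue.take()) }
        catch { case _: InterruptedException => () }  // stop() interrupts us
    }
    thread.setDaemon(true)

    def start(): Unit = thread.start()
    def stop(): Unit = thread.interrupt()
    def post(event: E): Unit = queue.put(event)
    protected def onReceive(event: E): Unit
  }

  // Usage sketch, with a hypothetical event type:
  case class JobStarted(jobId: Int)
  val loop = new ToyEventLoop[JobStarted]("JobScheduler") {
    override protected def onReceive(event: JobStarted): Unit =
      println(s"processing $event")
  }
  loop.start()
  loop.post(JobStarted(1))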

ReceiverTracker's start method:

1. ReceiverTracker creates a ReceiverTrackerEndpoint message loop body to receive the messages sent by receivers running on the executors (a toy sketch follows below).

2. After ReceiverTracker starts, it launches the receivers on executors across the Spark cluster.
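A toy sketch of that receiver-to-tracker protocol (message names modeled on Spark's but with simplified signatures; these are not the exact Spark definitions):

  // Messages a receiver on an executor sends to the tracker on the driver.
  sealed trait ReceiverTrackerMessage
  case class RegisterReceiver(streamId: Int, host: String) extends ReceiverTrackerMessage
  case class AddBlock(streamId: Int, blockId: String, numRecords: Long) extends ReceiverTrackerMessage
  case class DeregisterReceiver(streamId: Int, error: String) extends ReceiverTrackerMessage

  // The endpoint pattern: receive a message, update the tracker's state.
  def receive(msg: ReceiverTrackerMessage): Unit = msg match {
    case RegisterReceiver(id, host)  => println(s"receiver $id registered on $host")
    case AddBlock(id, block, n)      => println(s"receiver $id stored $block ($n records)")
    case DeregisterReceiver(id, err) => println(s"receiver $id stopped: $err")
  }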

JobGenerator's start method:

1. After startup, JobGenerator starts a timer that sends a GenerateJobs message at every batchInterval, as sketched below.
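As a toy model of that timer (a simplification for illustration, not Spark's actual RecurringTimer), a daemon thread fires a callback every batchInterval milliseconds, which is where a GenerateJobs(time)-style message would be posted:

  // Toy recurring timer: invokes callback(time) every periodMs milliseconds.
  class ToyRecurringTimer(periodMs: Long, callback: Long => Unit) {
    private val thread = new Thread {
      override def run(): Unit = {
        var nextTime = System.currentTimeMillis() + periodMs
        try {
          while (true) {
            val sleepFor = nextTime - System.currentTimeMillis()
            if (sleepFor > 0) Thread.sleep(sleepFor)
            callback(nextTime)  // e.g. post a GenerateJobs(nextTime) message
            nextTime += periodMs
          }
        } catch { case _: InterruptedException => () }
      }
    }
    thread.setDaemon(true)
    def start(): Unit = thread.start()
    def stop(): Unit = thread.interrupt()
  }

  // Usage sketch: fire every 5000 ms, like a 5-second batchInterval.
  new ToyRecurringTimer(5000, t => println(s"GenerateJobs($t)")).start()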

