Spark Streaming Lesson 2: A Thorough Understanding of Spark Streaming Through a Case Study (Part II)

Source: Internet
Author: User

This lesson demystifies Spark Streaming from the following two aspects:

First, demystifying the Spark Streaming runtime mechanism

Second, demystifying the Spark Streaming architecture

At runtime, Spark Streaming is more like an application built on top of Spark Core: once started, it launches many jobs, including the job generated for each batch interval, window-based jobs, and jobs started by the framework itself. For example, a receiver is started as a job, and that job in turn serves other jobs, so a complex Spark program often consists of multiple jobs cooperating with one another. Spark Streaming is among the most complex Spark applications; once you have Spark Streaming at your fingertips, other Spark applications will pose no problem. Looking at the official website: the Spark SQL, Spark Streaming, Spark ML, and Spark GraphX sub-frameworks were all developed later on top of Spark Core, so if we want to see into Spark Core, Spark Streaming is the best entry point.

Visit the Spark website to see the relationship between Spark Core and the other sub-frameworks:

After Spark Streaming starts, data flows in continuously through the input stream and is divided into different jobs by time, that is, into batches of input data; each job has its own sequence of RDD dependencies. Since the RDDs depend on the input data, each batch yields a different RDD dependency chain, and hence a different job, which runs on the Spark engine. DStream is the logical level and RDD is the physical level: a DStream is a collection that encapsulates the RDDs flowing through it over time, and an operation on a DStream ultimately turns into operations on the RDDs inside it.
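To make this concrete, here is a minimal runnable sketch (the local master, host, port, and word-splitting logic are illustrative assumptions, not from the original article): operations declared once on the DStream are applied, batch by batch, to the RDD generated for each interval, and foreachRDD exposes those underlying RDDs directly.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamToRddDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Logical level: a DStream of text lines arriving over time.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Declared once on the DStream, applied to every batch's RDD.
    val words = lines.flatMap(_.split(" "))

    // Physical level: foreachRDD exposes one RDD per 5-second batch.
    words.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} words")
    }

    ssc.start()
    ssc.awaitTermination()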

When we program with Spark Core, we program against RDDs, and RDDs have dependencies, as in the dependency graph on the right; at runtime, Spark Streaming operates continuously with time as its extra dimension. The DAG of RDD dependencies is the spatial dimension, and DStream adds a time dimension on top of the RDDs; together these constitute Spark Streaming's space-time dimensions.

With the time dimension added on top of RDDs, at runtime you can clearly see JobScheduler, MapPartitionsRDD, ShuffledRDD, BlockManager, and so on, which are Spark Core content, while DStream, JobGenerator, SocketInputDStream, and so on are Spark Streaming content, as the running process shows very clearly:

Now let us elaborate the Spark Streaming runtime mechanism through its space-time dimensions.

Time dimension: Job objects are continuously generated at fixed intervals and run on the cluster:

This includes the batch interval, window length, window sliding interval, etc. (a window sketch follows the spatial-dimension list below).

Spatial dimension: the concrete processing logic of the RDD dependencies, represented by DStream:

1. Generating the RDDs and their DAG requires templates, which DStream provides

2. A timeline-based Job controller

3. InputDStream and OutputDStream representing data input and output

4. The concrete jobs run on the Spark cluster; at this point system fault tolerance is crucial

5. Transaction processing, guaranteeing exactly-once transactional semantic consistency even in the event of a crash
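As mentioned under the time dimension above, here is a minimal sketch of how batch interval, window length, and sliding interval relate (the 5s/30s/10s durations and the word-count logic are illustrative assumptions; note that window length and slide must be multiples of the batch interval):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval of 5s: a new batch (and job) is generated every 5 seconds.
    val conf = new SparkConf().setAppName("WindowDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Window length 30s, sliding interval 10s: every 10 seconds a job runs
    // over the RDDs of the last 6 batches (30s / 5s per batch).
    val windowedCounts =
      pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    windowedCounts.print()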

As time goes by, the RDD graph, i.e. the DAG, is generated from the DStream graph for each batch interval and submitted to the Spark cluster by the job scheduler's thread pool.
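A simplified paraphrase of this step, modeled on DStreamGraph.generateJobs in the Spark 1.x source (abridged, not verbatim; details vary by version):

    // Abridged paraphrase of DStreamGraph.generateJobs (Spark 1.x): for each
    // batch time, every output DStream turns its RDD template into a concrete
    // Job; the resulting jobs are then handed to the JobScheduler's thread
    // pool for submission to the cluster.
    def generateJobs(time: Time): Seq[Job] = synchronized {
      outputStreams.flatMap { outputStream =>
        outputStream.generateJob(time) // materializes the RDD for `time`
      }
    }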

The relationship between RDD and DStream is as follows (a short sketch follows the list):

1. RDD is the physical level, while DStream is the logical level;

2. DStream is a template class wrapping RDDs, a further abstraction over RDD;

3. DStream relies on RDDs to carry out the concrete data computation;
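A short sketch of point 3, continuing the first example above (reusing its lines DStream; the ERROR filter is an illustrative assumption): transform hands you the underlying RDD of each batch, so the DStream-level computation is literally expressed as RDD operations.

    // DStream.transform exposes each batch's underlying RDD, so the logical
    // DStream computation bottoms out in physical RDD operations.
    val errors = lines.transform { rdd =>
      rdd.filter(_.contains("ERROR")) // a plain RDD operation, run per batch
    }
    errors.print()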

Spark Streaming Source Code Analysis

1. StreamingContext's start method calls JobScheduler's start method:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingDemo")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval
    val lines = ssc.socketTextStream("Master", 9999)
    // ... business processing code omitted
    ssc.start() // internally calls JobScheduler.start()
    ssc.awaitTermination()

We proceed to analyze the internals of JobScheduler's start method (a simplified sketch follows the list below):

1. JobScheduler receives various messages through the onReceive method of an EventLoop, the message loop body into which the messages are posted.

2. A RateController rate-limits the data flowing into Spark Streaming.

3. JobGenerator and ReceiverTracker are constructed inside JobScheduler's start method, and their respective start methods are then called.
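A simplified, abridged sketch of JobScheduler.start, modeled on the Spark 1.x source (not verbatim; details vary by version):

    // Abridged paraphrase of JobScheduler.start (Spark 1.x).
    def start(): Unit = synchronized {
      if (eventLoop != null) return // already started

      // 1. The EventLoop message loop body that receives JobSchedulerEvents.
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit =
          processEvent(event)
        override protected def onError(e: Throwable): Unit =
          reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      // 2. Register each input stream's RateController as a streaming
      //    listener so it can throttle the ingestion rate.
      for {
        inputDStream   <- ssc.graph.getInputStreams
        rateController <- inputDStream.rateController
      } ssc.addStreamingListener(rateController)

      // 3. Construct and start ReceiverTracker, then start JobGenerator.
      receiverTracker = new ReceiverTracker(ssc)
      receiverTracker.start()
      jobGenerator.start()
    }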

ReceiverTracker's start method (a simplified sketch follows the list below):

1. ReceiverTracker creates a ReceiverTrackerEndpoint, a message loop body, to receive messages sent by the receivers running on the executors.

2. After ReceiverTracker starts, it launches the receivers on the executors across the Spark cluster.
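A simplified, abridged sketch of ReceiverTracker.start, modeled on the Spark 1.x source (not verbatim):

    // Abridged paraphrase of ReceiverTracker.start (Spark 1.x).
    def start(): Unit = synchronized {
      if (!receiverInputStreams.isEmpty) {
        // The RPC endpoint that receives messages from receivers on executors.
        endpoint = ssc.env.rpcEnv.setupEndpoint(
          "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
        // Submit the receivers to run as long-lived tasks on the executors.
        launchReceivers()
        trackerState = Started
      }
    }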

JobGenerator's start method:

1. After starting, JobGenerator starts a timer that sends GenerateJobs messages at each batch-interval tick, as sketched below.
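A simplified sketch of that timer, modeled on the Spark 1.x JobGenerator source (abridged, not verbatim):

    // A recurring timer fires once per batch interval and posts a
    // GenerateJobs event into the JobGenerator's own event loop.
    private val timer = new RecurringTimer(
      clock,
      ssc.graph.batchDuration.milliseconds,
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))),
      "JobGenerator")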

