2. Spark Streaming operating mechanism and architecture


1. Decrypting the Spark Streaming operating mechanism

In the last lesson we compared the technology industry to dragon-vein hunting in feng shui: every field has its own dragon vein, Spark is where the dragon vein lies, and its dragon cave, the key acupuncture point, is Spark Streaming. That is one conclusion we established very clearly last time. In that lesson we also adopted a "dimensionality reduction" approach, which here means enlarging the time dimension, that is, making the batch interval longer, while running a hands-on Spark Streaming demo; the result was that within a specific time period it was indeed specific RDDs doing the work. Building on that, this lesson talks about Spark Streaming's operating mechanism and concrete architecture.

Spark Streaming involves two levels. The first revolves around each batch of your application: a batch often runs multiple jobs, so what we experience feels more like a complex application than a single job. That is the point we made about Spark Streaming in the last lesson.

Let's take a look at the following image from the Spark website (http://spark.apache.org). Apache Spark here means Spark Core; when Spark was first released it did not yet carry the Apache name. The sub-frameworks above Spark Core were developed gradually on top of it. This apparent digression is actually meaningful, because we can use the upper-level frameworks to gain insight into the mechanics of Spark's internals. In the last lesson we also explained why we customize the Spark source code starting from Spark Streaming rather than from the other frameworks, so we will go through Spark Streaming to get a thorough look at Spark.

We used to program against Spark Core, and more recently against DataFrame and Dataset (the main programming interfaces of Spark 2.x); in the end the whole of Spark boils down to the Spark Core processing flow.

Let's look at the picture below. The RDD graph on the right shows that RDD A depends on RDD M, RDD M in turn depends on RDD J, and RDD J depends on RDD B. Here A is produced by a Spark action, so from an operational point of view we see 3 Spark jobs. Viewed from the DStream level, these 3 jobs are not executed; viewed from the RDD level, because an action is involved, they are executed concretely. Why do we say this? First we look at the RDD DAG, and Spark Streaming then adds a time dimension on top of the RDD DAG.

Looking back at the Spark Streaming program we ran in yesterday's class, we saw that besides Spark Streaming's own framework classes such as scheduler.JobScheduler and dstream.SocketInputDStream, it also used the various RDD types of Spark Core, such as rdd.MapPartitionsRDD, rdd.PartitionerAwareUnionRDD, rdd.ShuffledRDD and rdd.BlockRDD. The program itself is not very different from the programs we wrote before, but it keeps running, and it keeps running because it loops continuously, and the basis of the loop is the time dimension. In other words, a DStream is an RDD abstraction with a time dimension added, while the RDD DAG dependency is what we call the spatial dimension, so the whole of Spark Streaming is a space-and-time construct.
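To make this concrete, here is a minimal sketch of that kind of program: a socket-based word count with a 1-second batch interval. The host, port and application name are placeholders, not the exact values used in class.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // The batch interval (1 second here) is the time dimension discussed above.
    val ssc = new StreamingContext(conf, Seconds(1))

    // socketTextStream is backed by a SocketInputDStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // print() is an output operation; every batch interval it triggers a Spark job.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}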

Let's take a look at the diagram below. Data flows in continuously as the input data stream, so Spark Streaming divides the incoming data into different batches based on time. Each batch has a corresponding RDD dependency graph, and each RDD dependency graph has its own input data, so a batch can be seen as a data set with its own RDD dependencies, and each batch becomes a job; the Spark engine then produces one result after another.

Continuing with the bottom part of the diagram: the operations defined on the RDD spatial dimension are applied to the RDDs of time 1, time 2, time 3 and time 4. This is what makes Spark Streaming so powerful: job generation is based only on time and is decoupled from all the other logic and schema, so each interval simply turns the same RDD logic into a Spark Streaming job.

2. Decrypting the Spark Streaming architecture

For the space-time dimensions, imagine a coordinate system with an x-axis and a y-axis. The y-axis is the operation on the RDDs, the RDD dependencies that form the entire job's logic; the x-axis is time. As time passes, say every 1 second, a job instance is generated, and because everything in space-time is in motion, the job generated in that second runs in the cluster as it moves along. We could even dig into the philosophical level of what is contained in all this, if we look at it in a contemplative frame of mind.

We can use a diagram to make the space-time coordinates just mentioned concrete. The diagram contains 4 RDD Graphs (sometimes also called RDD DAGs) representing the passage of time: based on the DStream Graph, Spark Streaming constantly generates RDD Graphs, that is, DAGs, produces jobs from them, and submits them through the JobScheduler's thread pool to the Spark cluster for execution. Data keeps flowing in, and Spark Streaming keeps producing jobs while accumulating data; data is accumulated per Batch Interval, for example 1 second. Within that 1 second there will usually be many events or records; if we ingest through Flume or Kafka we will notice a large number of events. Taking Flume as an example, if an event arrives, say, every 100 milliseconds, then 1 second accumulates 10 events, and each event carries data, so together they make up a collection of data. RDD processing is based on a fixed data set, so as time passes events accumulate: the 10 events collected in 1 second form the data set from which that second's RDD is generated. Some Spark jobs can run at the millisecond level, although Spark's scheduling mechanism means that in a production environment most jobs take longer than that; there are ways to keep a job's latency within that range. Because the interval is fixed, the RDD produced in each time interval is fixed, and each RDD DAG is based on the data of one batch in that interval. Although each time interval in the figure contains 2 DStreams (A and B), the data captured in that interval all belongs to that batch. Why can the DAG dependencies in the figure produce 3 jobs? Because they are RDD DAG dependencies, and the RDD DAG does not limit how many jobs handle the data of a given interval. One batch corresponds to one 1-second interval, and there are 4 intervals in the figure, so there are 4 batches and 4 RDD DAGs.
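The one-RDD-per-batch-interval relationship can also be seen directly in user code. The sketch below assumes counts is the pair DStream from the word-count sketch earlier; foreachRDD is a standard output operation that hands you the batch's RDD together with its batch time.

counts.foreachRDD { (rdd, time) =>
  // rdd holds exactly the data accumulated in this batch interval;
  // time identifies the batch, spaced 1 second apart for Seconds(1).
  println(s"Batch at $time produced ${rdd.count()} records")
}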

Note: As time goes on, Spark Streaming keeps processing even when there is no input data; in that case there is simply no valid output.

Let's take a look at some key points of the architecture from the Spark Streaming perspective:

1) A template for generating the required RDD DAG: DStreamGraph

Since Spark Streaming is built on Spark RDDs, there must be something that represents the processing logic of the RDD DAG, i.e. the spatial dimension.

2) Timeline-based job control is required.

3) InputDStreams and OutputDStreams are required to represent the input and output of the data.

DStreamGraph.scala:

private val inputStreams = new ArrayBuffer[InputDStream[_]]()
private val outputStreams = new ArrayBuffer[DStream[_]]()

InputDStream.scala:

abstract class InputDStream[T: ClassTag](ssc_ : StreamingContext)
  extends DStream[T](ssc_) {

4) The specific jobs run on top of the Spark cluster, and at that point system fault tolerance becomes critical.

- Spark Streaming can throttle the input flow when traffic is too heavy, and can dynamically adjust resources such as CPU and memory (see the configuration sketch below).
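As a sketch of how throttling is usually switched on, the configuration properties below are standard Spark Streaming settings; the concrete values are examples only.

val conf = new SparkConf()
  .setAppName("ThrottledStreaming")
  // Let Spark Streaming adapt the ingestion rate to the current processing rate.
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard upper bound on records per second per receiver.
  .set("spark.streaming.receiver.maxRate", "10000")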

5) Transaction processing is required.

- We want the incoming data to be processed, and processed only once; the question is how to guarantee exactly-once transactional semantics in the event of a crash (see the sketch below).
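A minimal sketch of the usual building block for crash recovery: checkpointing plus StreamingContext.getOrCreate, which rebuilds the DStream lineage from the checkpoint after a driver failure. The checkpoint directory is a placeholder, and exactly-once output still depends on the sink being idempotent or transactional.

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("ExactlyOnceSketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// On a clean start this calls createContext(); on restart it recovers from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()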

Before we finish this class, let's take a preliminary look at DStream.

From the code in the first 2 red boxes, DStream is the core of Spark Streaming, just as RDD is the core of Spark Core; like RDD it has dependencies and a compute method.

The code in the 3rd red box shows that generatedRDDs is a HashMap with a Time as the key and an RDD as the value. So DStream is a template for RDDs: DStream is the logical level, RDD is the physical level, and what a DStream expresses is ultimately turned into RDDs to be realized. The former is the higher-level abstraction, the latter the underlying implementation. A DStream is in fact an encapsulation of a collection of RDDs over the time dimension; the relationship between DStream and RDD is the continual creation of RDDs as time passes, and operating on a DStream means operating on the RDDs at fixed times.
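For reference, the members referred to above look roughly like this in DStream.scala (abridged; the exact form varies slightly between Spark versions):

abstract class DStream[T: ClassTag](@transient private[streaming] var ssc: StreamingContext)
  extends Serializable with Logging {

  /** List of parent DStreams on which this DStream depends */
  def dependencies: List[DStream[_]]

  /** Method that generates an RDD for the given time */
  def compute(validTime: Time): Option[RDD[T]]

  // RDDs generated so far, keyed by batch time
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
}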

DStream has a lot of sub-classes.

Conversions between DStreams are conversions between these subclasses; in fact they are conversions of the underlying RDDs, which then form the dependencies of a job, and the job is run on the cluster through the JobScheduler.
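One way to see that a DStream operation is really an RDD operation per batch is the transform API, which hands you the batch's RDD directly. A small sketch, assuming lines is the DStream[String] from the word-count sketch earlier; the two lines are equivalent:

val upper1 = lines.map(_.toUpperCase)                        // DStream-level API
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))  // per-batch RDD-level API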

Summary:

The business logic on the spatial dimension acts on the DStream. As time goes by, each batch interval forms a specific data set, from which an RDD is generated; the transformations on that RDD form the RDD dependencies of the RDD DAG, and that forms a job. The JobScheduler then, based on the time schedule and the RDD dependencies, publishes the job to the Spark cluster to run, and Spark jobs are generated continuously in this way.

Note: This post comes from the Spark distribution (release version) customization course.
