Dream_spark Customization: Second Lesson

Source: Internet
Author: User

Spark Version Customization, Day 2: A Thorough Understanding of Spark Streaming Through Cases

Contents of this issue:

1. Decrypting the Spark Streaming runtime mechanism

2. Decrypting the Spark Streaming architecture

Any data that cannot be processed as a real-time stream loses its value. In the era of stream processing, Spark Streaming sits at the center of the Spark ecosystem, working together with Spark SQL and MLlib.

  At runtime, Spark Streaming is not so much a streaming framework on top of Spark Core as it is one of the most complex applications built on Spark Core. If you can master an application as complex as Spark Streaming, then other complex applications are a cinch. It is also a general trend to choose Spark Streaming as the starting point for customizing a Spark release.

We know that every step in Spark Core is based on RDDs, and that RDDs have dependencies between them. If an RDD DAG contains 3 Actions, it triggers 3 Jobs; RDDs depend on one another from the bottom up, and when an RDD produces a Job, that Job is concretely executed. From the DStream Graph we can see that the logic of DStream is basically consistent with that of RDDs; it simply adds a time dependency on top of RDDs. The RDD DAG can be called the spatial dimension, which means the whole of Spark Streaming adds one more dimension, time, and so can be described as spacetime.

From this perspective, Spark Streaming can be placed in a coordinate system, where the Y axis is the operations on RDDs (the RDD dependency chain forms the logic of the entire Job), and the X axis is time. As time passes, at each fixed interval (the Batch Interval) a Job instance is generated and run on the cluster.

For Spark Streaming, as data flows in from different sources, each fixed time interval yields a fixed data set or event collection (for example, from Flume or Kafka). This coincides exactly with RDDs being based on fixed data sets: in fact, the RDD Graph that the DStream generates at each fixed time interval is based on the data set of one particular Batch.
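The time-slicing idea above can be illustrated with a minimal Python sketch (not Spark's implementation; the event timestamps and the 2-second interval are made-up examples): every incoming event falls into the batch whose start time is the largest multiple of the Batch Interval not exceeding its timestamp.

```python
def slice_into_batches(events, batch_interval):
    """Group (timestamp, payload) events into fixed-interval batches."""
    batches = {}
    for ts, payload in events:
        # The event belongs to the batch starting at the largest
        # multiple of batch_interval that does not exceed ts.
        batch_start = (ts // batch_interval) * batch_interval
        batches.setdefault(batch_start, []).append(payload)
    return batches

events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = slice_into_batches(events, batch_interval=2)
# batch [0,2) -> ["a","b"], batch [2,4) -> ["c","d"], batch [4,6) -> ["e"]
```

Each value in `batches` plays the role of the fixed data set from which one Batch's RDD would be created.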

It can be seen that within every Batch, the RDD dependencies in the spatial dimension are the same; the difference is that each Batch's incoming data differs in size and content, so different RDD instances of the same dependency graph are produced. Thus the RDD Graph is born out of the DStream Graph; in other words, DStream is a template for RDDs, and different time intervals generate different RDD Graph instances.
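The "DStream as a template" idea can be sketched in plain Python (the transformation names mimic Spark's `flatMap`/`map` but the code is only an illustrative analogy): the same fixed chain of transformations is stamped onto each interval's data, producing a different concrete result per Batch.

```python
# The fixed "template": the same transformation chain for every Batch.
template = [
    ("flatMap", lambda batch: [w for line in batch for w in line.split()]),
    ("map",     lambda words: [(w, 1) for w in words]),
]

def instantiate(template, batch_data):
    """Apply the fixed transformation template to one batch's data."""
    result = batch_data
    for _name, fn in template:
        result = fn(result)
    return result

# Two intervals with different data, one template -> two distinct instances.
out_t1 = instantiate(template, ["spark streaming"])
out_t2 = instantiate(template, ["hello spark"])
```

The template (the spatial dimension) never changes; only the data fed into it at each time interval does.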

Looking at Spark Streaming itself, it needs:

  1. A template that generates the RDD DAG: the DStreamGraph

  2. Job control based on the timeline (a Job generator and scheduler)

  3. InputDStreams and OutputDStreams, which represent the input and output of the data

  4. The concrete Jobs to run on the Spark Cluster; because a streaming system runs continuously, fault tolerance is critical, as is whether the cluster can digest the incoming load

  5. Transaction processing: we want the data flowing in to be processed once and only once. How do we guarantee exactly-once transactional semantics in the event of a crash?
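Two of the pieces above, timeline-based job generation and exactly-once bookkeeping, can be sketched together in a few lines of Python. This is a hedged analogy, not Spark's actual JobGenerator: recording completed batch times (as a checkpoint would) means a restart after a crash does not process the same batch twice.

```python
def generate_jobs(start, end, interval, completed):
    """Yield batch times in [start, end) whose jobs have not yet run."""
    jobs = []
    t = start
    while t < end:
        if t not in completed:
            jobs.append(t)   # a "job" here is represented by its batch time
        t += interval
    return jobs

completed = {0, 2}                        # batches finished before a crash
jobs = generate_jobs(0, 8, 2, completed)  # only the unfinished batches
for t in jobs:
    completed.add(t)                      # mark each batch done exactly once
```

Re-running `generate_jobs` with the updated `completed` set would yield nothing, which is the essence of the exactly-once guarantee at the batch level.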

Interpreting DStream from the source code

As you can see, DStream is the heart of Spark Streaming, just as RDD is the core of Spark Core; like RDD, it has dependencies and a compute method. More critical is the following code (DStream's generatedRDDs member):

This is a HashMap with Time as the key and an RDD as the value. It is proof that, as time passes, RDDs are continually generated, Jobs depending on them are generated, and they are run on the cluster through the JobScheduler. Once again, DStream is the template for RDDs.
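A plain-Python analogy to that HashMap (the names `generated_rdds` and `get_or_compute` echo Spark's internals but this is only a sketch, not Spark code): a dict keyed by batch time caches the "RDD" generated for each interval, so asking for the same time twice never recomputes it.

```python
generated_rdds = {}   # time -> "RDD" (here just the batch's records)

def get_or_compute(time, compute):
    """Return the RDD for `time`, generating and caching it on first use."""
    if time not in generated_rdds:
        generated_rdds[time] = compute(time)
    return generated_rdds[time]

rdd_at_2 = get_or_compute(2, lambda t: [f"record@{t}"])
rdd_again = get_or_compute(2, lambda t: ["should not recompute"])
```

As time advances, new keys are added for new batches, which is exactly the "RDDs generated over time" relationship the text describes.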

   DStream can be said to be the logical level, while RDD is the physical level; everything a DStream expresses is ultimately achieved through transformations of RDDs. The former is the higher-level abstraction, the latter the underlying implementation. DStream is, in effect, the encapsulation of a collection of RDDs in the time dimension; the relationship between DStream and RDD is that RDDs are generated continually over time, and an operation on a DStream is an operation on the RDD of each fixed time interval.

Summary:

The business logic in the spatial dimension acts on the DStream. As time passes, each Batch Interval forms a specific data set, producing an RDD; Transform operations on that RDD in turn form the RDD dependency relationship, the RDD DAG, forming a Job. The JobScheduler then, based on the time schedule and the RDD dependencies, publishes the Job to the Spark Cluster to run, continually producing Spark Jobs.

Note:

Data source: dt_ Big Data Dream Factory (Spark release version customization)

For more exclusive content, please follow the public account: Dt_spark

If you are interested in big data and Spark, you can listen for free every night to teacher Liaoliang's permanently free public Spark class, at YY room number 68917580.

