Dream_spark Customization: Second Lesson

Source: Internet
Author: User

Spark Version Customization, Day 2: A Thorough Understanding of Spark Streaming Through Cases

Contents of this issue:

1. Decrypting the Spark Streaming runtime mechanism

2. Decrypting the Spark Streaming architecture

Any data that cannot be processed as a real-time stream loses its value. In the era of stream processing, Spark Streaming sits at the center of the Spark ecosystem, working together with Spark SQL and MLlib.

  At runtime, Spark Streaming is not so much a streaming framework on top of Spark Core as it is one of the most complex applications built on Spark Core. If you can master an application as complex as Spark Streaming, then other complex applications are a cinch. It is also a general trend to choose Spark Streaming as the starting point for customizing a Spark release.

We know that every step in Spark Core is based on RDDs, and that RDDs have dependencies between them. If an RDD DAG contains 3 Actions, it triggers 3 Jobs; RDDs depend on one another from the bottom up, and when an RDD produces a Job, that Job is concretely executed. From the DStream Graph we can see that the logic of DStream is basically consistent with that of RDDs; it simply adds a time dependency on top of RDDs. The RDD DAG can be called the spatial dimension, which means the whole of Spark Streaming adds one more dimension, time, and so can be described as spacetime.

From this perspective, Spark Streaming can be placed in a coordinate system, where the Y axis is the operations on RDDs (the RDD dependency chain forms the logic of the entire Job), and the X axis is time. As time passes, at each fixed interval (the Batch Interval) a Job instance is generated and run on the cluster.

For Spark Streaming, as data flows in from different sources, each fixed time interval yields a fixed data set or event collection (for example, from Flume or Kafka). This coincides exactly with RDDs being based on fixed data sets: in fact, the RDD Graph that the DStream generates at each fixed time interval is based on the data set of one particular Batch.
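The time-slicing idea above can be illustrated with a minimal Python sketch (not Spark's implementation; the event timestamps and the 2-second interval are made-up examples): every incoming event falls into the batch whose start time is the largest multiple of the Batch Interval not exceeding its timestamp.

```python
def slice_into_batches(events, batch_interval):
    """Group (timestamp, payload) events into fixed-interval batches."""
    batches = {}
    for ts, payload in events:
        # The event belongs to the batch starting at the largest
        # multiple of batch_interval that does not exceed ts.
        batch_start = (ts // batch_interval) * batch_interval
        batches.setdefault(batch_start, []).append(payload)
    return batches

events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]
batches = slice_into_batches(events, batch_interval=2)
# batch [0,2) -> ["a","b"], batch [2,4) -> ["c","d"], batch [4,6) -> ["e"]
```

Each value in `batches` plays the role of the fixed data set from which one Batch's RDD would be created.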

It can be seen that within every Batch, the RDD dependencies in the spatial dimension are the same; the difference is that each Batch's incoming data differs in size and content, so different RDD instances of the same dependency graph are produced. Thus the RDD Graph is born out of the DStream Graph; in other words, DStream is a template for RDDs, and different time intervals generate different RDD Graph instances.
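The "DStream as a template" idea can be sketched in plain Python (the transformation names mimic Spark's `flatMap`/`map` but the code is only an illustrative analogy): the same fixed chain of transformations is stamped onto each interval's data, producing a different concrete result per Batch.

```python
# The fixed "template": the same transformation chain for every Batch.
template = [
    ("flatMap", lambda batch: [w for line in batch for w in line.split()]),
    ("map",     lambda words: [(w, 1) for w in words]),
]

def instantiate(template, batch_data):
    """Apply the fixed transformation template to one batch's data."""
    result = batch_data
    for _name, fn in template:
        result = fn(result)
    return result

# Two intervals with different data, one template -> two distinct instances.
out_t1 = instantiate(template, ["spark streaming"])
out_t2 = instantiate(template, ["hello spark"])
```

The template (the spatial dimension) never changes; only the data fed into it at each time interval does.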

Looking at Spark Streaming itself, it needs:

  1. A template that generates the RDD DAG: the DStreamGraph

  2. Job control based on the timeline (a Job generator and scheduler)

  3. InputDStreams and OutputDStreams, which represent the input and output of the data

  4. The concrete Jobs to run on the Spark Cluster; because a streaming system runs continuously, fault tolerance is critical, as is whether the cluster can digest the incoming load

  5. Transaction processing: we want the data flowing in to be processed once and only once. How do we guarantee exactly-once transactional semantics in the event of a crash?
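Two of the pieces above, timeline-based job generation and exactly-once bookkeeping, can be sketched together in a few lines of Python. This is a hedged analogy, not Spark's actual JobGenerator: recording completed batch times (as a checkpoint would) means a restart after a crash does not process the same batch twice.

```python
def generate_jobs(start, end, interval, completed):
    """Yield batch times in [start, end) whose jobs have not yet run."""
    jobs = []
    t = start
    while t < end:
        if t not in completed:
            jobs.append(t)   # a "job" here is represented by its batch time
        t += interval
    return jobs

completed = {0, 2}                        # batches finished before a crash
jobs = generate_jobs(0, 8, 2, completed)  # only the unfinished batches
for t in jobs:
    completed.add(t)                      # mark each batch done exactly once
```

Re-running `generate_jobs` with the updated `completed` set would yield nothing, which is the essence of the exactly-once guarantee at the batch level.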

Interpreting DStream from the source code

As you can see, DStream is the heart of Spark Streaming, just as RDD is the core of Spark Core; like RDD, it has dependencies and a compute method. More critical is the following code (DStream's generatedRDDs member):

This is a HashMap with Time as the key and an RDD as the value. It is proof that, as time passes, RDDs are continually generated, Jobs depending on them are generated, and they are run on the cluster through the JobScheduler. Once again, DStream is the template for RDDs.
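A plain-Python analogy to that HashMap (the names `generated_rdds` and `get_or_compute` echo Spark's internals but this is only a sketch, not Spark code): a dict keyed by batch time caches the "RDD" generated for each interval, so asking for the same time twice never recomputes it.

```python
generated_rdds = {}   # time -> "RDD" (here just the batch's records)

def get_or_compute(time, compute):
    """Return the RDD for `time`, generating and caching it on first use."""
    if time not in generated_rdds:
        generated_rdds[time] = compute(time)
    return generated_rdds[time]

rdd_at_2 = get_or_compute(2, lambda t: [f"record@{t}"])
rdd_again = get_or_compute(2, lambda t: ["should not recompute"])
```

As time advances, new keys are added for new batches, which is exactly the "RDDs generated over time" relationship the text describes.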

   DStream can be said to be the logical level, while RDD is the physical level; everything a DStream expresses is ultimately achieved through transformations of RDDs. The former is the higher-level abstraction, the latter the underlying implementation. DStream is, in effect, the encapsulation of a collection of RDDs in the time dimension; the relationship between DStream and RDD is that RDDs are generated continually over time, and an operation on a DStream is an operation on the RDD of each fixed time interval.

Summary:

The business logic in the spatial dimension acts on the DStream. As time passes, each Batch Interval forms a specific data set, producing an RDD; Transform operations on that RDD in turn form the RDD dependency relationship, the RDD DAG, forming a Job. The JobScheduler then, based on the time schedule and the RDD dependencies, publishes the Job to the Spark Cluster to run, continually producing Spark Jobs.

Note:

Data source: dt_ Big Data Dream Factory (Spark release version customization)

For more exclusive content, please follow the public account: Dt_spark

If you are interested in big data and Spark, you can listen for free every night to teacher Liaoliang's permanently free public Spark class, at YY room number 68917580.

