A thorough understanding of Spark Streaming through cases: the Spark Streaming operating mechanism

Source: Internet
Author: User

Contents of this issue:

  1. Spark Streaming architecture

  2. Spark Streaming operating mechanism

  Key components of the Spark big data analytics framework: Spark Core, Spark Streaming for stream computation, GraphX for graph computation, MLlib for machine learning, Spark SQL, the Tachyon file system, the SparkR compute engine, and more.

  Spark Streaming is in fact an application built on top of Spark Core. If you want to build a powerful Spark application, Spark Streaming is a useful reference: it involves multiple jobs cooperating with each other and touches essentially all of Spark's core components, so mastering Spark Streaming is critical.

  Basic Spark Streaming concepts:

1. Discretized stream (DStream): Spark Streaming's abstraction for a continuous, real-time data flow; the real-time data stream being processed corresponds to a DStream in Spark Streaming.

2. Batch data: Spark Streaming processes the real-time stream in batches, converting stream processing into batch processing over time slices.

3. Time slice (batch interval): the logical-level standard for quantifying data; time slices are the basis for splitting the data.

4. Window length: the span of stream data covered by one window. For example, if every 5 minutes we count the past 30 minutes of data, the window length is 30 minutes, i.e. 6 batch intervals (see the sketch after this list).

5. Sliding interval: how often the window computation is performed. In the example above, counting the past 30 minutes of data every 5 minutes, the sliding interval is 5 minutes.

6. Input DStream: an InputDStream is a special DStream that connects Spark Streaming to an external data source to read data.

7. Receiver: runs long-term (possibly 7x24) on an executor; each receiver is responsible for one InputDStream (for example, an input stream reading Kafka messages). Each receiver, together with its InputDStream, occupies one core/slot.
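To make the batch interval, window length, and sliding interval concrete, here is a minimal sketch (assuming a socket text source; the host, port, and app name are placeholders) that counts words over the past 30 minutes, recomputed every 5 minutes, on a 5-minute batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: a receiver occupies one thread, so at least two are needed locally
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowSketch")
    // Batch interval: the 5-minute time slice into which the stream is split
    val ssc = new StreamingContext(conf, Minutes(5))

    // Input DStream backed by a receiver (occupies one core/slot);
    // host and port are placeholders for a real source
    val lines = ssc.socketTextStream("localhost", 9999)

    // Window length 30 minutes (6 batch intervals), sliding interval 5 minutes
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(30), Minutes(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Here `Minutes(30)` is the window length from item 4, and the trailing `Minutes(5)` is the sliding interval from item 5.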


  Every step of a Spark Core computation is based on RDDs, and there are dependencies between the RDDs. In an RDD DAG, for example, 3 actions will trigger 3 jobs; the RDD dependencies run from the bottom up, and it is the jobs generated from the RDDs that actually execute. As can be seen from the DStream graph, the logic of DStream is basically consistent with that of the RDD: it is based on the RDD and adds a time dependency. The RDD DAG can be called the spatial dimension, which means that the whole of Spark Streaming adds a time dimension on top of it; it can therefore be described as spanning both space and time.
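As a minimal Spark Core illustration of that action-driven behavior (a sketch; `sc` is assumed to be an existing SparkContext):

```scala
// Three actions on one RDD lineage trigger three separate jobs,
// each re-walking the same bottom-up RDD dependencies.
val nums  = sc.parallelize(1 to 100)   // sc: an existing SparkContext
val evens = nums.filter(_ % 2 == 0)    // dependency: evens -> nums
println(evens.count())                 // action 1 -> job 1
println(evens.sum())                   // action 2 -> job 2
println(evens.take(5).mkString(", "))  // action 3 -> job 3
```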


From this perspective, Spark Streaming can be placed in a coordinate system, where the Y axis is the operations on the RDDs (the RDD dependencies form the logic of the whole job) and the X axis is time. As time passes, a job instance is generated at a fixed interval (the batch interval) and runs on the cluster.

For Spark Streaming, as data flows in from different data sources (for example, Flume and Kafka), each fixed time interval yields a series of immutable data sets, or event collections. This coincides with the RDD being based on a fixed, immutable data set; in fact, the RDD graph the DStream generates at each fixed time interval is based on the data set of one batch.
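As an example of such a source, an input DStream over Kafka can be created along these lines (a sketch assuming the spark-streaming-kafka 0.8 direct API; broker and topic names are placeholders, and `ssc` is the StreamingContext from the earlier sketch):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Each batch interval yields one immutable set of Kafka records,
// which becomes that interval's batch of data.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
```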

As can be seen, in each batch the RDD dependencies in the spatial dimension are the same; what differs is the size and content of the data flowing into each batch, which produces different instances of the RDD dependency relationship. The RDD graph is thus derived from the DStream graph; in other words, the DStream is a template for RDDs, and different time intervals generate different RDD graph instances.

  Interpreting DStream from the source code:


  Looking at the DStream source, DStream is the core of Spark Streaming, just as the RDD is the core of Spark Core; like the RDD, it has dependencies and a compute method. Even more critical is the following code:
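The members in question appear in DStream.scala approximately as follows (elided to the relevant lines):

```scala
// Core abstract members of DStream (from DStream.scala):
def slideDuration: Duration                   // interval at which this DStream generates RDDs
def dependencies: List[DStream[_]]            // parent DStreams this DStream depends on
def compute(validTime: Time): Option[RDD[T]]  // generates the RDD for a given time

// The member described in the next paragraph: RDDs generated so far, keyed by time
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
```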

  This is a HashMap with Time as the key and the RDD as the value. It is proof that, as time passes, RDDs are constantly generated, dependency-driven jobs are generated with them, and those jobs are run on the cluster through the JobScheduler. Once again, the DStream is the template for the RDD.

The DStream can be said to sit at the logical level, while the RDD sits at the physical level; what a DStream expresses is ultimately realized through transformations of RDDs. The former is the higher-level abstraction, the latter the underlying implementation. A DStream is, in effect, the encapsulation of a set of RDDs in the time dimension; the relationship between DStream and RDD is that RDDs are continuously produced as time passes, and operating on a DStream means operating on its RDDs at fixed times.
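A small sketch of this logical-versus-physical relationship, reusing the `lines` DStream from the earlier window sketch: the DStream-level `map` and the `transform`-based version below are equivalent, because `transform` simply exposes the RDD generated for each batch interval:

```scala
// Logical level: declare the computation once, on the DStream.
val upper1 = lines.map(_.toUpperCase)

// Physical level: the same computation written against each batch's RDD;
// transform hands you the RDD generated for every batch interval.
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))
```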

To summarize:

The business logic in the spatial dimension acts on the DStream. With the passage of time, each batch interval forms a specific data set and produces an RDD; transform operations on that RDD form the RDD dependencies, i.e. the RDD DAG, which in turn forms a job. The JobScheduler then, based on the time schedule and the RDD dependencies, publishes the jobs to the Spark cluster to run, continuously producing Spark jobs.
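This per-batch production of jobs can be observed from user code with `foreachRDD` (again reusing `lines` from the earlier sketch):

```scala
// Every batch interval, the DStream template yields one RDD; the count()
// action on that RDD triggers a Spark job on the cluster.
lines.foreachRDD { (rdd, time) =>
  println(s"batch at $time produced a job over ${rdd.count()} records")
}
```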

remark:
      • Data from: DT Big Data Dream Factory (Spark release version customization course)
      • DT Big Data Dream Factory WeChat public account: DT_Spark
      • Liaoliang's free big data hands-on YY live broadcast, every evening at 20:00: channel 68917580

