A thorough understanding of Spark Streaming through a case study: Spark Streaming operating mechanism and architecture


Contents of this issue:

1. Spark Streaming job architecture and operating mechanism

2. Spark Streaming fault-tolerant architecture and operating mechanism

In fact, time does not exist by itself; it is only through human perception that time appears to exist, a kind of illusory existence, while things in the universe keep happening at every moment.

Spark Streaming is like time: it keeps running according to its own operating mechanism and architecture, and no matter how many or how few applications you write, none of them can step outside of that framework.

  

I. Analyzing the Spark Streaming job execution mechanism through a case study; the case code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * A Spark Streaming online blacklist filter, developed in Scala and run on a cluster.
 * Background: in an ad-click billing system we filter out blacklisted clicks online so that
 * only valid ad clicks are billed, protecting the interests of advertisers; the same approach
 * can filter out invalid votes, ratings or traffic in an anti-fraud scoring (or traffic) system.
 * Implementation: use the transform API to program directly against RDDs and perform join operations.
 *
 * Sina Weibo: http://weibo.com/ilovepains/
 * Email: [email protected]
 */
object OnlineForeachRDD2DB {
  def main(args: Array[String]): Unit = {
    /**
     * Create a SparkConf object to set the runtime configuration of the Spark program,
     * for example use setMaster to set the URL of the master of the Spark cluster that the
     * program connects to. If it is set to local, the Spark program runs locally, which is
     * especially suitable for beginners whose machines have very limited resources
     * (for example, only 1 GB of memory).
     */
    val conf = new SparkConf()               // create the SparkConf object
    conf.setAppName("OnlineForeachRDD")      // set the application name, visible in the monitoring UI while the program runs
    // conf.setMaster("spark://Master:7077") // at this point the program would run on the Spark cluster
    conf.setMaster("local[6]")

    // Set the batchDuration interval to control how frequently jobs are generated,
    // and create the entry point for Spark Streaming execution.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("Master", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => {
          val sql = "INSERT INTO streaming_itemcount(item, count) VALUES('" + record._1 + "'," + record._2 + ")"
          val stmt = connection.createStatement()
          stmt.executeUpdate(sql)
        })
        ConnectionPool.returnConnection(connection) // return the connection to the pool for future reuse
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
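The ConnectionPool object used above is not shown in the case code. A minimal sketch of what such a helper could look like is given below, assuming a JDBC driver and a MySQL database reachable at a placeholder URL; the object name and the getConnection/returnConnection methods simply mirror the calls made in the example.

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal sketch of a JDBC connection pool that matches the getConnection/returnConnection
// calls in the case code above; the JDBC URL, user and password are placeholders.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else DriverManager.getConnection("jdbc:mysql://Master:3306/streaming", "root", "password")
  }

  def returnConnection(connection: Connection): Unit = {
    pool.offer(connection) // keep the connection around for reuse instead of closing it
  }
}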

The job execution mechanism can be analyzed by running the above code:

1. First, StreamingContext's start method is called; internally it starts JobScheduler's start method, which begins its message loop;

2. JobGenerator and ReceiverTracker are constructed inside JobScheduler's start method;

3. The start methods of JobGenerator and ReceiverTracker are then called, which perform the following actions:

After starting, JobGenerator continuously generates jobs according to the batchDuration;

ReceiverTracker first starts the receivers in the Spark cluster (in fact, it starts ReceiverSupervisor in the executors);

4. After a receiver receives data, the data is stored on the executor via ReceiverSupervisor;

5. At the same time, the metadata of the data is sent to ReceiverTracker in the driver, and ReceiverTracker manages the received metadata through its internal ReceivedBlockTracker;

6. Each batchInterval produces a specific job; in fact, the job here is not a job in the Spark Core sense, it is only the DAG of RDDs generated from the DStreamGraph;

7. To run, the job needs to be submitted to JobScheduler; JobScheduler uses a thread pool to find a separate thread in which to submit the job to the cluster, and inside that thread the job is actually triggered by an action on the RDD;

8. A thread pool is used to improve efficiency, because jobs are generated continuously during stream processing; in addition, the FAIR scheduling mode for jobs can be configured, which also requires multi-thread support (see the sketch after this list).
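The submission model described in steps 7 and 8 can be pictured with a small sketch. This is not Spark's internal code, only a simplified illustration: the fixed-size thread pool stands in for JobScheduler's job executor (assumed here to run one streaming job at a time), and each generated "job" is just a function that would trigger an RDD action when run.

import java.util.concurrent.Executors

// Simplified illustration (not Spark source code) of how generated batch jobs are handed
// to a thread pool and actually run inside a worker thread of that pool.
object JobSubmissionSketch {
  def main(args: Array[String]): Unit = {
    val numConcurrentJobs = 1 // assume one concurrent streaming job at a time
    val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

    // Pretend JobGenerator produced one job per batchDuration; each job is a function
    // that, when called, would trigger the RDD action for that batch.
    val generatedJobs: Seq[() => Unit] =
      (1 to 3).map(batch => () => println(s"running job for batch $batch"))

    generatedJobs.foreach { job =>
      jobExecutor.submit(new Runnable {
        override def run(): Unit = job() // the RDD action fires inside this pool thread
      })
    }

    jobExecutor.shutdown()
  }
}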

  

II. Examining the Spark Streaming operating mechanism from the perspective of its fault-tolerant architecture:

Spark Streaming's fault tolerance is based on DStream. A DStream creates RDDs over time, that is, it operates on an RDD at each fixed interval, so fault tolerance comes down to the fault tolerance of each RDD that is produced.

Based on the characteristics of RDDs, there are mainly two fault-tolerance mechanisms:

01. Based on checkpoints:

Between stages there are wide dependencies that produce shuffle operations; when the lineage chain becomes too complex and lengthy, a checkpoint is needed.

02. Based on lineage:

In general, Spark prefers lineage-based fault tolerance, because checkpointing large datasets is expensive.

Considering RDD dependencies, each stage internally contains only narrow dependencies, for which lineage-based fault tolerance is generally used, since it is convenient and efficient.

Summary: use lineage within a stage, and checkpoints between stages.
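As a quick illustration of the checkpoint side, the sketch below shows how checkpointing is typically switched on in a Spark Streaming application; the HDFS checkpoint directory and the socket source are placeholder values, and the word-count logic simply mirrors the case code above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: enable a checkpoint directory so that streaming metadata (and, for
// stateful operations, the generated RDDs) can be recovered; the directory is a placeholder.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs://Master:9000/checkpoint") // placeholder checkpoint directory

    val lines = ssc.socketTextStream("Master", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}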

  Note:

      • Source: DT Big Data Dream Factory (Spark release version customization course)
      • DT Big Data Dream Factory public account: Dt_spark
      • Teacher Liaoliang gives a free hands-on big data YY live broadcast every day at 20:00, channel 68917580

