Handling empty RDDs in Spark Streaming and stopping a streaming program gracefully

Source: Internet
Author: User

Contents of this issue:

    • Handling empty RDDs in Spark Streaming
    • Stopping a Spark Streaming program gracefully

  

Since Spark Streaming produces an RDD in every batch duration, empty RDDs are highly likely to appear, and how they are handled affects both running efficiency and the effective use of resources.

Spark Streaming receives data continuously, without knowing what state the received data is in at any given moment; forcing it to stop can therefore leave operations incomplete or cause consistency problems.

1. Handling empty RDDs in Spark Streaming:

foreachRDD is the core method (operator) through which DStreams trigger real action operations.

  When writing data to a database, if foreachPartition and the database write are executed even though the RDD is empty (the same applies when saving data to HDFS), compute resources are acquired for nothing.

  How can resources be saved and efficiency improved as much as possible? Add a check before processing:
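A sketch of this guard in plain Python, using a minimal stand-in for an RDD (`FakeRDD` and `save_batch` are illustrative names, not Spark's API):

```python
# Minimal stand-in for an RDD: the data is held as a list of partitions.
class FakeRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # e.g. [[1, 2], [3]]

    def count(self):
        # Like Spark's count(), this visits every partition,
        # which on a real cluster launches a full job.
        return sum(len(p) for p in self.partitions)


def save_batch(rdd, writer):
    """Write a batch out, but only if it actually contains data."""
    # The guard: skip empty batches entirely, so no write job
    # (and no database connection) is ever started for them.
    if rdd.count() == 0:
        return False
    for partition in rdd.partitions:
        for record in partition:
            writer.append(record)
    return True
```

The guard keeps empty batches from doing any work; as the article notes next, however, a count itself has to scan all the data, so it is not the cheapest possible check.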

  Judging by the data itself is still not ideal, because a count operation launches a job of its own and wastes resources. The candidate checks are as follows:

  If there are several partitions but their contents are all empty, take may still launch a job:
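The trade-off can be sketched in plain Python (again with an illustrative `FakeRDD`, not Spark's API): looking at the partition list is a free metadata check, while `take(1)` must scan partitions until it finds a record, so when partitions exist but are all empty it still ends up doing the work of a job. Spark's own `RDD.isEmpty` combines the two checks in this order:

```python
class FakeRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # list of lists, e.g. [[], [1]]

    def take(self, n):
        # Like Spark's take(): scan partitions one by one until n
        # records are found. If every partition is empty, all of
        # them are visited, i.e. a job is still launched.
        taken = []
        for partition in self.partitions:
            for record in partition:
                taken.append(record)
                if len(taken) == n:
                    return taken
        return taken

    def is_empty(self):
        # Mirrors the logic of Spark's RDD.isEmpty: a free metadata
        # check first, then take(1) only when partitions exist.
        return len(self.partitions) == 0 or len(self.take(1)) == 0
```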

  What can be observed when no data arrives: an RDD is still generated for every batch, but since no data came in, no block is generated, and the resulting RDD contains no partitions.

  Even when there are partitions, the job will not be executed if there is no blockId.

 Summary:

In fact, you can not generate the RDD, because it is necessary to maintain a concept, each bachduration will produce a job,job if there is no rdd can not be produced;

The job is generated at each interval, and if there is no job at the time of submission, what action does your action take, and on the surface it does not produce an RDD efficiency;

But at the scheduling level depends on each batchduration generated job, the scheduling level to determine whether there is an rdd, no RDD job will not be able to execute.

2. Stopping a Spark Streaming program:

  In general, how is a Spark Streaming application stopped? By calling stop on its StreamingContext.

 This way of stopping halts the streams, but it does not wait for all data processing to complete, and by default the SparkContext is stopped as well.

Use the stopGracefully option to handle this:
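The difference between the two stop modes can be illustrated with a toy model in plain Python (`ToyStreamingContext` is an illustrative class, not Spark's API): received batches wait in a queue, and a graceful stop drains that queue before shutting down, while a plain stop discards it.

```python
from collections import deque

# Toy model of a streaming context: received batches wait in a
# queue until they are processed.
class ToyStreamingContext:
    def __init__(self, batches):
        self.pending = deque(batches)  # received but not yet processed
        self.processed = []

    def stop(self, stop_gracefully=False):
        if stop_gracefully:
            # Graceful stop: drain everything already received
            # before shutting down, so no batch is lost.
            while self.pending:
                self.processed.append(self.pending.popleft())
        # A non-graceful stop simply drops whatever is still pending.
        self.pending.clear()


forced = ToyStreamingContext(["batch-1", "batch-2"])
forced.stop()                        # pending batches are discarded

graceful = ToyStreamingContext(["batch-1", "batch-2"])
graceful.stop(stop_gracefully=True)  # pending batches are processed first
```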

  When the application starts, stopOnShutdown is registered as a shutdown hook, with the stop callback passed in.
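Besides calling stop with stopGracefully set to true, Spark Streaming also provides the configuration flag `spark.streaming.stopGracefullyOnShutdown`; when it is enabled, the shutdown hook stops the streams gracefully. A configuration fragment (not runnable code):

```properties
# spark-defaults.conf (or pass via --conf on spark-submit)
spark.streaming.stopGracefullyOnShutdown  true
```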

  If there is still unprocessed data at that point, this is reported before the application stops.

  Summary: with stopGracefully, all received data is processed to completion before the application stops.

Note:

      • Data from: Liaoliang (Spark release version customization)
      • Sina Weibo: http://www.weibo.com/ilovepains
