Deepen your understanding of Spark RDDs (or guesses about them) with a series of destructive experiments (Python version)

Source: Internet
Author: User
Tags: for in range, gc overhead limit exceeded, spark rdd

This experiment grew out of a real case: a data set has to be maintained, and elements need to be inserted into it one at a time.

Here are the two silliest ways of writing it. Method One:

rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i]))

Each insertion creates a new single-element RDD and unions it with the existing one.

The consequences are:

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2$$anonfun$apply$1.apply(UnionRDD.scala:69)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2$$anonfun$apply$1.apply(UnionRDD.scala:68)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2.apply(UnionRDD.scala:68)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2.apply(UnionRDD.scala:68)
...

The error hit at iteration 2,119, and after that the RDD could not be operated on at all. In other words, counting how many elements were successfully inserted is pointless anyway.

Method Two: also pretty silly; it is really just a small improvement on Method One.

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i]))
    count = count + 1
    if count > 100:
        rdd.take(1)
        count = 0

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)

The error this time came at iteration 605: just as silly, a similar ending, but a different cause.

Then there is also a third experiment:

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i])).persist()
    count = count + 1
    if count > 100:
        rdd.take(1)
        count = 0

The result is almost identical to the second experiment.

Now let's talk about the reason.

Let's start with an analogy to what we would do purely in memory.

We have an array, we loop 10,000 times, and we insert one element per iteration. Nothing looks wrong with that; we do it all the time.
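
Written out, that in-memory version is simply:

arr = [-1]
for i in range(10000):
    arr.append(i)   # appending to a Python list: cheap, no hidden bookkeeping
print(len(arr))     # 10001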

We often treat an RDD as if it were an array (or list), and the only way to "insert" an element seems to be union. So why does it fail?

RDD operations fall into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually compute anything; computation is only triggered when an action is executed. Internally, the RDD's underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large intermediate results.

So an RDD is not an array. In Experiment One, every union is merely recorded, never executed, and eventually all that accumulated bookkeeping blows up memory.
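
To see the laziness concretely, here is a minimal sketch (assuming the same SparkContext sc as in the experiments above, with values chosen purely for illustration): the transformations return instantly without touching any data, and only the final action makes Spark evaluate the whole chain.

rdd = sc.parallelize([-1])
rdd = rdd.union(sc.parallelize([0]))   # transformation: only recorded, nothing runs yet
rdd = rdd.map(lambda x: x * 2)         # transformation: still nothing runs
print(rdd.collect())                   # action: the whole chain executes now -> [-2, 0]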

In the second experiment, an action is executed once every 100 iterations, which forces the accumulated unions to run. Clever, right?

But it still crashed, and it crashed even sooner.

This is because when the action runs, the RDD does not take the result of the previous round as its input; it still starts from the initial data and replays the entire chain of transformations. That chain keeps growing, and eventually serializing and recomputing it overflows the stack.
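
One way to watch this happen (a rough sketch, again assuming the SparkContext sc from the experiments) is to print the RDD's debug string after a batch of unions: the lineage that every action must serialize and replay gains another level per iteration.

rdd = sc.parallelize([-1])
for i in range(100):
    rdd = rdd.union(sc.parallelize([i]))
# toDebugString() returns bytes in PySpark; the output shows one nested UnionRDD
# per union -- the chain that every take()/collect() has to serialize and recompute.
print(rdd.toDebugString().decode("utf-8"))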

The third experiment suggests that persist has no obvious effect here.

Then experiment four suddenly came to mind.

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i])).persist()
    count = count + 1
    if count > 100:
        myArray = rdd.collect()
        rdd = sc.parallelize(myArray)
        count = 0

This version does keep running, but it still does not feel right.

The problem is that once the data volume gets large, myArray will blow up the driver's memory anyway.

In summary: the best approach is to insert into ordinary memory first, and only once all 10,000 elements have been accumulated turn them into an RDD in one go.
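
As a concrete sketch of that advice (assuming the same SparkContext sc), the whole job reduces to plain list appends followed by a single parallelize:

data = [-1]
for i in range(10000):
    data.append(i)          # accumulate in driver memory first
rdd = sc.parallelize(data)  # build the RDD once, with a short, flat lineage
print(rdd.count())          # 10001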

The same principle applies in other scenarios: creating large-scale data is fine, but do not build up large-scale chains of transformation operations.
