Deepen your understanding of Spark RDDs (or guesses about them) with a series of destructive experiments (Python version)

Source: Internet
Author: User
Tags: for in range, gc overhead limit exceeded, spark rdd

This experiment grew out of a real case: a data set has to be maintained, and elements need to be inserted into it one at a time.

Here are the two silliest ways of writing it. Method One:

rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i]))

Each insertion creates a new single-element RDD and unions it with the existing one.

The consequences are:

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2$$anonfun$apply$1.apply(UnionRDD.scala:69)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2$$anonfun$apply$1.apply(UnionRDD.scala:68)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2.apply(UnionRDD.scala:68)
at org.apache.spark.rdd.UnionRDD$$anonfun$getPartitions$2.apply(UnionRDD.scala:68)
...

The error hit at iteration 2,119, and after that the RDD could not be operated on at all. In other words, counting how many elements were successfully inserted is pointless anyway.

Method Two: also pretty silly; it is really just a small improvement on Method One.

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i]))
    count = count + 1
    if count > 100:
        rdd.take(1)
        count = 0

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)

The error this time came at iteration 605: just as silly, a similar ending, but a different cause.

Then there is also a third experiment:

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i])).persist()
    count = count + 1
    if count > 100:
        rdd.take(1)
        count = 0

The result is almost identical to the second experiment.

Now let's talk about the reason.

Let's start with an analogy to what we would do purely in memory.

We have an array, we loop 10,000 times, and we insert one element per iteration. Nothing looks wrong with that; we do it all the time.
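
Written out, that in-memory version is simply:

arr = [-1]
for i in range(10000):
    arr.append(i)   # appending to a Python list: cheap, no hidden bookkeeping
print(len(arr))     # 10001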

We often treat an RDD as if it were an array (or list), and the only way to "insert" an element seems to be union. So why does it fail?

RDD operations fall into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually compute anything; computation is only triggered when an action is executed. Internally, the RDD's underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large intermediate results.

So an RDD is not an array. In Experiment One, every union is merely recorded, never executed, and eventually all that accumulated bookkeeping blows up memory.
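
To see the laziness concretely, here is a minimal sketch (assuming the same SparkContext sc as in the experiments above, with values chosen purely for illustration): the transformations return instantly without touching any data, and only the final action makes Spark evaluate the whole chain.

rdd = sc.parallelize([-1])
rdd = rdd.union(sc.parallelize([0]))   # transformation: only recorded, nothing runs yet
rdd = rdd.map(lambda x: x * 2)         # transformation: still nothing runs
print(rdd.collect())                   # action: the whole chain executes now -> [-2, 0]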

In the second experiment, an action is executed once every 100 iterations, which forces the accumulated unions to run. Clever, right?

But it still crashed, and it crashed even sooner.

This is because when the action runs, the RDD does not take the result of the previous round as its input; it still starts from the initial data and replays the entire chain of transformations. That chain keeps growing, and eventually serializing and recomputing it overflows the stack.
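
One way to watch this happen (a rough sketch, again assuming the SparkContext sc from the experiments) is to print the RDD's debug string after a batch of unions: the lineage that every action must serialize and replay gains another level per iteration.

rdd = sc.parallelize([-1])
for i in range(100):
    rdd = rdd.union(sc.parallelize([i]))
# toDebugString() returns bytes in PySpark; the output shows one nested UnionRDD
# per union -- the chain that every take()/collect() has to serialize and recompute.
print(rdd.toDebugString().decode("utf-8"))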

The third experiment suggests that persist has no obvious effect here.

Then experiment four suddenly came to mind.

count = 0
rdd = sc.parallelize([-1])
for i in range(10000):
    rdd = rdd.union(sc.parallelize([i])).persist()
    count = count + 1
    if count > 100:
        myArray = rdd.collect()
        rdd = sc.parallelize(myArray)
        count = 0

This version does keep running, but it still does not feel right.

The problem is that once the data volume gets large, myArray will blow up the driver's memory anyway.

In summary: the best approach is to insert into ordinary memory first, and only once all 10,000 elements have been accumulated turn them into an RDD in one go.
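
As a concrete sketch of that advice (assuming the same SparkContext sc), the whole job reduces to plain list appends followed by a single parallelize:

data = [-1]
for i in range(10000):
    data.append(i)          # accumulate in driver memory first
rdd = sc.parallelize(data)  # build the RDD once, with a short, flat lineage
print(rdd.count())          # 10001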

The same principle applies in other scenarios: creating large-scale data is fine, but do not build up large-scale chains of transformation operations.
