Spark: best practice for retrieving big data from RDD to local machine

Source: Internet
Author: User

I've got a big RDD (1 GB) in a YARN cluster. On the local machine which uses this cluster I have only 512 MB. I'd like to iterate over the values in the RDD on my local machine. I can't use collect(), because it would create too big an array locally, larger than my heap. I need some iterative way. There is the method iterator(), but it requires some additional information that I can't provide.

Update: the toLocalIterator method has been committed.

apache-spark
Asked by epahomov
toLocalIterator is not ideal if you want to iterate locally over a partition at a time. – Landon Kuhn

@LandonKuhn why not? – Tom Yubing Dong
5 Answers
Accepted answer, 26 votes

Update: RDD.toLocalIterator, a method that appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.
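As a rough sketch (not from the original answer) of how that method might be used; the input path and the per-record processing are illustrative:

// A sketch, assuming an existing SparkContext `sc`; the path is made up.
val rdd = sc.textFile("hdfs:///path/to/big/input")

// toLocalIterator pulls one partition at a time to the driver,
// so only a single partition has to fit in local memory.
for (record <- rdd.toLocalIterator) {
  println(record)  // or write to a file, feed to another process, etc.
}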

TL;DR, but the original answer might give a rough idea how it works:

First of all, get the array of partition indexes:

val parts = rdd.partitions

Then create smaller RDDs by filtering out everything but a single partition. Collect the data from the smaller RDDs and iterate over the values of one partition at a time:

for (p <- parts) {
  val idx = p.index
  val partRdd = rdd.mapPartitionsWithIndex(
    (index, it) => if (index == idx) it else Iterator(),
    true)  // the second argument is true to avoid RDD reshuffling
  val data = partRdd.collect  // data contains all values from a single partition
                              // in the form of an array
  // Now you can do with the data whatever you want: iterate, save to a file, etc.
}

I didn't try this code, but it should work. Please write a comment if it won't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).
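For completeness, a minimal sketch of that last step; the partition count is an arbitrary value chosen for illustration:

// A sketch; 100 is an arbitrary partition count.
// With shuffle = true, coalesce can increase the number of partitions,
// making each one small enough to collect on its own.
val finerRdd = rdd.coalesce(100, shuffle = true)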

– answered by Wildfire

Does this code cause all the partitions to be computed serially when it loops through and calls mapPartitionsWithIndex? What's the best way to remedy this? – foboi1122

@foboi1122 Updated answer. – Wildfire

@Wildfire Will this approach resolve this? Otherwise, how can it be resolved, with this or any other approach? – chikumiku

Answer, 10 votes

Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using the API of Spark. If you want to process each partition in turn, I don't see why you can't use the map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when you're going to perform a non-monoidal operation, but that doesn't seem to be what you want. You should be able to do something like:

rdd.mapPartitions(recordsIterator => your code that processes a single chunk)

Or this

rdd.foreachPartition(partition => {
  partition.toArray
  // Your code
})
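To illustrate the point about monoidal operations, here is a hedged sketch (not from the original answer) that aggregates per partition and combines the results on the cluster, so only a small summary reaches the driver; the counting is made up for illustration:

// A sketch: compute one small summary value per partition, then combine on the cluster.
val partialCounts = rdd.mapPartitions(records => Iterator(records.size))
val total = partialCounts.reduce(_ + _)  // only a single Int is returned to the driver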
– answered by samthebest

Isn't it that these operators execute on the cluster? – epahomov

Yes they will, but why are you avoiding that? If you can process each chunk in turn, you should be able to write the code in such a way that it can distribute, e.g. using aggregate. – samthebest

Isn't the iterator returned by foreachPartition the data iterator for a single partition, and not an iterator of all partitions? – javadba
Answer, 5 votes

Here is the same approach as suggested by @Wildfire, but written in PySpark.

The nice thing about this approach is that it lets the user access records in the RDD in order. I'm using this code to feed data from the RDD into the STDIN of a machine learning tool's process.

rdd = sc.parallelize(range(100), 10)

def make_part_filter(index):
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print "partition id: %s elements: %s" % (part_id, data_from_part_rdd)

Produces output:

partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
– answered by vvladymyrov
Answer, 1 vote

Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work.

The API says that there are map/filter/saveAsTextFile commands: https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations
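A minimal sketch of that approach (not from the original answer); the paths and the filter predicate are made up, just to show the shape of it:

// A sketch; the input/output paths and the predicate are illustrative.
val processed = sc.textFile("hdfs:///data/big-input")
  .filter(line => line.contains("ERROR"))  // shrink the data on the cluster
  .map(line => line.toUpperCase)

processed.saveAsTextFile("hdfs:///data/filtered-output")
// The much smaller result can then be pulled to the local machine, e.g.:
//   hdfs dfs -getmerge /data/filtered-output local-result.txt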

– answered by ya_pulser

Bad option. I don't want to do serialization/deserialization. So I want this data retrieved from Spark directly. – epahomov

How do you intend to get 1 GB without serde (i.e. storing on disk)? On a node with 512 MB? – ScrapCodes

By iterating over the RDD. You should be able to get each partition in sequence, and send each data item in sequence to the master, which can then pull them off the network and work on them. – interfect
Answer, 1 vote

For Spark 1.3.1, the format is as follows

val parts = rdd.partitions
for (p <- parts) {
  val idx = p.index
  val partRdd = rdd.mapPartitionsWithIndex {
    case (index: Int, value: Iterator[(String, String, Float)]) =>
      if (index == idx) value else Iterator()
  }
  val dataPartitioned = partRdd.collect
  // Apply further processing on the data
}
