Spark: best practice for retrieving big data from RDD to local machine

Source: Internet
Author: User

I've got a big RDD (1 GB) in a YARN cluster. On the local machine which uses this cluster I have only 512 MB. I'd like to iterate over the values in the RDD on my local machine. I can't use collect(), because it would create too big an array locally, larger than my heap. I need some iterative way. There is the method iterator(), but it requires some additional information that I can't provide.

Update: the toLocalIterator method has been committed.

apache-spark
Asked by epahomov
toLocalIterator is not ideal if you want to iterate locally over a partition at a time. – Landon Kuhn

@LandonKuhn why not? – Tom Yubing Dong
5 Answers
Accepted answer, 26 votes

Update: RDD.toLocalIterator, a method that appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.
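As a rough sketch (not from the original answer) of how that method might be used; the input path and the per-record processing are illustrative:

// A sketch, assuming an existing SparkContext `sc`; the path is made up.
val rdd = sc.textFile("hdfs:///path/to/big/input")

// toLocalIterator pulls one partition at a time to the driver,
// so only a single partition has to fit in local memory.
for (record <- rdd.toLocalIterator) {
  println(record)  // or write to a file, feed to another process, etc.
}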

TL;DR, but the original answer might give a rough idea how it works:

First of all, get the array of partition indexes:

val parts = rdd.partitions

Then create smaller RDDs by filtering out everything but a single partition. Collect the data from the smaller RDDs and iterate over the values of one partition at a time:

for (p <- parts) {
  val idx = p.index
  val partRdd = rdd.mapPartitionsWithIndex(
    (index, it) => if (index == idx) it else Iterator(),
    true)  // the second argument is true to avoid RDD reshuffling
  val data = partRdd.collect  // data contains all values from a single partition
                              // in the form of an array
  // Now you can do with the data whatever you want: iterate, save to a file, etc.
}

I didn't try this code, but it should work. Please write a comment if it won't compile. Of course, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).
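For completeness, a minimal sketch of that last step; the partition count is an arbitrary value chosen for illustration:

// A sketch; 100 is an arbitrary partition count.
// With shuffle = true, coalesce can increase the number of partitions,
// making each one small enough to collect on its own.
val finerRdd = rdd.coalesce(100, shuffle = true)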

– answered by Wildfire

Does this code cause all the partitions to be computed serially when it loops through and calls mapPartitionsWithIndex? What's the best way to remedy this? – foboi1122

@foboi1122 Updated answer. – Wildfire

@Wildfire Will this approach resolve this? Otherwise, how can it be resolved, with this or any other approach? – chikumiku

Answer, 10 votes

Wildfire's answer seems semantically correct, but I'm sure you could be vastly more efficient by using the API of Spark. If you want to process each partition in turn, I don't see why you can't use the map/filter/reduce/reduceByKey/mapPartitions operations. The only time you'd want to have everything in one place in one array is when you're going to perform a non-monoidal operation, but that doesn't seem to be what you want. You should be able to do something like:

rdd.mapPartitions(recordsIterator => your code that processes a single chunk)

Or this

rdd.foreachPartition(partition => {
  partition.toArray
  // Your code
})
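To illustrate the point about monoidal operations, here is a hedged sketch (not from the original answer) that aggregates per partition and combines the results on the cluster, so only a small summary reaches the driver; the counting is made up for illustration:

// A sketch: compute one small summary value per partition, then combine on the cluster.
val partialCounts = rdd.mapPartitions(records => Iterator(records.size))
val total = partialCounts.reduce(_ + _)  // only a single Int is returned to the driver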
– answered by samthebest

Isn't it that these operators execute on the cluster? – epahomov

Yes they will, but why are you avoiding that? If you can process each chunk in turn, you should be able to write the code in such a way that it can distribute, e.g. using aggregate. – samthebest

Isn't the iterator returned by foreachPartition the data iterator for a single partition, and not an iterator of all partitions? – javadba
Answer, 5 votes

Here is the same approach as suggested by @Wildfire, but written in PySpark.

The nice thing about this approach is that it lets the user access records in the RDD in order. I'm using this code to feed data from the RDD into the STDIN of a machine learning tool's process.

rdd = sc.parallelize(range(100), 10)

def make_part_filter(index):
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print "partition id: %s elements: %s" % (part_id, data_from_part_rdd)

Produces output:

partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
– answered by vvladymyrov
Answer, 1 vote

Map/filter/reduce using Spark and download the results later? I think the usual Hadoop approach will work.

The API says that there are map/filter/saveAsTextFile commands: https://spark.incubator.apache.org/docs/0.8.1/scala-programming-guide.html#transformations
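A minimal sketch of that approach (not from the original answer); the paths and the filter predicate are made up, just to show the shape of it:

// A sketch; the input/output paths and the predicate are illustrative.
val processed = sc.textFile("hdfs:///data/big-input")
  .filter(line => line.contains("ERROR"))  // shrink the data on the cluster
  .map(line => line.toUpperCase)

processed.saveAsTextFile("hdfs:///data/filtered-output")
// The much smaller result can then be pulled to the local machine, e.g.:
//   hdfs dfs -getmerge /data/filtered-output local-result.txt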

– answered by ya_pulser

Bad option. I don't want to do serialization/deserialization. So I want this data retrieved from Spark directly. – epahomov

How do you intend to get 1 GB without serde (i.e. storing on disk)? On a node with 512 MB? – ScrapCodes

By iterating over the RDD. You should be able to get each partition in sequence, and send each data item in sequence to the master, which can then pull them off the network and work on them. – interfect
Answer, 1 vote

For Spark 1.3.1, the format is as follows

val parts = rdd.partitions
for (p <- parts) {
  val idx = p.index
  val partRdd = rdd.mapPartitionsWithIndex {
    case (index: Int, value: Iterator[(String, String, Float)]) =>
      if (index == idx) value else Iterator()
  }
  val dataPartitioned = partRdd.collect
  // Apply further processing on the data
}
