Here is the same approach as suggested by @Wildlife, but written in PySpark. The nice thing about this approach is that it lets the user access records in the RDD in order. I'm using this code to feed data from the RDD into the STDIN of a machine learning tool's process.

```python
rdd = sc.parallelize(range(100), 10)

def make_part_filter(index):
    # Build a partition-level filter that yields records only from
    # the partition whose index matches `index`.
    def part_filter(split_index, iterator):
        if split_index == index:
            for el in iterator:
                yield el
    return part_filter

for part_id in range(rdd.getNumPartitions()):
    # Keep only one partition's records, then pull them to the driver.
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    data_from_part_rdd = part_rdd.collect()
    print "partition id: %s elements: %s" % (part_id, data_from_part_rdd)
```
Produces output:

```
partition id: 0 elements: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
partition id: 1 elements: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
partition id: 2 elements: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
partition id: 3 elements: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
partition id: 4 elements: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
partition id: 5 elements: [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
partition id: 6 elements: [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
partition id: 7 elements: [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
partition id: 8 elements: [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
partition id: 9 elements: [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
```
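Since each loop iteration runs a separate Spark job that collects just one partition, the same pattern can stream data into an external process. Below is a minimal sketch of the STDIN use case mentioned above, reusing `rdd` and `make_part_filter` from the snippet and assuming Python 2 to match it; the command name `ml_tool` and its `--stdin` flag are hypothetical placeholders for whatever tool you run:

```python
import subprocess

# Sketch: pipe each partition's records, in partition order, into the
# standard input of an external process. "ml_tool" is a hypothetical
# placeholder command, not a real program.
for part_id in range(rdd.getNumPartitions()):
    part_rdd = rdd.mapPartitionsWithIndex(make_part_filter(part_id), True)
    proc = subprocess.Popen(["ml_tool", "--stdin"], stdin=subprocess.PIPE)
    for record in part_rdd.collect():
        proc.stdin.write("%s\n" % record)  # one record per line
    proc.stdin.close()
    proc.wait()  # finish this partition before starting the next
```

Note that each pass over the loop triggers a full Spark job, so this trades scheduling overhead for the ability to process partitions one at a time on the driver without collecting the whole RDD at once.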
edited Jun 5 at 20:14 | answered Jun 5 at 20:07 by vladymyrov (2,978)