Array sorting is a common operation. Comparison-based sorting has a lower bound of O(n log n), but in a distributed environment we can use concurrency to improve performance. This article walks through the array-sorting example that ships with Spark and analyzes its performance, trying to explain where the gain comes from.
Official example
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: sort <file>"
        exit(-1)
    sc = SparkContext(appName="PythonSort")
    lines = sc.textFile(sys.argv[1], 1)
    sortedCount = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey(lambda x: x)
    # This is just a demo on how to bring all the sorted data back to a single node.
    # In reality, we wouldn't want to collect all the data to the driver node.
    output = sortedCount.collect()
    for (num, unitcount) in output:
        print num
    sc.stop()
The entry point of every Spark application is a SparkContext instance, which is an abstraction of the whole Spark environment. A distributed environment is complex: if the programmer had to decide how the data set is partitioned and which machine computes each piece, Spark's usability would be greatly reduced. The SparkContext instance is the gateway to that environment, much like a car's controls: in your program you only call the appropriate interface and pass in the data, and Spark distributes and executes the work across the cluster while trying to maximize performance. At the end of the program you call its stop method to disconnect from the environment.
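As a minimal sketch of that lifecycle (assuming a local Spark installation; the application name "demo-app" and the sample data are purely illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="demo-app")   # connect to the Spark environment
rdd = sc.parallelize([3, 1, 2])         # hand a local collection to Spark for distribution
print(rdd.collect())                    # [3, 1, 2] -- the data comes back from the cluster
sc.stop()                               # disconnect from the environment when done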
The textFile method reads a text file and creates the corresponding RDD in the Spark environment; this data set is stored in the lines variable. flatMap differs from map: here map returns a (key, value) pair for each element, so the resulting RDD is a bit like a hash table, whereas the output of flatMap is an array, built by calling the supplied lambda function on every element and merging the results. This means the lambda passed to flatMap can return zero or more results per element, for example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
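For contrast, a quick sketch (assuming the same interactive Python 2 session as the example above) of what map returns on the same input:

>>> rdd.map(lambda x: range(1, x)).collect()      # map: one output element per input element
[[1], [1, 2], [1, 2, 3]]
>>> rdd.flatMap(lambda x: range(1, x)).collect()  # flatMap: the lists are flattened into one sequence
[1, 1, 2, 1, 2, 3]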
The map method is then called to turn each element into a (key, value) pair, and finally sortByKey sorts the pairs by key. The sorted RDD is sortedCount; calling its collect method returns the data in the set.
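The same pipeline can be run on a tiny in-memory data set to see each step's output (a hypothetical input that mirrors the official example without reading a file):

lines = sc.parallelize(["3 1", "2"])                 # stand-in for sc.textFile(...)
sortedCount = lines.flatMap(lambda x: x.split(' ')) \
                   .map(lambda x: (int(x), 1)) \
                   .sortByKey()
print(sortedCount.collect())                         # [(1, 1), (2, 1), (3, 1)]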
It is easy to see that this program simply invokes the sortByKey interface provided by Spark, rather than implementing a sorting algorithm in the code. The underlying implementation of sortByKey is as follows:
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] = {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
The idea is to divide the data into non-overlapping ranges (the number of ranges defaults to the number of partitions of the RDD) and then sort each range independently. The implementation is elegant, only two lines of code. Assuming the number of partitions is m and the size of the data set is n, each partition sorts roughly n/m elements, so the expected time complexity is O((n/m) log(n/m)). Spark runs one task per partition, so the effect is the same as sorting the ranges in parallel. A plain-Python sketch of the strategy is shown below.
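The sketch below illustrates the strategy outside Spark, not Spark's actual code: pick range boundaries, bucket the data so the ranges never overlap, sort each bucket independently, and concatenate. The boundaries here are hypothetical; RangePartitioner samples the data to choose them.

def range_partition_sort(data, boundaries):
    # one bucket per range; ranges are non-overlapping by construction
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for x in data:
        # place x in the first bucket whose upper boundary it does not exceed
        i = sum(1 for b in boundaries if x > b)
        buckets[i].append(x)
    result = []
    for bucket in buckets:          # in Spark each bucket would be sorted by a separate task
        result.extend(sorted(bucket))
    return result

print(range_partition_sort([9, 1, 7, 3, 8, 2], boundaries=[3, 6]))  # [1, 2, 3, 7, 8, 9]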