Spark example: sorting an array
Array sorting is a common operation. The lower bound for a comparison-based sorting algorithm is O(n log n), but in a distributed environment we can improve on this in practice. Here we walk through the implementation of array sorting in Spark, analyze its performance, and try to explain where the improvement comes from.
Official example
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: sort <file>"
        exit(-1)
    sc = SparkContext(appName="PythonSort")
    lines = sc.textFile(sys.argv[1], 1)
    sortedCount = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey(lambda x: x)
    # This is just a demo on how to bring all the sorted data back to a single node.
    # In reality, we wouldn't want to collect all the data to the driver node.
    output = sortedCount.collect()
    for (num, unitcount) in output:
        print num
    sc.stop()
The entry point of every Spark application is a SparkContext instance, which is an abstraction of the entire Spark environment. Given the complexity of a distributed environment, if the programmer had to decide how the dataset is partitioned and which machine each piece is computed on, Spark's usability would be greatly reduced. The SparkContext instance is the gateway to the whole environment: much like the controls of a car, you only need to call the corresponding interface and supply the data, and the work is distributed across the cluster and executed with as much parallelism as possible. At the end of the program, you must call the stop method to tear the environment down.
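As a minimal sketch of this life cycle (the master URL "local[*]" and the application name below are illustrative values, not part of the official example):

# Minimal sketch of a SparkContext life cycle; values are illustrative.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ContextDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
try:
    # Every distributed operation goes through the context.
    print(sc.parallelize(range(10)).sum())
finally:
    # Always release the environment when the program ends.
    sc.stop()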
The textFile method reads a text file and creates an RDD in the Spark environment; this dataset is stored in the lines variable. The flatMap method differs from map: map produces exactly one output element per input element (here a (key, value) pair, so the resulting RDD is somewhat like a hash table), whereas flatMap builds its output by calling the supplied lambda on each element and flattening all the results into one array. This means the lambda passed to flatMap can return zero or more results, for example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
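For contrast, a quick sketch (not from the original example) of what map returns on the same RDD: one output per input element, with no flattening.

>>> sorted(rdd.map(lambda x: list(range(1, x))).collect())
[[1], [1, 2], [1, 2, 3]]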
The map method then converts each element to the (key, value) form, and sortByKey sorts the pairs by key. The sorted RDD is sortedCount; calling its collect method returns the data in the set to the driver.
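A minimal sketch of the same map, sortByKey, collect pipeline on an in-memory RDD (the input numbers are made up for illustration; sc is an existing SparkContext):

>>> pairs = sc.parallelize(["3", "1", "2"]).map(lambda x: (int(x), 1))
>>> pairs.sortByKey().collect()
[(1, 1), (2, 1), (3, 1)]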
It is easy to see that this program actually calls the sortByKey sorting interface provided by Spark, rather than implementing a sorting algorithm in the code. The underlying implementation of sortByKey is as follows:
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] = {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
The idea is to divide the data into intervals with a RangePartitioner (the number of intervals defaults to the number of RDD partitions), and then sort each interval separately. The implementation is very elegant, using only two lines of code. Assume the number of partitions is m and the dataset size is n; the expected time complexity per partition is O((n/m) log(n/m)). Each Spark partition is executed by its own task, so the intervals are sorted in parallel, and because the intervals are contiguous key ranges, concatenating the sorted partitions yields the fully sorted result.
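To make the idea concrete, here is an illustrative pure-Python sketch of range-partition-then-sort. This is not Spark's actual code: the interval boundaries below are derived naively from the min and max, whereas RangePartitioner samples the data to choose them.

# Illustrative sketch of range-partition-then-sort; not Spark's implementation.
def range_partition_sort(data, num_partitions):
    lo, hi = min(data), max(data)
    width = (hi - lo) / float(num_partitions) or 1.0
    buckets = [[] for _ in range(num_partitions)]
    for x in data:
        # Assign each element to the interval that covers its value.
        i = min(int((x - lo) / width), num_partitions - 1)
        buckets[i].append(x)
    # Each interval can be sorted independently (in Spark, by a separate task);
    # concatenating the sorted intervals gives the globally sorted result.
    return [x for bucket in buckets for x in sorted(bucket)]

print(range_partition_sort([5, 3, 9, 1, 7, 2, 8], 3))  # [1, 2, 3, 5, 7, 8, 9]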