Spark example: sorting an array


Array sorting is a common operation. The lower bound for any comparison-based sorting algorithm is O(n log n), but in a distributed environment we can still improve wall-clock performance. Here we look at how array sorting is implemented in Spark, analyze its performance, and try to explain where the improvement comes from.

Official example

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: sort <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonSort")
    lines = sc.textFile(sys.argv[1], 1)
    sortedCount = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey()
    # This is just a demo on how to bring all the sorted data back to a single node.
    # In reality, we wouldn't want to collect all the data to the driver node.
    output = sortedCount.collect()
    for (num, unitcount) in output:
        print(num)

    sc.stop()

The entry point of every Spark application is a SparkContext instance, which is an abstraction of the entire Spark environment. Given the complexity of a distributed environment, if programmers had to decide how the dataset is partitioned and which machine each piece is computed on, Spark's usability would be greatly reduced. Like a car's controls, the SparkContext hides the machinery: you simply call the corresponding interface with your input data, and the work is distributed across the cluster and executed with the best performance it can achieve. At the end of the program, you must call the stop method to tear down the environment.

The textFile method reads a text file and creates an RDD in the Spark environment; this dataset is bound to the lines variable. The flatMap method differs from map: map produces exactly one output element for each input element, whereas flatMap calls the passed lambda function on each element, lets it return zero or more results, and flattens all of them into a single RDD. For example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
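The difference between the two operators can be mimicked in plain Python (illustrative only; real RDDs are distributed, but the per-element contract is the same):

```python
data = [2, 3, 4]

# map: exactly one output element per input element (here, a list each)
mapped = [list(range(1, x)) for x in data]
# → [[1], [1, 2], [1, 2, 3]]

# flatMap: each element may yield zero or more results, all flattened
flat_mapped = [y for x in data for y in range(1, x)]
# → [1, 1, 2, 1, 2, 3]
```

Sorting flat_mapped gives exactly the [1, 1, 1, 2, 2, 3] seen in the Spark session above.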

The map call then converts each word into a (key, value) pair, and finally sortByKey sorts the pairs by key. The sorted RDD is sortedCount; calling the collect method returns its data to the driver.
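On a small input, the whole pipeline can be traced by hand. The following plain-Python equivalent (illustrative, not distributed; the input line is a made-up example) shows what each stage produces:

```python
line = "3 1 2 2"                      # a hypothetical line of the input file

words = line.split(' ')               # flatMap stage: ['3', '1', '2', '2']
pairs = [(int(w), 1) for w in words]  # map stage: [(3, 1), (1, 1), (2, 1), (2, 1)]
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])  # sortByKey stage
# → [(1, 1), (2, 1), (2, 1), (3, 1)]
```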

It is easy to see that this program simply calls the sortByKey interface provided by Spark rather than implementing a sorting algorithm itself. The underlying (Scala) implementation of sortByKey is as follows:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] = {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

The idea is to divide the data into disjoint key ranges (the number of ranges defaults to the number of RDD partitions) and then run a sorting algorithm on each range independently. The implementation is very elegant, only a few lines of code. Assume the number of partitions is m and the dataset size is n; with evenly sized partitions, the expected time complexity of sorting one partition is O((n/m) log(n/m)). Each Spark partition is processed by its own task, so the m sorts run in parallel, and because the key ranges are ordered, concatenating the sorted partitions yields a globally sorted result.
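A minimal single-machine sketch of this range-partition-then-sort idea is shown below. The equal-width bucket boundaries are a simplifying assumption for illustration; Spark's RangePartitioner instead samples the data to pick boundaries that balance partition sizes.

```python
def range_partition_sort(data, num_partitions):
    """Sort by partitioning into key ranges, then sorting each range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_partitions or 1  # avoid zero width when all keys are equal
    buckets = [[] for _ in range(num_partitions)]
    for x in data:
        # every value in bucket i is <= every value in bucket i+1
        idx = min(int((x - lo) / width), num_partitions - 1)
        buckets[idx].append(x)
    for b in buckets:  # in Spark, each partition is sorted by its own task,
        b.sort()       # so these m sorts would run in parallel
    return [x for b in buckets for x in b]  # concatenation is globally sorted

print(range_partition_sort([9, 1, 5, 3, 8, 2, 7], 3))
# → [1, 2, 3, 5, 7, 8, 9]
```

The parallel speedup comes from step two: the expensive O(k log k) sorting work is split across buckets, while partitioning itself is only a linear pass.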

Visit my blog for more information: magic01
