Spark example: sorting an array
Array sorting is a common operation. The lower bound for a comparison-based sorting algorithm is O(n log n), but in a distributed environment we can improve on this in practice. Here we walk through the implementation of array sorting in Spark, analyze its performance, and try to explain where the improvement comes from.
Official example
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: sort <file>"
        exit(-1)
    sc = SparkContext(appName="PythonSort")
    lines = sc.textFile(sys.argv[1], 1)
    sortedCount = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (int(x), 1)) \
        .sortByKey(lambda x: x)
    # This is just a demo on how to bring all the sorted data back to a single node.
    # In reality, we wouldn't want to collect all the data to the driver node.
    output = sortedCount.collect()
    for (num, unitcount) in output:
        print num
    sc.stop()
The entry point of every Spark application is a SparkContext instance, which is an abstraction of the entire Spark environment. Given the complexity of a distributed environment, if the programmer had to decide how the dataset is partitioned and which machine each piece is computed on, Spark's usability would be greatly reduced. The SparkContext instance is the gateway to the whole environment: much like the controls of a car, you only need to call the corresponding interface and supply the data, and the work is distributed across the cluster and executed with as much parallelism as possible. At the end of the program, you must call the stop method to tear the environment down.
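As a minimal sketch of this life cycle (the master URL "local[*]" and the application name below are illustrative values, not part of the official example):

# Minimal sketch of a SparkContext life cycle; values are illustrative.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ContextDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
try:
    # Every distributed operation goes through the context.
    print(sc.parallelize(range(10)).sum())
finally:
    # Always release the environment when the program ends.
    sc.stop()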
The textFile method reads a text file and creates an RDD in the Spark environment; this dataset is stored in the lines variable. The flatMap method differs from map: map produces exactly one output element per input element (here a (key, value) pair, so the resulting RDD is somewhat like a hash table), whereas flatMap builds its output by calling the supplied lambda on each element and flattening all the results into one array. This means the lambda passed to flatMap can return zero or more results, for example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
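For contrast, a quick sketch (not from the original example) of what map returns on the same RDD: one output per input element, with no flattening.

>>> sorted(rdd.map(lambda x: list(range(1, x))).collect())
[[1], [1, 2], [1, 2, 3]]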
The map method then converts each element to the (key, value) form, and sortByKey sorts the pairs by key. The sorted RDD is sortedCount; calling its collect method returns the data in the set to the driver.
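A minimal sketch of the same map, sortByKey, collect pipeline on an in-memory RDD (the input numbers are made up for illustration; sc is an existing SparkContext):

>>> pairs = sc.parallelize(["3", "1", "2"]).map(lambda x: (int(x), 1))
>>> pairs.sortByKey().collect()
[(1, 1), (2, 1), (3, 1)]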
It is easy to see that this program actually calls the sortByKey sorting interface provided by Spark, rather than implementing a sorting algorithm in the code. The underlying implementation of sortByKey is as follows:
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size)
    : RDD[(K, V)] = {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
The idea is to divide the data into intervals with a RangePartitioner (the number of intervals defaults to the number of RDD partitions), and then sort each interval separately. The implementation is very elegant, using only two lines of code. Assume the number of partitions is m and the dataset size is n; the expected time complexity per partition is O((n/m) log(n/m)). Each Spark partition is executed by its own task, so the intervals are sorted in parallel, and because the intervals are contiguous key ranges, concatenating the sorted partitions yields the fully sorted result.
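To make the idea concrete, here is an illustrative pure-Python sketch of range-partition-then-sort. This is not Spark's actual code: the interval boundaries below are derived naively from the min and max, whereas RangePartitioner samples the data to choose them.

# Illustrative sketch of range-partition-then-sort; not Spark's implementation.
def range_partition_sort(data, num_partitions):
    lo, hi = min(data), max(data)
    width = (hi - lo) / float(num_partitions) or 1.0
    buckets = [[] for _ in range(num_partitions)]
    for x in data:
        # Assign each element to the interval that covers its value.
        i = min(int((x - lo) / width), num_partitions - 1)
        buckets[i].append(x)
    # Each interval can be sorted independently (in Spark, by a separate task);
    # concatenating the sorted intervals gives the globally sorted result.
    return [x for bucket in buckets for x in sorted(bucket)]

print(range_partition_sort([5, 3, 9, 1, 7, 2, 8], 3))  # [1, 2, 3, 5, 7, 8, 9]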