How can I beat Java's built-in sorting algorithm?

Source: Internet
Author: User

Java 8 ships with an optimized sorting algorithm. For int and other primitive types, Arrays.sort() combines dual-pivot quicksort, merge sort, and heuristic insertion sort. The algorithm is very powerful and works well in many cases, and it supports more variants for large arrays. I compared sorting algorithms I wrote myself with Java's built-in sort to see whether they can compete. The experiments include some special cases.

First, I wrote a classic quicksort. It takes a sample of the array and uses the mean of the sample to estimate the median of the whole array; that estimate serves as the initial pivot.
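The sampled-pivot idea can be sketched as follows. This is a minimal illustration, not the author's code: the class name, the sample size of 16, the evenly spaced sampling, and the choice to re-estimate the pivot on every recursive call are all assumptions.

```java
class SampledPivotQuickSort {
    // Hypothetical sample size; the article does not say how many elements are sampled.
    private static final int SAMPLE_SIZE = 16;

    /** Mean of an evenly spaced sample of a[lo..hi], used as the pivot value. */
    static int estimatePivot(int[] a, int lo, int hi) {
        int step = Math.max(1, (hi - lo + 1) / SAMPLE_SIZE);
        long sum = 0;
        int count = 0;
        for (int i = lo; i <= hi; i += step) {
            sum += a[i];
            count++;
        }
        // Floor of the mean: always lies between the sample's minimum and maximum,
        // which keeps the partition loops below inside the array bounds.
        return (int) Math.floorDiv(sum, (long) count);
    }

    static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        int pivot = estimatePivot(a, lo, hi);
        int i = lo;
        int j = hi;
        // Hoare-style partition around a pivot value that need not be an array element.
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i];
                a[i] = a[j];
                a[j] = t;
                i++;
                j--;
            }
        }
        quickSort(a, lo, j);
        quickSort(a, i, hi);
    }

    static void sort(int[] a) {
        quickSort(a, 0, a.length - 1);
    }
}
```

The mean is only a good stand-in for the median when the data is roughly symmetric; on skewed data the partitions become unbalanced.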

I borrowed some ideas from Java's implementation to improve my quicksort. The modified algorithm switches to insertion sort when sorting small arrays. With that change, my sort and Java's sort reach comparable running times. Wild et al. pointed out that when the array contains many duplicates, standard quicksort is faster than dual-pivot quicksort. I have not tried any bytecode- or assembly-level analysis or optimization. In most respects, my version is far less optimized than Java's system implementation.
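A minimal sketch of the small-array fallback, under stated assumptions: the cutoff of 32 elements and the middle-element pivot are illustrative choices, not values taken from the article.

```java
class HybridQuickSort {
    // Hypothetical cutoff: ranges of this size or smaller go to insertion sort.
    private static final int CUTOFF = 32;

    /** Plain insertion sort on the inclusive range a[lo..hi]. */
    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int x = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > x) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = x;
        }
    }

    static void sort(int[] a, int lo, int hi) {
        // Small (or empty) ranges: insertion sort has less overhead than recursion.
        if (hi - lo + 1 <= CUTOFF) {
            insertionSort(a, lo, hi);
            return;
        }
        int pivot = a[lo + (hi - lo) / 2]; // middle element, for simplicity
        int i = lo;
        int j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int t = a[i];
                a[i] = a[j];
                a[j] = t;
                i++;
                j--;
            }
        }
        sort(a, lo, j);
        sort(a, i, hi);
    }

    static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }
}
```

The best cutoff depends on the JVM and the hardware, so it is worth measuring rather than assuming.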

I had always wanted to test a simple sorting algorithm I had in mind, which I call Bleedsort. It is a distribution sort: it samples the array to estimate the distribution of the data to be sorted. The data is then distributed into a temporary array (as shown in Figure 1) and written back to the original array. This is a preprocessing step, after which another sorting algorithm finishes the job. In my tests I used my own quicksort. Merge sort should give better results, because merge sort works well on highly structured arrays. For ease of calculation, I only considered uniformly distributed data.
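One possible reading of this preprocessing step, as a sketch only. The article does not give the details, so the following are all assumptions: the value range is estimated from an evenly spaced sample, values map linearly to slots (uniform data), an occupied slot is resolved by probing to the right, the temporary array has 4n slots plus n slots of slack so probing never runs off the end, and Arrays.sort stands in for the finishing sort.

```java
import java.util.Arrays;

class BleedSortSketch {
    /** Preprocessing pass: spread the values over a larger array, then copy them back in slot order. */
    static void bleedPass(int[] a) {
        int n = a.length;
        if (n < 2) {
            return;
        }
        // Estimate the value range from a sample (at most about 1000 probes).
        int step = Math.max(1, n / 1000);
        int lo = a[0];
        int hi = a[0];
        for (int i = 0; i < n; i += step) {
            lo = Math.min(lo, a[i]);
            hi = Math.max(hi, a[i]);
        }
        int slots = 4 * n;                          // assumed size of the temporary array
        int[] tmp = new int[slots + n];             // n extra slots absorb probing at the right edge
        boolean[] used = new boolean[slots + n];
        double scale = hi > lo ? (slots - 1) / ((double) hi - lo) : 0.0;
        for (int x : a) {
            // Values outside the sampled range are clamped to the first or last slot.
            long clamped = Math.max(lo, Math.min(hi, x));
            int pos = (int) ((clamped - lo) * scale);
            while (used[pos]) {
                pos++;                              // slot taken: move to the right
            }
            tmp[pos] = x;
            used[pos] = true;
        }
        // Write the values back in slot order; the result is only approximately sorted.
        int k = 0;
        for (int i = 0; i < tmp.length; i++) {
            if (used[i]) {
                a[k++] = tmp[i];
            }
        }
    }

    static void bleedSort(int[] a) {
        bleedPass(a);
        Arrays.sort(a); // stand-in for the finishing sort
    }
}
```

Because colliding elements pile up to the right of their slot, long runs of equal values turn the probing step into a linear scan.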

When Bleedsort encounters a value that is already present, it places the new element to the right, so the algorithm performs poorly on arrays with many duplicates. I therefore need to estimate the amount of duplication from a sample of the array to be sorted. When there are many duplicates, Bleedsort should be avoided.
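A rough way to make that check, as a sketch: take an evenly spaced sample, sort it, and look at the fraction of distinct values. The method name, the sample size, and any threshold applied to the result are hypothetical.

```java
import java.util.Arrays;

class DuplicateEstimate {
    /**
     * Fraction of distinct values in an evenly spaced sample of a.
     * Values near 1.0 suggest few duplicates; values near 0.0 suggest many.
     * sampleSize must be positive.
     */
    static double distinctRatio(int[] a, int sampleSize) {
        if (a.length == 0) {
            return 1.0;
        }
        int step = Math.max(1, a.length / sampleSize);
        int count = (a.length + step - 1) / step; // number of indices 0, step, 2*step, ...
        int[] sample = new int[count];
        for (int i = 0, k = 0; i < a.length; i += step, k++) {
            sample[k] = a[i];
        }
        Arrays.sort(sample);
        int distinct = 1;
        for (int k = 1; k < count; k++) {
            if (sample[k] != sample[k - 1]) {
                distinct++;
            }
        }
        return (double) distinct / count;
    }
}
```

A caller could fall back to Arrays.sort() when the ratio drops below some chosen threshold. Note that a small sample only detects heavy duplication; values that repeat a handful of times in a large array will rarely collide within the sample.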

I know that Bleedsort cannot compete with merge sort and quicksort on memory usage: the temporary array is about four times the size of the original array. Other distribution sorts, such as Flashsort, also do better in this respect.

Figure 1 Bleedsort example

I use JMH for benchmarking. For simplicity, I test with integer arrays only. On uniformly distributed arrays of 1,000,000 to 10,000,000 elements, my algorithm performs best. Although the quicksort I wrote is somewhat slower than Java's built-in algorithm, the preprocessing step makes up for it: Bleedsort followed by my quicksort takes 87 ms versus 105 ms for Java's built-in algorithm at 1e6 elements, and 938 ms versus 1.144 s at 1e7 elements.

Benchmark                         Mode    Cnt   Score  Error    Units  Corrected
MyBenchmark._1e6U                 sample  8512  0.024  ± 0.001  s/op
MyBenchmark._1e7U                 sample   985  0.236  ± 0.001  s/op
^ Baseline: array generation only, used for the correction
MyBenchmark.int1e6UQuickSort      sample  1641  0.131  ± 0.001  s/op   0.107 ± 0.002
MyBenchmark.int1e6UBleedSort      sample  2410  0.087  ± 0.001  s/op   0.063 ± 0.002
MyBenchmark.int1e6U[Arrays.sort]  sample  1978  0.105  ± 0.001  s/op   0.081 ± 0.002
MyBenchmark.int1e7UQuickSort      sample   200  1.483  ± 0.014  s/op   1.459 ± 0.015
MyBenchmark.int1e7UBleedSort      sample   373  0.938  ± 0.009  s/op   0.914 ± 0.010
MyBenchmark.int1e7U[Arrays.sort]  sample   200  1.144  ± 0.009  s/op   1.120 ± 0.010

So my algorithm, without any special optimization, is about 10-15% faster than Java's built-in algorithm on these data sets.

My algorithm also does not perform badly on evenly increasing data sets of 1,000,000 elements with 1% or 10% random duplicates.

Benchmark                  Mode    Cnt     Score   Error    Units  Corrected
._1e6Iwf010                sample   20705   9.701  ± 0.033  ms/op
._1e6Iwf001                sample  148693   1.344  ± 0.003  ms/op
^ Baseline: array generation only, used for the correction
.int1e6Iw010BleedSort      sample    4159  49.377  ± 0.571  ms/op  39.68 ± 0.60
.int1e6Iw010[Arrays.sort]  sample    3937  52.139  ± 0.229  ms/op  42.44 ± 0.25
.int1e6Iw010QuickSort      sample    3899  52.457  ± 0.210  ms/op  42.76 ± 0.23
^ 10% duplicate data
.int1e6Iw001BleedSort      sample    6190  32.821  ± 0.219  ms/op  31.48 ± 0.22
.int1e6Iw001[Arrays.sort]  sample    8113  24.910  ± 0.079  ms/op  23.57 ± 0.08
.int1e6Iw001QuickSort      sample    8653  23.367  ± 0.056  ms/op  22.02 ± 0.06
^ 1% duplicate data

However, the algorithm is only about on par on small data sets of about 10,000 items distributed ~Bin(100, 0.5). In these arrays, the value 50 occurs 795.5 times on average, and the value 40 occurs 108.4 times.

In addition, the algorithm is about twice as slow as Arrays.sort() on large arrays of 1,000,000 elements that contain many duplicates. For example, some arrays of size 1e6 have only 450 distinct values.
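For readers who want to reproduce this kind of input, here is a minimal sketch of a binomial test-data generator. It is not the author's generator: the class and method names are hypothetical, and each value is drawn simply by summing Bernoulli trials with java.util.Random, which is easy to follow but slow for large trial counts.

```java
import java.util.Arrays;
import java.util.Random;

class BinomialData {
    /** Returns size values, each distributed ~Bin(trials, p), drawn with the given seed. */
    static int[] binomialArray(int size, int trials, double p, long seed) {
        Random rnd = new Random(seed);
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            int successes = 0;
            for (int t = 0; t < trials; t++) {
                if (rnd.nextDouble() < p) {
                    successes++;
                }
            }
            a[i] = successes;
        }
        return a;
    }

    /** Number of distinct values in the array, a quick measure of duplication. */
    static int countDistinct(int[] a) {
        return (int) Arrays.stream(a).distinct().count();
    }
}
```

For example, binomialArray(10_000, 100, 0.5, 1L) produces 10,000 values that cluster around 50, so countDistinct returns far fewer distinct values than there are elements.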

Benchmark                   Mode    Cnt     Score  Error    Units  Corrected
._1e4bin100                 sample  152004  1.316  ± 0.001  ms/op
^ Baseline used for the correction
.int1e4bin100BleedSort      sample  148681  1.345  ± 0.001  ms/op  0.029 ± 0.002
.int1e4bin100[Arrays.sort]  sample  150864  1.326  ± 0.001  ms/op  0.010 ± 0.002
.int1e4bin100QuickSort      sample  146852  1.362  ± 0.001  ms/op  0.046 ± 0.002
.int1e6bin1e4BleedSort      sample   75344  2.654  ± 0.005  ms/op  -
.int1e6bin1e4[Arrays.sort]  sample  146801  1.361  ± 0.002  ms/op  -
.int1e6bin1e4QuickSort      sample   76467  2.615  ± 0.005  ms/op  -

On small (10,000 or 100,000 elements) uniformly random arrays, the algorithm is acceptable, but not better than the built-in algorithm.

Benchmark                         Mode    Cnt     Score  Error    Units  Corrected
MyBenchmark.int1e4UBleedSort      sample  216492  0.924  ± 0.001  ms/op  0.683 ± 0.002
MyBenchmark.int1e4U[Arrays.sort]  sample  253489  0.789  ± 0.001  ms/op  0.548 ± 0.002
MyBenchmark.int1e4UQuickSort      sample  217394  0.920  ± 0.001  ms/op  0.679 ± 0.002
MyBenchmark.int1e5UBleedSort      sample   18752  0.011  ± 0.001  s/op   0.009 ± 0.002
MyBenchmark.int1e5U[Arrays.sort]  sample   22335  0.009  ± 0.001  s/op   0.007 ± 0.002
MyBenchmark.int1e5UQuickSort      sample   18748  0.011  ± 0.001  s/op   0.009 ± 0.002

All in all, when memory is not very tight, I recommend distribution sorting as an effective supplementary option for suitably large data sets.

Finally, let's take a look at the two distributions Bin(100, 0.5) and Bin(1000, 0.5). Here are two data sets of 100 random samples each, generated with R.

> rbinom(100, 100, 0.5)

[1] 43 49 51 47 49 40 46 51 50 49 45 50 51 50 53 45 53 48 56 45

[26] 47 55 47 53 53 56 41 47 51 51 46 49 49 52 46 48 49 48 56 54 53 52

[51] 54 48 45 45 50 48 49 52 50 48 49 45 54 50 41 53 45 48 53 52

[76] 50 53 47 55 47 60 54 52 56 45 46 46 38 43 53 45 62 48 52 52 49 52 56

> rbinom(100, 1000, 0.5)

[1] 515 481 523 519 524 516 498 473 523 514 483 496 458 506 507 491 514 489

[19] 475 489 485 507 486 523 521 492 502 500 503 501 504 482 518 506 498 525

[37] 498 491 492 479 506 499 505 497 510 479 504 510 485 488 495 519 522 490

[55] 517 511 511 488 519 508 475 521 505 493 480 498 490 492 492 476 490 506

[73] 496 505 521 518 506 509 477 483 509 493 497 501 483 502 470 515 519 509

[91] 510 496 477 508 506 481 490 511 498
