Finding the K Largest Numbers: Eight Solutions, Source Code, and Extensions

Source: Internet
Author: User

I. Problem Description

The so-called "top K" problem refers to finding the first K numbers, ordered from largest to smallest, in an unordered array of length n (n >= k).

The top-K problem can be a real-world problem, such as the K-th place in a bidding ranking, or the K-th highest price among multiple bidders, and so on.

II. Solutions

Solution 1: Sort the unordered array from largest to smallest first, then take the first K elements. The total time complexity is O(n*logn + k).

That is, simply sort all the elements with quicksort and then read off the K-th element.
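A minimal sketch of Solution 1 (the function name and the use of std::sort are illustrative, not from the original source):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Solution 1 sketch: sort descending, then the first k slots are the answer.
// O(n log n) for the sort plus O(k) to copy the results out.
std::vector<int> topKBySorting(std::vector<int> a, int k) {
    std::sort(a.begin(), a.end(), std::greater<int>());
    return std::vector<int>(a.begin(), a.begin() + k);
}
```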

Solution 2: Use selection sort or exchange (bubble) sort: select the maximum element K times to obtain the K largest numbers. The total time complexity is O(n*k).

This is also a naive solution, and a rather feeble one. Quicksort pays a high time cost only on the first run; subsequent queries take constant time. With selection sort, both the initial pass and later queries have high time complexity.
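A sketch of Solution 2, as a partial selection sort stopped after k passes (names are illustrative):

```cpp
#include <utility>
#include <vector>

// Solution 2 sketch: k selection passes. After pass i, a[0..i] holds the
// i+1 largest elements in descending order. O(n*k) comparisons overall.
void topKBySelection(std::vector<int>& a, int k) {
    int n = static_cast<int>(a.size());
    for (int i = 0; i < k && i < n; ++i) {
        int maxPos = i;
        for (int j = i + 1; j < n; ++j)
            if (a[j] > a[maxPos]) maxPos = j;
        std::swap(a[i], a[maxPos]);   // move the current maximum to slot i
    }
}
```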

Solution 3: Using the idea of quicksort's partition step, randomly pick an element x from the array S and divide the array into two parts, Sa and Sb: the elements in Sa are greater than or equal to x, and the elements in Sb are less than x. The time complexity can reach approximately O(n). There are two cases:

1. If the number of elements in Sa is less than K, then the (K - |Sa|)-th largest element in Sb is the K-th largest number;

2. If the number of elements in Sa is greater than or equal to K, return the K-th largest number in Sa. The time complexity is approximately O(n).

3. Repeat the two steps above recursively until the element is found.
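A sketch of Solution 3 as a quickselect (the partition details and names are illustrative assumptions):

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Solution 3 sketch: partition around a random pivot so elements >= pivot
// (Sa) come first, then recurse into whichever side holds the k-th largest.
// Expected time O(n); k is 1-based.
int kthLargestQuickselect(std::vector<int>& a, int left, int right, int k) {
    if (left == right) return a[left];
    // Move a random pivot to the end, then partition: >= pivot on the left.
    std::swap(a[left + std::rand() % (right - left + 1)], a[right]);
    int pivot = a[right];
    int i = left;                        // boundary of the ">= pivot" block
    for (int j = left; j < right; ++j)
        if (a[j] >= pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[right]);           // pivot lands at index i
    int sizeSa = i - left + 1;           // |Sa|, counting the pivot itself
    if (k == sizeSa) return a[i];
    if (k < sizeSa)  return kthLargestQuickselect(a, left, i - 1, k);
    return kthLargestQuickselect(a, i + 1, right, k - sizeSa);
}
```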

Solution 4: Build a max-heap on the original array, then pop the top element K times. Time complexity: O(n + k*logn).

This is essentially heap sort, stopped after K extractions.
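A sketch of Solution 4 using the standard library's heap operations (names are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Solution 4 sketch: heapify in O(n), then pop k maxima, O(log n) each.
std::vector<int> topKByMaxHeap(std::vector<int> a, int k) {
    std::make_heap(a.begin(), a.end());          // O(n) bottom-up heapify
    std::vector<int> result;
    for (int i = 0; i < k; ++i) {
        std::pop_heap(a.begin(), a.end() - i);   // max moves to the back
        result.push_back(*(a.end() - 1 - i));
    }
    return result;                               // the k largest, descending
}
```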

Solution 5: Maintain a min-heap of size K. For each element of the array, compare it with the heap top: if the heap top is larger, ignore the element; otherwise, pop the heap top and insert the current value into the heap. Time complexity: O(n*logk).

The biggest advantage of this algorithm: if the array is very large, an ordinary sort would blow up memory, whereas this approach only uses memory for K elements.
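A sketch of Solution 5 with a size-k min-heap (the use of std::priority_queue is illustrative):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Solution 5 sketch: the heap top is always the smallest of the current
// k candidates, so any element larger than it displaces it. O(n log k)
// time and only O(k) extra memory.
std::vector<int> topKByMinHeap(const std::vector<int>& a, int k) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int x : a) {
        if (static_cast<int>(heap.size()) < k) heap.push(x);
        else if (x > heap.top()) { heap.pop(); heap.push(x); }
    }
    std::vector<int> result;                 // drain the heap, ascending
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}
```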

Solution 6: An optimization of Solution 4: incomplete sift-downs after the heap is built.

Similar to Solution 4, the difference is that the max-heap is built in place over the element array in O(n); then K elements are extracted, but on each extraction the element brought to the top only needs to sift down at most k levels, and the allowed depth decreases with each successive extraction. (In Solution 4 each extraction requires logn, so extracting k times costs K*logn, whereas this solution needs only k^2.) The complexity of this method is O(n + k^2): each time an element is extracted, bounding how far the element swapped to the top may sift down guarantees the worst-case efficiency.

Solution 7: Use a hash table to record the number of occurrences of each array element S[i], then, following the idea of counting sort, scan linearly from large to small; the number preceded by exactly k-1 larger values is the K-th largest. Average time complexity: O(n). (A counting-based sketch appears below, after Solution 8.)

Solution 8: The algorithm from the "Bible" of algorithms: the BFPRT algorithm, also known as the median-of-medians (groups of five) method. The approach is similar to Solution 3, but because the pivot element is chosen so well, O(n) complexity is achieved even in the worst case.
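Here is the promised sketch of Solution 7's counting idea, under the simplifying assumption (mine, not the original author's) that the elements are non-negative integers bounded by maxValue; a hash map would handle general keys:

```cpp
#include <vector>

// Solution 7 sketch: count occurrences, then scan buckets from large to
// small until k values have been passed. O(n + maxValue) time.
int kthLargestByCounting(const std::vector<int>& a, int k, int maxValue) {
    std::vector<int> count(maxValue + 1, 0);
    for (int x : a) ++count[x];               // counting pass
    for (int v = maxValue; v >= 0; --v) {     // scan from the largest value
        k -= count[v];
        if (k <= 0) return v;                 // the k-th largest is in bucket v
    }
    return -1;                                // k exceeds the array length
}
```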

Here are the main steps of the BFPRT algorithm:

Some implementation details, mainly boundary cases, are quite critical; these issues are addressed below.

Given n data items, take out the smallest K numbers.

Termination condition: when n = 1, the single element is returned as the i-th smallest.

Algorithm steps:

Step 1: Divide the n elements into groups of 5, giving ceil(n/5) groups; the last group has n%5 elements (when n is not a multiple of 5), and the number of full groups is floor(n/5).

Step 2: Find the median of each full group (the incomplete last group is not counted). Any sorting method will do; since each group has no more than 5 elements, a simple bubble sort or insertion sort suffices.

Step 3: Swap the median of each group, in group order, with the data at the front of the array, so that all the group medians sit on the left side of the array. Then recursively call the median-selection algorithm to find the median of all those group medians; call it x. When the number of medians is even, take the smaller of the two middle values.

Step 4: Partition the data by x: greater than or equal to x on the right, less than x on the left. Whether the median itself lands on the left or on the right has some effect on the partition of Step 4, as the code debugging will show.

Step 5: The partition in Step 4 returns an index i such that the elements to the left of i are all less than x, and the elements to the right of i (including position i) are all greater than or equal to x. Then:
    if i == k, return x;
    if i > k, recursively find the k-th smallest among the elements less than x;
    if i < k, recursively find the (k-i)-th smallest among the elements greater than or equal to x.
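A compact sketch of these steps (a minimal illustration assuming 1-based k; the names are mine, not from the GitHub implementation mentioned below):

```cpp
#include <algorithm>
#include <vector>

// Forward declaration: returns the k-th smallest (1-based) of v[left..right].
int bfprtSelect(std::vector<int>& v, int left, int right, int k);

// Steps 1-3: medians of full groups of five, then the median of those
// medians; the smaller middle value is chosen when the count is even.
int medianOfMedians(std::vector<int>& v, int left, int right) {
    int n = right - left + 1;
    if (n <= 5) {
        std::sort(v.begin() + left, v.begin() + right + 1);
        return v[left + (n - 1) / 2];
    }
    int numMedians = 0;
    for (int i = left; i + 4 <= right; i += 5) {          // full groups only
        std::sort(v.begin() + i, v.begin() + i + 5);
        std::swap(v[left + numMedians], v[i + 2]);        // move median left
        ++numMedians;
    }
    return bfprtSelect(v, left, left + numMedians - 1, (numMedians + 1) / 2);
}

// Steps 4-5: partition around x and recurse into the correct side.
int bfprtSelect(std::vector<int>& v, int left, int right, int k) {
    if (left == right) return v[left];                    // termination
    int x = medianOfMedians(v, left, right);
    int store = left;
    for (int i = left; i <= right; ++i)                   // "< x" block first
        if (v[i] < x) std::swap(v[store++], v[i]);
    for (int i = store; i <= right; ++i)                  // put one copy of x
        if (v[i] == x) { std::swap(v[store], v[i]); break; }
    int rank = store - left + 1;                          // x's 1-based rank
    if (k == rank) return v[store];
    if (k < rank)  return bfprtSelect(v, left, store - 1, k);
    return bfprtSelect(v, store + 1, right, k - rank);
}
```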

I posted a code implementation on GitHub.

III. The Median Problem

The median problem is actually a sub-problem of the top-K problem, so any of the top-K algorithms above can be used to solve it. Here are some more specialized median problems.

1. Dynamic median finding. The goal is to insert an element in logarithmic time, find the median in constant time, and delete the median in logarithmic time.

We assume that when the collection holds an even number of elements, the median is the smaller of the two middle numbers. Use two heaps: a max-heap holds the smaller (n+1)/2 numbers of the collection, and a min-heap holds the larger half. To query the median, look directly at the top element of the max-heap. To insert an element, compare it with the two heap tops to decide which heap it goes into; if, after the insertion, the sizes of the two heaps differ by more than 1, move the top element of the larger heap into the other heap. To delete the median, remove it and then rebalance the element counts of the two heaps in the same way.
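A sketch of this two-heap scheme (the class and method names are illustrative):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Two-heap dynamic median: "low" (a max-heap) keeps the smaller (n+1)/2
// elements, "high" (a min-heap) keeps the larger half, so the median --
// the smaller middle value when n is even -- is always low.top().
class DynamicMedian {
    std::priority_queue<int> low;                                        // max-heap
    std::priority_queue<int, std::vector<int>, std::greater<int>> high;  // min-heap
public:
    void insert(int x) {                       // O(log n)
        if (low.empty() || x <= low.top()) low.push(x);
        else high.push(x);
        // Rebalance: low must hold as many elements as high, or one more.
        if (low.size() > high.size() + 1) { high.push(low.top()); low.pop(); }
        else if (high.size() > low.size()) { low.push(high.top()); high.pop(); }
    }
    int median() const { return low.top(); }   // O(1); requires a nonempty set
    void removeMedian() {                      // O(log n)
        low.pop();
        if (high.size() > low.size()) { low.push(high.top()); high.pop(); }
    }
};
```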

I found someone else's implementation on GitHub.

2. Find the median of two ordered arrays.

This is another variant, and it generalizes to finding the K-th element of two sorted arrays. The simplest idea, of course, is to merge the arrays and then directly read the K-th element of the merged sorted array, which is an O(n) solution. For this "Kth element in 2 sorted arrays" problem, however, the two medians A[m/2] and B[n/2] divide the arrays into four sections. Which section to discard depends on two comparisons: (1) (m/2 + n/2 + 1) versus K; (2) A[m/2] versus B[n/2].

If (m/2 + n/2 + 1) > K, the current combined median position is past the K-th element, and the answer lies in Section 1 or Section 3. If additionally A[m/2] > B[n/2], the K-th element is definitely not in Section 2, so the next search can discard that interval. The remaining three cases can be inferred in the same way:

if (m/2 + n/2 + 1) > K && A[m/2] > B[n/2], drop Section 2
if (m/2 + n/2 + 1) > K && A[m/2] < B[n/2], drop Section 4
if (m/2 + n/2 + 1) < K && A[m/2] > B[n/2], drop Section 3
if (m/2 + n/2 + 1) < K && A[m/2] < B[n/2], drop Section 1

In a nutshell: either discard the interval to the right of the larger of the two medians, or discard the interval to the left of the smaller of the two medians.

The following code is attached (Kth number of 2 sorted arrays):

```cpp
#include <algorithm>
#include <cassert>

// Returns the k-th smallest (1-based) element of the union of A[0..n-1]
// and B[0..m-1], both sorted ascending.
int getMedian(int A[], int n, int B[], int m, int k) {
    assert(A && B);
    if (n <= 0) return B[k - 1];
    if (m <= 0) return A[k - 1];
    if (k <= 1) return std::min(A[0], B[0]);
    if (B[m / 2] >= A[n / 2]) {
        if ((n / 2 + 1 + m / 2) >= k)
            return getMedian(A, n, B, m / 2, k);          // drop B's right half
        else
            return getMedian(A + n / 2 + 1, n - (n / 2 + 1), B, m,
                             k - (n / 2 + 1));            // drop A's left half
    } else {
        if ((m / 2 + 1 + n / 2) >= k)
            return getMedian(A, n / 2, B, m, k);          // drop A's right half
        else
            return getMedian(A, n, B + m / 2 + 1, m - (m / 2 + 1),
                             k - (m / 2 + 1));            // drop B's left half
    }
}

double findMedianSortedArrays(int A[], int m, int B[], int n) {
    if ((n + m) % 2 == 0)
        return (getMedian(A, m, B, n, (m + n) / 2) +
                getMedian(A, m, B, n, (m + n) / 2 + 1)) / 2.0;
    else
        return getMedian(A, m, B, n, (m + n) / 2 + 1);
}
```
IV. Extensions ("The Beauty of Programming", Section 2.5 after-class exercises):

1. What if we need to find the K largest *distinct* floating-point numbers among n? For example, the 3 largest distinct floating-point numbers in the array (1.5, 1.5, 2.5, 3.5, 3.5, 5, 0, -1.5, 3.5) are (5, 3.5, 2.5).
Solution: The solutions above still apply. Note, however, that comparing floating-point numbers differs from comparing integers, and the way the hash key is computed (for Solution 7) will also be slightly different.


2. What if we need the numbers ranked from K-th to M-th largest (0 < K <= M <= n)?
Answer: Treating it as the problem of finding these M-K+1 numbers, the previous solutions all apply. But for a top-K-style problem like this, Solution 5 or Solution 7 is preferable, since their overall complexity is lower.


3. In a search engine, every page on the web has an "authority" weight, such as PageRank. Suppose we need to find the K pages with the largest weights, and the weights are continually being updated. How should the algorithm be changed to achieve fast incremental updates while returning the top-K weighted pages promptly?
Hint: heap sort? Update the heap whenever a page's weight is updated. Is there a better way?
Answer: To achieve fast updates, we can use Solution 5 together with a position map for the binary heap, which makes the update operation O(logn).
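One possible sketch of such a position-mapped heap (the class, its methods, and the id-to-index map are illustrative assumptions, not a known library API):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// A max-heap of (page id, weight) pairs plus an id -> heap-index map, so a
// weight update can locate the page and sift it up or down in O(log n).
class IndexedMaxHeap {
    struct Node { int id; double weight; };
    std::vector<Node> heap;
    std::unordered_map<int, size_t> pos;    // page id -> index in `heap`

    void swapNodes(size_t a, size_t b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a].id] = a;
        pos[heap[b].id] = b;
    }
    void siftUp(size_t i) {
        while (i > 0 && heap[(i - 1) / 2].weight < heap[i].weight) {
            swapNodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void siftDown(size_t i) {
        for (;;) {
            size_t best = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap.size() && heap[l].weight > heap[best].weight) best = l;
            if (r < heap.size() && heap[r].weight > heap[best].weight) best = r;
            if (best == i) break;
            swapNodes(i, best);
            i = best;
        }
    }
public:
    // Insert a new page, or update an existing page's weight, in O(log n).
    void upsert(int id, double weight) {
        auto it = pos.find(id);
        if (it == pos.end()) {
            heap.push_back({id, weight});
            pos[id] = heap.size() - 1;
            siftUp(heap.size() - 1);
        } else {
            heap[it->second].weight = weight;
            siftUp(it->second);
            siftDown(pos[id]);   // re-read: siftUp may have moved the node
        }
    }
    // The current top-weighted page; requires a nonempty heap.
    std::pair<int, double> top() const { return {heap[0].id, heap[0].weight}; }
};
```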

4. In practical applications, there is also an "accuracy" issue. We may not need the K largest elements in the strictest sense; some error at the boundary positions is allowed. When a user enters a query, every document D has a relevance weight f(query, D) with respect to that query, and the search engine needs to return the K pages with the largest relevance weights. With 10 results per page, users will not care about the "accuracy" of search results around the 1000th page, so a slight error may be acceptable: for instance, we could return the 10,001st most relevant page instead of the 9,999th. In this case, how can the algorithm be improved to be faster and more efficient? And if the number of pages is so large that no single machine can hold them all, what then?

Hint: merge sort? If each machine returns its K most relevant documents, then the union of all machines' top-K sets must contain the K most relevant documents of the complete collection. Since the boundary need not be precise, each machine could instead return its best K' documents, with K' chosen so that, say, 90% of the K documents we return are fully accurate (at least 90% of the true top-K documents really do rank in the top K), or so that the true rank of the worst document we return does not exceed 110%*K.
Answer: As the hint says, each machine can return its K' most relevant documents, and the idea of merge sort can then be used to obtain the overall top K. In the best case, the K documents are spread evenly across all machines, and each machine needs only K' = K/n (where n is the number of machines); in the worst case, all of the top-K documents sit on one machine, and K' must be close to K. I think a good approach is to maintain a heap on each machine and then merge the heap-top elements in sorted order.

5. As mentioned in question 4, each document D has relevance weights f(D, q1), f(D, q2), ..., f(D, qm) with respect to different keywords q1, q2, ..., qm. Suppose that for keyword qi we have already obtained the K most relevant documents, and that keyword qj is known to be similar to qi, with documents having close weights for the two keywords. Is the top-K result for qi of any help in finding the K documents most relevant to qj?

Answer: It certainly helps. When searching for the K documents most relevant to qj, one can first search within the documents already found relevant to qi, and then extend the search to the full document collection.
