Finding the K Largest Numbers: A C++ Implementation of the Top-K Problem

Source: Internet
Author: User

Calculate the sum of the 1 million largest of 200 million integers.

Question: A file contains 200 million integers separated by whitespace. Calculate the sum of the 1 million largest integers.

Algorithm:
1. Create an int array with a capacity of 1 million (ntop) and fill it with the first ntop integers read from the file.
2. Build a min-heap over these 1 million values, so that the top element of the heap is always the smallest of them.
3. Read the next integer from the file and compare it with the heap top. If it is greater than the heap top, replace the heap top with it and re-adjust the heap; otherwise discard it.
4. Repeat step 3 until all the data has been read. The array then holds the 1 million largest integers.
5. Add up all the elements in the array and print the result.
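
The steps above can be sketched with the standard library's std::priority_queue instead of a hand-written heap. This is only a minimal illustration, not the reference code: the name sum_of_top_k and the string overload are assumptions made for this example, and the input is taken from any std::istream so that a file stream or an in-memory string can be used.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: sum of the k largest integers read from `in`.
// At most k values are kept in memory at any time.
long long sum_of_top_k(std::istream& in, std::size_t k)
{
    assert(k > 0);
    // std::greater makes this a min-heap: top() is the smallest value kept.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    int value = 0;
    while (in >> value) {
        if (heap.size() < k) {
            heap.push(value);             // steps 1-2: fill the heap with the first k values
        } else if (value > heap.top()) {
            heap.pop();                   // step 3: evict the current minimum
            heap.push(value);
        }
    }
    long long sum = 0;                    // step 5: 64-bit accumulator to avoid int overflow
    while (!heap.empty()) {
        sum += heap.top();
        heap.pop();
    }
    return sum;
}

// Convenience overload for small in-memory inputs.
long long sum_of_top_k(const std::string& text, std::size_t k)
{
    std::istringstream in(text);
    return sum_of_top_k(in, k);
}
```

For a real file, the same function could be called with a std::ifstream. If the input holds fewer than k values, the sum of all of them is returned.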

Reference source code:

    #include <cstdio>
    #include <iostream>
    #include <new>
    #include <string>
    using namespace std;

    template <class T>
    class ctopk
    {
    public:
        ctopk();
        ~ctopk();
        T* m_data;
        int gettopk(const char* sfile, int& ntop);

    private:
        void clear();
        void heapadjust(int nstart, int nlen);
        void buildheap(int nlen);
    };

    template <class T>
    ctopk<T>::ctopk()
    {
        m_data = NULL;
    }

    template <class T>
    ctopk<T>::~ctopk()
    {
        clear();
    }

    template <class T>
    void ctopk<T>::clear()
    {
        if (NULL != m_data)
        {
            delete[] m_data;
            m_data = NULL;
        }
    }

    // Obtain the top K numbers
    template <class T>
    int ctopk<T>::gettopk(const char* sfile, int& ntop)
    {
        FILE* fp = NULL;
        T fdata;
        int i = 0;

        // check the input parameters
        if ((NULL == sfile) || (ntop <= 0))
        {
            cout << "error parameter" << endl;
            return -1;
        }

        // release data left over from a previous call
        clear();

        // open the file
        fp = fopen(sfile, "r");
        if (NULL == fp)
        {
            cout << "open file failed!" << endl;
            return -1;
        }

        // allocate space; nothrow makes a failed allocation return NULL
        m_data = new (nothrow) T[ntop];
        if (NULL == m_data)
        {
            cout << "new operator failed!" << endl;
            fclose(fp);
            return -1;
        }

        cout << "please wait..." << endl;

        // read the first ntop values; note that "%d" is only correct when T is int
        for (i = 0; i < ntop; i++)
        {
            if (1 == fscanf(fp, "%d", &fdata))
                m_data[i] = fdata;
            else
                break;
        }

        if (i < ntop)
        {
            // the file holds fewer than ntop values: the result is the first i values
            ntop = i;
        }
        else
        {
            buildheap(ntop);  // build a min-heap

            while (1 == fscanf(fp, "%d", &fdata))
            {
                if (fdata > m_data[0])
                {
                    // replace the heap top and adjust the heap
                    m_data[0] = fdata;
                    heapadjust(0, ntop);
                }
            }
        }

        fclose(fp);
        return 0;
    }

    // Adjust the min-heap below nstart; the heap top is the smallest of the top K
    template <class T>
    void ctopk<T>::heapadjust(int nstart, int nlen)
    {
        int nminchild = 0;
        T ftemp;

        while ((2 * nstart + 1) < nlen)
        {
            nminchild = 2 * nstart + 1;
            if ((2 * nstart + 2) < nlen)
            {
                // compare the left and right children and keep the index of the smaller
                if (m_data[2 * nstart + 2] < m_data[2 * nstart + 1])
                    nminchild = 2 * nstart + 2;
            }

            if (m_data[nstart] > m_data[nminchild])
            {
                // swap nstart with its smaller child
                ftemp = m_data[nstart];
                m_data[nstart] = m_data[nminchild];
                m_data[nminchild] = ftemp;

                // the subtree may now violate the heap property; keep adjusting
                nstart = nminchild;
            }
            else
            {
                // neither child is smaller, so the heap is intact
                break;
            }
        }
    }

    // Build the heap
    template <class T>
    void ctopk<T>::buildheap(int nlen)
    {
        // turn m_data[0, nlen-1] into a min-heap; only the heap property
        // is maintained, the array itself is not sorted
        for (int i = nlen / 2 - 1; i >= 0; i--)
            heapadjust(i, nlen);
    }

    int main(int argc, char* argv[])
    {
        string szfile;
        int nnum = 0;
        ctopk<int> objtopsum;

        cout << "Please input count file name:" << endl;
        cin >> szfile;
        cout << "Please input top num:" << endl;
        cin >> nnum;

        if (objtopsum.gettopk(szfile.c_str(), nnum) != 0)
            return 1;

        long long fsum = 0;  // 64-bit: the sum of 1 million ints can overflow int
        for (int i = 0; i < nnum; i++)
        {
            cout << objtopsum.m_data[i] << " ";
            fsum += objtopsum.m_data[i];
        }
        cout << "\ntop " << nnum << " value = " << fsum << endl;
        return 0;
    }

 

Counting the most popular search engine queries

Description:
The search engine records all the search strings used for each search using log files. The length of each query string is 1-bytes.
Suppose there are currently 10 million records, and these query strings are highly repetitive: although the total is 10 million, no more than 3 million distinct query strings remain once duplicates are removed. The more often a query string repeats, the more users searched for it, that is, the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.

Solution: hash table + heap

Step 1: Preprocess the data. Use a hash table to count the occurrences of each query string in O(N) time.
Step 2: Use a heap to find the top K in O(N' log K) time. A heap lets us read the minimum in constant time and adjust or move elements in logarithmic time. So maintain a min-heap of size K (10 in this problem), whose top element is kmin, and traverse the 3 million distinct queries, comparing the count x of each one with kmin: if x > kmin, replace the heap top and adjust the heap; otherwise leave the heap unchanged. The overall time complexity is O(N) + N' * O(log K), where N is 10 million and N' is 3 million.
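
The two steps can be sketched with standard containers as follows. This is an illustration under assumptions, not the original authors' code: the function name top_k_queries and the Entry alias are made up for this example, std::unordered_map stands in for the hash table, and the log is assumed to fit in a std::vector of strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<long long, std::string>;  // (count, query string)

// Hypothetical helper: the k most frequent strings in `log`, most frequent first.
std::vector<Entry> top_k_queries(const std::vector<std::string>& log, std::size_t k)
{
    assert(k > 0);

    // Step 1: count every query string with a hash table, O(N) on average.
    std::unordered_map<std::string, long long> counts;
    for (const std::string& query : log) {
        ++counts[query];
    }

    // Step 2: keep the k highest counts in a min-heap; top() plays the role of kmin.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : counts) {
        if (heap.size() < k) {
            heap.emplace(kv.second, kv.first);
        } else if (kv.second > heap.top().first) {
            heap.pop();
            heap.emplace(kv.second, kv.first);
        }
    }

    // Popping a min-heap yields ascending counts, so reverse for descending order.
    std::vector<Entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Only the N' distinct strings pass through the heap, and each heap operation costs O(log K), which matches the O(N) + N' * O(log K) bound given above.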

To keep the implementation simple, assume that every record is an English word: the user types a single English word into the search box, and we count the K words that occur most frequently among all the words entered. With the problem simplified in this way, the code is easier to write, as shown below:

    // Copyright @ yansha & July
    // July, updated, 2011.05.08

    // Problem description:
    // The search engine records all the search strings used for each search
    // using log files. The length of each query string is 1-bytes. Suppose
    // there are currently 10 million records. These query strings are highly
    // repetitive: although the total is 10 million, no more than 3 million
    // remain once duplicates are removed. The more often a query string
    // repeats, the more users searched for it, that is, the more popular it is.
    // Find the 10 most popular query strings, using no more than 1 GB of memory.

    #include <iostream>
    #include <string>
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    using namespace std;

    #define hashlen 2807303
    #define wordlen 30

    // node pointers
    typedef struct node_no_space* ptr_no_space;
    typedef struct node_has_space* ptr_has_space;
    ptr_no_space head[hashlen];

    struct node_no_space
    {
        char* word;
        int count;
        ptr_no_space next;
    };

    struct node_has_space
    {
        char word[wordlen];
        int count;
        ptr_has_space next;
    };

    // The simplest hash function
    int hash_function(const char* p)
    {
        int value = 0;
        while (*p != '\0')
        {
            value = value * 31 + *p++;
            // >= keeps the result in [0, hashlen); with > it could equal hashlen
            if (value >= hashlen)
                value = value % hashlen;
        }
        return value;
    }

    // Add a word to the hash table
    void append_word(const char* str)
    {
        int index = hash_function(str);
        ptr_no_space p = head[index];
        while (p != NULL)
        {
            if (strcmp(str, p->word) == 0)
            {
                (p->count)++;
                return;
            }
            p = p->next;
        }

        // create a new node
        ptr_no_space q = new node_no_space;
        q->count = 1;
        q->word = new char[strlen(str) + 1];
        strcpy(q->word, str);
        q->next = head[index];
        head[index] = q;
    }

    // Write the word counts to a file
    void write_to_file()
    {
        FILE* fp = fopen("result.txt", "w");
        assert(fp);

        int i = 0;
        while (i
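
The listing above breaks off inside write_to_file, before the heap stage is reached. As a rough sketch of how that stage could look over a chained hash table of this shape (not the original authors' code), the example below walks every bucket chain and keeps the K nodes with the highest counts in a min-heap. The names node, sift_down and top_k_nodes are hypothetical, and the node struct only mirrors node_no_space.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed node layout, mirroring node_no_space from the listing.
struct node {
    const char* word;
    int count;
    node* next;
};

// Restore the min-heap property (ordered by count) below index `start`.
void sift_down(std::vector<const node*>& heap, std::size_t start)
{
    const std::size_t n = heap.size();
    while (2 * start + 1 < n) {
        std::size_t child = 2 * start + 1;
        if (child + 1 < n && heap[child + 1]->count < heap[child]->count) {
            ++child;                                  // pick the smaller child
        }
        if (heap[start]->count <= heap[child]->count) {
            break;                                    // heap property already holds
        }
        std::swap(heap[start], heap[child]);
        start = child;
    }
}

// Walk every bucket chain and keep the k nodes with the highest counts.
// The result is in heap order, not sorted; once k nodes have been seen,
// element 0 is the least frequent of the kept nodes.
std::vector<const node*> top_k_nodes(node* const* table, std::size_t buckets, std::size_t k)
{
    assert(k > 0);
    std::vector<const node*> heap;
    for (std::size_t b = 0; b < buckets; ++b) {
        for (const node* p = table[b]; p != nullptr; p = p->next) {
            if (heap.size() < k) {
                heap.push_back(p);
                if (heap.size() == k) {               // first k nodes collected: build the heap
                    for (std::size_t i = k / 2; i-- > 0;) {
                        sift_down(heap, i);
                    }
                }
            } else if (p->count > heap[0]->count) {
                heap[0] = p;                          // replace kmin and re-adjust
                sift_down(heap, 0);
            }
        }
    }
    return heap;
}
```

With the table from the listing, the call would look like top_k_nodes(head, hashlen, 10), after converting the element type to match.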

 

References:

http://blog.csdn.net/andylin02/archive/2008/11/28/3401123.aspx

http://blog.csdn.net/v_JULY_v/archive/2011/05/08/6403777.aspx

 
