"Giventwo log files, each with a billion usernames (each username appendedto the log file), find the usernames existing in both documents inthe most efficient manner. use c/c ++ code. if your code callspre-existing library functions, create each library function fromscratch."
Requirement: C/C ++.
Submission: Source code and a document to describe your solution.
The above is a question I was given when I applied for an R&D position at a company. I don't know whether my answer is right, but I found the problem quite interesting, so I would like to summarize my ideas and code here.
At the time, my first reaction to this question was to solve it with a hash table.
Having decided on a hash table, I started working out how to build a simple and efficient one. The idea is as follows. I assumed each file holds about 0.1 billion (100 million) user names, so it is impractical to allocate space for every user to store the corresponding string. Instead, we can allocate one contiguous block of memory and set every bit in it to 0. For example, a contiguous block of 1000 bits, all set to 0, takes about 125 bytes, since each byte has 8 bits. By the same reasoning, for about 0.1 billion users we can allocate 1 × 10^8 bits, which is about 12.5 MB. This block serves as the buckets of the hash table (hashMap in the source file).
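To make the sizing arithmetic concrete, here is a minimal C sketch. The helper name `bitmap_bytes` is hypothetical and does not appear in the source file; it simply rounds a bit count up to whole bytes and checks the two figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the original source file):
 * number of bytes needed to hold `bits` bits, rounded up. */
size_t bitmap_bytes(size_t bits)
{
    return (bits + 7) / 8;
}

/* Checks the two figures quoted in the text. */
void check_sizes(void)
{
    assert(bitmap_bytes(1000) == 125);            /* 1000 bits -> 125 bytes */
    assert(bitmap_bytes(100000000) == 12500000);  /* 1 x 10^8 bits -> 12.5 MB */
}
```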
Next, we clear every bit of this space to 0 (the clr function in the source file). Then we read the first file, map each user name to a specific number that serves as its hash code, and set the bit at that position in the hash table to 1 (the set function in the source file). This way, once the first file has been read, the bit corresponding to every user name in the file is 1, regardless of whether a position was hit more than once.
Then we read the second file. As before, we first generate the hash code from the user name, then check whether the corresponding bit in the hash table is 1 (the test function in the source file). If it is, the user name has appeared in the first file, so we print it to the screen. We continue reading user names until the end of the file.
And that's about it, haha. This is the overall framework of the entire source file.
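The two passes can be sketched on a tiny scale as follows. This is only an illustration under assumptions: the hash codes are taken as already computed integers, the bitmap has just 64 bits, and the names `bit_clear_all`, `bit_set`, `bit_test`, `count_common`, and `demo_count` are hypothetical rather than the ones in the source file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBITS 64u       /* toy size; the text uses 1 x 10^8 bits */
#define WORD_BITS 32u

static unsigned int bits[(NBITS + WORD_BITS - 1) / WORD_BITS];

void bit_clear_all(void) { memset(bits, 0, sizeof bits); }

void bit_set(unsigned int i)
{
    assert(i < NBITS);
    bits[i / WORD_BITS] |= 1u << (i % WORD_BITS);
}

int bit_test(unsigned int i)
{
    assert(i < NBITS);
    return (int)((bits[i / WORD_BITS] >> (i % WORD_BITS)) & 1u);
}

/* Pass 1 marks every code from the first list; pass 2 counts the codes
 * from the second list whose bit is already set. */
size_t count_common(const unsigned int *first, size_t n1,
                    const unsigned int *second, size_t n2)
{
    size_t i, hits = 0;
    bit_clear_all();
    for (i = 0; i < n1; i++)
        bit_set(first[i]);
    for (i = 0; i < n2; i++)
        if (bit_test(second[i]))
            hits++;
    return hits;
}

/* Small demo: code 3 is repeated in the first list, only 40 is shared. */
size_t demo_count(void)
{
    const unsigned int first[] = {3, 40, 3};
    const unsigned int second[] = {40, 7, 63};
    return count_common(first, 3, second, 3);
}
```

In this demo only code 40 is counted. Note that a set bit only records that some name with that code was seen, so two different names that share a hash code cannot be told apart at this stage.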
Having covered the overall framework, let's turn to the details. The harder part is actually how to generate the hash code from the user name.
Here I assume that each user name in the file is a string of letters and digits. All I have to do is map that string to a specific number. So how should we do it?
I thought about it for a long time, and there are many possible approaches. I considered using the product of each character's offset from 'A' as the hash code, but this may cause different user names to produce the same hash code, so I discarded the idea. I then considered summing the squares of each character's offset from 'A', but the same problem can still occur. After thinking it over for a long time, I settled on this solution:
Put the user name in a character array and read each character together with its subscript. If the character is a letter, compute its offset from 'A' and multiply it by its subscript + 1. If the character is a digit 0-9, take its numeric value and multiply it by its subscript + 1. Finally, the hash code of the user name is the sum of these products over all characters. The corresponding function in the source file is hash_code.
Well, that is the general idea of the whole solution, with both the framework and the details written down. The source file compiles without problems, but I did not test it with any data.
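As a worked example of this scheme, the sketch below computes the code for a few short names. The function `name_hash` is a hypothetical value-returning variant written for illustration; the function in the source file is hash_code, which returns its result through a pointer. For "AB1" the products are 0 × 1 for 'A', 1 × 2 for 'B', and 1 × 3 for '1', giving 0 + 2 + 3 = 5.

```c
#include <assert.h>

/* Hypothetical variant of the scheme described above:
 * letters contribute (c - 'A') * (index + 1),
 * digits contribute (c - '0') * (index + 1),
 * any other character contributes nothing. */
int name_hash(const char *name)
{
    int sum = 0;
    int pos;
    assert(name != 0);
    for (pos = 0; name[pos] != '\0'; pos++) {
        char c = name[pos];
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            sum += (c - 'A') * (pos + 1);
        else if (c >= '0' && c <= '9')
            sum += (c - '0') * (pos + 1);
    }
    return sum;
}
```

With this definition `name_hash("AB1")` is 5, while `name_hash("A")` and `name_hash("AA")` are both 0, so distinct names can still share a code. A name reported as common is therefore best treated as a candidate rather than a confirmed match.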
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BITSPERWORD 32   // number of bits in each array element
#define N 100000000      // number of bits to allocate

unsigned int hashMap[1 + N / BITSPERWORD];  // bit array used as the hash table

// set:  set the bit corresponding to i to 1
// clr:  initialize all bits to 0
// test: return 1 if the bit corresponding to i is 1, otherwise 0
void set(int i) { hashMap[i / BITSPERWORD] |= (1u << (i % BITSPERWORD)); }
void clr(void) { memset(hashMap, 0, sizeof(hashMap)); }
int test(int i) { return (hashMap[i / BITSPERWORD] >> (i % BITSPERWORD)) & 1u; }

// hash_code: generate the hash code of a user name
// mapFile:   map the user names in the first file into the hash array
// findUser:  find the user names that exist in both files and print them
void hash_code(char *src, int *hash);
void mapFile(char *fileName);
void findUser(char *fileName);

int main(void) {
    char filename1[40];  // first file name
    char filename2[40];  // second file name

    clr();  // start with every bit set to 0

    printf("Enter the first file name: ");
    if (fgets(filename1, sizeof(filename1), stdin) == NULL) return 1;
    filename1[strcspn(filename1, "\n")] = '\0';  // strip the trailing newline
    mapFile(filename1);  // map the user names of the first file into the hash table

    printf("Enter the second file name: ");
    if (fgets(filename2, sizeof(filename2), stdin) == NULL) return 1;
    filename2[strcspn(filename2, "\n")] = '\0';
    findUser(filename2);  // search for and print the shared user names

    getchar();
    return 0;
}

void mapFile(char *fileName) {
    FILE *fp;
    char str_user[80];  // holds the current user name
    int hash = 0;       // holds the hash code

    if ((fp = fopen(fileName, "r")) == NULL) {  // open the file; exit if it cannot be opened
        printf("the file does not exist!!!\n");
        exit(EXIT_FAILURE);
    }
    while (fscanf(fp, "%79s", str_user) == 1) {  // read one user name into str_user
        hash_code(str_user, &hash);  // generate the corresponding hash code
        set(hash);                   // set the bit for this hash code to 1
        memset(str_user, '\0', sizeof(str_user));  // clear str_user before the next iteration
    }
    fclose(fp);  // close the file
}

void findUser(char *fileName) {
    FILE *fp;
    char str_user[80];  // holds the current user name
    int hash = 0;       // holds the hash code

    if ((fp = fopen(fileName, "r")) == NULL) {  // open the file; exit if it cannot be opened
        printf("the file does not exist!!!\n");
        exit(EXIT_FAILURE);
    }
    while (fscanf(fp, "%79s", str_user) == 1) {  // read one user name into str_user
        hash_code(str_user, &hash);  // generate the corresponding hash code
        if (test(hash)) {  // the bit is 1, so the name was seen in the first file
            printf("this user name exists in both files: %s\n", str_user);
        }
        memset(str_user, '\0', sizeof(str_user));  // clear str_user before the next iteration
    }
    fclose(fp);
}

// src is the character array holding the user name;
// hash receives the generated hash code
void hash_code(char *src, int *hash) {
    char *p = src;
    int index1 = 0;  // 1-based position of the character in the array
    int index2 = 0;  // distance of the character from 'A' (letters) or from '0' (digits)
    int sum = 0;

    while (*p) {
        index1 += 1;
        if ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z')) {
            index2 = *p - 'A';
            sum += index1 * index2;
        } else if (*p >= '0' && *p <= '9') {
            index2 = *p - '0';
            sum += index1 * index2;
        }
        ++p;
    }
    *hash = sum;
}
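Since the program above was never run against data, a small self-check along the following lines could be a starting point. It is only a sketch under assumptions: the names `code_of`, `count_common_files`, and `self_check` are hypothetical, the bitmap is shrunk to 4096 bits with codes reduced modulo that size, and the two inputs are temporary files created with the standard `tmpfile` function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NBITS 4096  /* toy size, an assumption for this sketch */
static unsigned int bitmap[NBITS / 32];

/* Same scheme as hash_code, reduced modulo NBITS to fit the toy bitmap. */
static int code_of(const char *s)
{
    int sum = 0, pos;
    for (pos = 0; s[pos] != '\0'; pos++) {
        char c = s[pos];
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            sum += (c - 'A') * (pos + 1);
        else if (c >= '0' && c <= '9')
            sum += (c - '0') * (pos + 1);
    }
    return sum % NBITS;
}

/* Marks every name in `first`, then counts names in `second` whose bit is set. */
int count_common_files(FILE *first, FILE *second)
{
    char name[80];
    int hits = 0;
    assert(first != NULL && second != NULL);
    memset(bitmap, 0, sizeof bitmap);
    while (fscanf(first, "%79s", name) == 1) {
        int h = code_of(name);
        bitmap[h / 32] |= 1u << (h % 32);
    }
    while (fscanf(second, "%79s", name) == 1) {
        int h = code_of(name);
        if ((bitmap[h / 32] >> (h % 32)) & 1u)
            hits++;
    }
    return hits;
}

/* Writes two tiny logs to temporary files and returns the number of hits,
 * or -1 if the temporary files could not be created. */
int self_check(void)
{
    FILE *a = tmpfile();
    FILE *b = tmpfile();
    int hits = -1;
    if (a != NULL && b != NULL) {
        fputs("alice\nbob\n", a);
        fputs("bob\ncarol\n", b);
        rewind(a);
        rewind(b);
        hits = count_common_files(a, b);
    }
    if (a != NULL) fclose(a);
    if (b != NULL) fclose(b);
    return hits;
}
```

With these two logs `self_check()` should return 1, since only bob appears in both files and the three names have distinct codes (554 for alice, 224 for bob, 644 for carol).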
After finishing this question, I sent my solution to the company and soon received an invitation from them. When we discussed the specific compensation, however, I felt that what they offered did not meet my expectations, and in the end I declined. I was very disappointed. Sometimes I feel that developers are really undervalued. Were my expectations too high? Afterwards, I began to doubt myself.
This article is from the "Rain lonely forest" blog; please be sure to keep this source link: http://coderlin.blog.51cto.com/7386328/1302480