Finding the top 10 duplicates among 10 million text messages
Analysis: Some candidates want to solve this problem with a database: first load the messages into a database, then use a SELECT statement to retrieve the top 10 messages. In practice, a database cannot meet the 5-minute limit set by the problem. Even at 10,000 inserts per second (already a very fast rate of data entry), only 10,000 × 300 = 3 million of the 10 million messages can be loaded in 5 minutes. And even if all 10 million could be loaded in 5 minutes, an index would have to be built first, otherwise the SQL statement would not return a result within 5 minutes; yet indexing 10 million records within 5 minutes is not feasible either. So the database approach does not work.
This type of problem arises because internet companies must constantly process the huge volume of data and logs generated by their users. Massive-data problems are therefore very popular, and almost every internet company includes them in its interviews. They test your data structure design and your basic algorithm skills. A similar problem is how to find the top 10 sites accessed by keyword.
Answer:
Method 1: Use a hash table. Divide the 10 million messages into groups and build a hash table for each group while scanning it. On the first scan, take the first byte, the last byte, and two bytes from the middle of each message (picked by a fixed rule, so that identical messages always get the same code) as its hash code, and insert it into the hash table, recording the message's address, length, and number of repetitions. For 10 million messages, this bookkeeping still fits in memory. Two messages with the same hash code and the same length are suspected duplicates and are compared in full. A message that is already in the table is not inserted again; only its repetition count is incremented by 1. After this single scan, every repetition count has been recorded, and the second pass runs over the hash table rather than over the messages. Linear-time selection then finds the top 10 in O(n). After grouping, the top 10 lists of the groups must be guaranteed not to overlap, that is, identical messages must always fall into the same group; this can be ensured by hashing, or by assigning groups directly according to the hash value.
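The scan-and-count step of Method 1 can be sketched in Python as follows. This is only an illustration under assumptions: the names `cheap_hash` and `top_duplicates` are hypothetical, the two "middle" bytes are taken at one third and two thirds of the message, everything is held in a single in-memory table rather than in groups, and `heapq.nlargest` stands in for the linear-time selection mentioned above.

```python
import heapq

def cheap_hash(msg: bytes) -> tuple:
    """Cheap key: first byte, last byte, two middle bytes, and the length.

    The middle positions (1/3 and 2/3 of the length) are an assumption;
    any fixed rule works as long as equal messages get equal keys.
    """
    n = len(msg)
    if n == 0:
        return (0, 0, 0, 0, 0)
    return (msg[0], msg[-1], msg[n // 3], msg[(2 * n) // 3], n)

def top_duplicates(messages, k=10):
    """Return the k most repeated messages as (count, message) pairs."""
    table = {}  # cheap key -> list of [message, count] entries
    for msg in messages:
        bucket = table.setdefault(cheap_hash(msg), [])
        for entry in bucket:
            # Same key and length: suspected duplicate, so compare in full.
            if entry[0] == msg:
                entry[1] += 1  # already present: only bump the count
                break
        else:
            bucket.append([msg, 1])  # first occurrence of this message
    # Second pass over the table only, not over the messages.
    entries = (e for bucket in table.values() for e in bucket)
    # nlargest is O(n log k); it replaces linear-time selection here.
    return heapq.nlargest(k, ((count, msg) for msg, count in entries))
```

Calling `top_duplicates(messages)` on an iterable of `bytes` objects returns the pairs sorted from most to least frequent. To split the work into groups as the text describes, one option is to route each message to a group by its hash value, so that identical messages never end up in different groups.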
Method 2: Process the messages from short to long. Experience suggests that, apart from mass-sent holiday greetings, the shorter a message is, the more likely it is to be repeated. So start with the shortest messages: first search the one-character messages, find their top 10 duplicates and record the counts, then search the two-character messages, and so on. To search messages of the same length faster, besides algorithms such as hashing, you can do a rough comparison that looks only at the characters at the head, the tail, and a few other positions. Because this rough comparison speeds up the search but may not yield the true top 10, its results need to be marked. After all lengths have been searched, pick the candidate top 10 from the per-length top 10 results; if a candidate carries the mark, search all messages of the corresponding length exactly to find their true top 10, and compare again.
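A simplified Python sketch of the length-by-length idea in Method 2 is shown below. It assumes the messages fit in memory, and it uses exact counting for each length, so the rough head/tail comparison and the marking step are left out; the function name `top_by_length` is hypothetical.

```python
from collections import Counter, defaultdict
import heapq

def top_by_length(messages, k=10):
    """Find the k most repeated messages, working one length at a time."""
    groups = defaultdict(list)  # message length -> messages of that length
    for msg in messages:
        groups[len(msg)].append(msg)

    candidates = []  # per-length top-k results as (message, count) pairs
    for length in sorted(groups):  # shortest messages first
        counts = Counter(groups[length])
        candidates.extend(counts.most_common(k))

    # The overall top k must be among the per-length top k lists.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])
```

Because messages of different lengths can never be equal, the per-length top 10 lists cannot overlap, which is why merging them is enough to get the overall result.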
Method 3: Use memory mapping. First, at current message lengths, 10 million text messages take up no more than 1 GB, so a memory-mapped file is a good fit. The whole file can be mapped at once (with a larger data set, you can map it in segments), which greatly speeds up loading because it avoids frequent file I/O calls and frequent small memory allocations. Second, group the messages by the ASCII value of their i-th letter (i from 0 to 70), which in effect builds a tree: i is the depth in the tree and also the position of the letter within the message.
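The loading half of Method 3 might look like the following Python sketch, which uses the standard `mmap` module. The file layout (one message per line, separated by `\n`) and the name `iter_messages` are assumptions made for illustration; the source does not say how the messages are stored.

```python
import mmap

def iter_messages(path):
    """Yield each message as bytes from a memory-mapped file.

    Assumes one message per line. Note that mapping an empty file
    raises ValueError, so the file must contain at least one byte.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; a larger file could be mapped
        # in segments by passing a length and an offset instead.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            start = 0
            while start < size:
                end = mm.find(b"\n", start)
                if end == -1:  # last line has no trailing newline
                    end = size
                yield mm[start:end]
                start = end + 1
```

The operating system pages the file in on demand, so the loop reads messages without an explicit read call per message.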
There are two main problems to solve: loading the content and comparing message contents. Memory-mapping the file solves the loading performance problem (there is no need to call file I/O functions, nor to allocate a small chunk of memory each time a message is read), and the tree effectively reduces the number of comparisons.
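The tree used for comparison can be read as a trie keyed on successive bytes of each message. The sketch below is one possible interpretation, not necessarily the exact structure the author had in mind; `TrieNode`, `count_with_trie`, and `top_k` are hypothetical names, and the messages are assumed to be `bytes` objects.

```python
import heapq

class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # byte value -> child node
        self.count = 0      # number of messages ending at this node

def count_with_trie(messages):
    """Insert every message into a trie, counting exact duplicates."""
    root = TrieNode()
    for msg in messages:
        node = root
        for byte in msg:  # depth i corresponds to the i-th byte
            child = node.children.get(byte)
            if child is None:
                child = TrieNode()
                node.children[byte] = child
            node = child
        node.count += 1
    return root

def top_k(root, k=10):
    """Walk the trie and return the k largest (count, message) pairs."""
    heap = []  # min-heap holding the best k pairs seen so far
    stack = [(root, b"")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            heapq.heappush(heap, (node.count, prefix))
            if len(heap) > k:
                heapq.heappop(heap)  # drop the smallest pair
        for byte, child in node.children.items():
            stack.append((child, prefix + bytes([byte])))
    return sorted(heap, reverse=True)
```

Each message is compared one byte at a time against the shared prefixes already in the tree, so two messages are never compared against each other in full.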