Programmer Intelligence Platform: High-Frequency Vocabulary Extraction

Source: Internet
Author: User

"The competition of technology and the collision of thinking: the Intelligence Platform, a stage for programmers to show their wisdom."

High-frequency vocabulary extraction: optimizing a program that counts high-frequency words in plain text / Wang Yao

High-frequency vocabulary extraction is a very interesting problem and a typical computer algorithm problem. It mainly involves two classic topics, "sorting" and "searching", and its speed depends on the design of the corresponding data structures, so it is good basic training for programmers. The algorithm can be divided into two parts: "counting" and "sorting".

Let's look at the "counting" part first, and simplify the problem by assuming that every word consists of exactly five letters, which gives 26^5 = 11,881,376 possible combinations. We can define an int array with 11,881,376 elements, occupying about 47 MB of memory, and use it as the counter. We traverse the file once, using the five letters at the current position as an index into the int array. For example, if the five letters are "abcde", we subtract 97 from the ASCII code of each letter to obtain the sequence 0, 1, 2, 3, 4, and then compute 4 + 3*26 + 2*(26^2) + 1*(26^3) + 0*(26^4) = 19010, which is the index. We then add 1 to the array element at that index. When the traversal is finished, the counting part is complete.

The problem is that the words we want to count are not limited to five letters, and each added letter multiplies the number of possible combinations by 26. Take 8 letters as an example: 26^8 is about 208 G (roughly 2.09 * 10^11) combinations, and at 4 bytes per int the counter would need about 832 GB of memory, which we cannot afford. Moreover, an entry is made up of n characters, and we do not know in advance how long each one is. So we go back to the starting point and rethink the counting algorithm. First, we certainly need a counter.
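Before moving on, the fixed-length counting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `word_index` and `count_windows` are hypothetical, windows containing anything other than the letters a to z are simply skipped, and the word length is a parameter so that the idea can be tried with a small table.

```python
from array import array

ALPHABET_SIZE = 26


def word_index(word):
    """Read a lowercase a-z word as a base-26 number (a=0, ..., z=25)."""
    index = 0
    for ch in word:
        index = index * ALPHABET_SIZE + (ord(ch) - ord("a"))
    return index


def count_windows(text, word_len=5):
    """Count every run of word_len letters, one counter per possible word.

    The counter has 26**word_len slots, so word_len=5 allocates
    11,881,376 unsigned ints (about 47 MB with 4-byte ints).
    Assumption: windows containing non-letters are skipped.
    """
    counters = array("I", [0]) * (ALPHABET_SIZE ** word_len)
    text = text.lower()
    for i in range(len(text) - word_len + 1):
        window = text[i:i + word_len]
        if all("a" <= ch <= "z" for ch in window):
            counters[word_index(window)] += 1
    return counters
```

For instance, `word_index("abcde")` reproduces the index 19010 computed above. The table size grows as 26 to the power of the word length, which is exactly why this direct approach stops being practical for longer entries.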
Each item in the counter corresponds to one entry, and the counter must fit in memory. This is achievable: a 10 MB text can never contain more than 10 M distinct entries, and the memory of a modern computer is enough for most texts, so the only question is how efficiently the counter uses its space.

(Of course, for a very large text, not only may the file itself exceed the memory capacity, but the number of distinct entries in it may too. In that case the file can be read and counted in segments: when the counter reaches a certain size it is written to disk and a fresh counter is started, and at the end the counters are merged. This amounts to a caching technique. The operating system has its own caching mechanism, so one could simply use memory and leave the caching to the operating system, but for our application the operating system's mechanism may not be efficient. This article does not discuss that aspect; it simply assumes that the distinct entries in the file fit in memory.)

The counter must also be easy to index. That is, for the entry of n characters at each position in the file, its slot in the counter must be found efficiently so that it can be counted. The fastest search structure is a hash table: its search time is roughly constant and does not depend on the total number of entries. The counter described earlier is in fact a hash table: the index formula is the hash function, each array element is a "bucket", and each bucket holds at most one entry, which is very fast and ideal. However, since memory is limited, we must keep the hash table to a tolerable size. If the size is fixed at 11,881,376 as above, then once the number of "possible" entries is much larger than about 11 M, a bucket may have to hold multiple entries.
However, a text file of 11,881,376 bytes cannot actually contain more than 11,881,376 entries. If the hash function is well designed, the "actual" entries are spread evenly over different buckets, each bucket still holds only one entry, and searching remains very fast. The hash function is therefore the key. We naturally have two requirements for its design:

1) Different actual entries should fall into different buckets as far as possible. If many different entries are concentrated in a few buckets while other buckets stay empty, search efficiency drops.
2) The calculation should be as simple as possible, because every "insert" or "search" on the hash table has to evaluate the hash function, so its complexity directly affects the speed of the program.

One workable scheme is to sample 5 bytes evenly from each entry (for example, if the entry has 10 bytes, take the 1st, 3rd, 5th, 7th, and 9th) and compute the index as before. Entries shorter than five bytes are padded with Z. (Since entries may contain spaces, each byte actually has 27 possible values, so the hash table size becomes 27^5 = 14,348,907; the space is assigned the value 26 and the letters A to Z the values 0 to 25, case-insensitively.)

Tip: when memory is plentiful, the hash table can be made larger to reduce the number of entries per bucket and improve speed. It could be set to several tens of MB, with the hash function adjusted accordingly.

In any case, we can only expect each bucket to hold one entry "in general". As long as the hash table is smaller than the number of "possible" entries (which, as analyzed above, is almost certain), we still have to deal with several entries in one bucket, and because test files differ, we cannot predict in advance how many entries the fullest bucket will hold.
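The sampling hash function described above might look like the following Python sketch. The names `symbol_value` and `entry_hash` are hypothetical, and one detail is an assumption that the text does not settle: any character that is neither a letter nor a space is treated like a space (value 26).

```python
SYMBOLS = 27                      # a-z plus space
SAMPLES = 5                       # bytes sampled from each entry
TABLE_SIZE = SYMBOLS ** SAMPLES   # 27^5 = 14,348,907 buckets


def symbol_value(ch):
    """a-z map to 0-25; a space maps to 26.

    Assumption: every other character is treated like a space.
    """
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    return 26


def entry_hash(entry):
    """Sample SAMPLES characters evenly and read them as a base-27 number."""
    entry = entry.lower()                      # case-insensitive
    if len(entry) < SAMPLES:
        entry += "z" * (SAMPLES - len(entry))  # pad short entries with z
    n = len(entry)
    h = 0
    for i in range(SAMPLES):
        # For n = 10 this picks positions 0, 2, 4, 6, 8,
        # i.e. the 1st, 3rd, 5th, 7th and 9th characters.
        h = h * SYMBOLS + symbol_value(entry[i * n // SAMPLES])
    return h
```

Because only five characters are sampled, entries such as "abcdefghij" and "acegi" land in the same bucket, which is why each bucket still needs a way to hold several entries.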
Therefore, we need a linked list to hold this unpredictable number of entries. Note that the list should not be implemented by requesting memory from the system each time an entry is added. The system's memory manager uses strategies such as "first fit" and "best fit", or a mixture of them, and its implementation is complicated, so an allocation is nowhere near as cheap as an addition. Frequently allocating and releasing memory is bound to be inefficient.

Instead, we set aside a second block of memory to store these entries (call it Memory B, and call the hash table Memory A). The hash table then stores neither the entry content nor the entry count, only links to the items. Each time a new entry is encountered, it is placed in Memory B sequentially and its position (pointer) in Memory B is recorded in the corresponding bucket of the hash table. If the bucket already holds a pointer, we follow it to the existing items and make the successor pointer of the last one point to the new entry. In this way, each non-empty bucket holds a pointer to an item in Memory B, which in turn holds a pointer to the second item in the same bucket, and so on; where there is no successor, the pointer is null. This is the familiar chained storage of hash tables.

To avoid frequent allocation, Memory B can be requested as large as practical up front, but the number of items it holds never needs to exceed the number of bytes in the file, because a 5 MB file cannot contain 6 M entries. As a conservative estimate, suppose our program has to process files of up to 20 MB, with about 4 letters per word; allowing for space characters and repeated entries, room for 4 M items in Memory B is enough. Alternatively, Memory B can be grown dynamically.
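The chained layout described above can be sketched as follows. This is a minimal illustration, not the author's program: the class name `ChainedCounter` and its fields are hypothetical, Python lists stand in for the two memory blocks, and list indices play the role of pointers, with -1 as the null pointer. In a language such as C, Memory B would be one large allocation made up front.

```python
NULL = -1  # stands in for a null pointer


class ChainedCounter:
    """Hash table (Memory A) whose buckets point into an item pool (Memory B)."""

    def __init__(self, table_size, hash_fn):
        self.hash_fn = hash_fn
        # Memory A: each bucket holds the index of its first item, or NULL.
        self.buckets = [NULL] * table_size
        # Memory B: items are appended sequentially and never freed one by one.
        self.entries = []    # entry string of each item
        self.counts = []     # occurrence count of each item
        self.next_item = []  # index of the next item in the same bucket

    def _new_item(self, entry):
        self.entries.append(entry)
        self.counts.append(1)
        self.next_item.append(NULL)
        return len(self.entries) - 1

    def add(self, entry):
        bucket = self.hash_fn(entry) % len(self.buckets)
        item = self.buckets[bucket]
        if item == NULL:
            self.buckets[bucket] = self._new_item(entry)
            return
        while True:
            if self.entries[item] == entry:
                self.counts[item] += 1
                return
            if self.next_item[item] == NULL:
                # End of the chain: link the new item after the last one.
                self.next_item[item] = self._new_item(entry)
                return
            item = self.next_item[item]

    def count(self, entry):
        item = self.buckets[self.hash_fn(entry) % len(self.buckets)]
        while item != NULL:
            if self.entries[item] == entry:
                return self.counts[item]
            item = self.next_item[item]
        return 0
```

Any function that maps an entry to a non-negative integer can be passed as `hash_fn`; a deliberately poor one, such as `len`, is handy for forcing collisions when experimenting.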
When Memory B runs out, allocate a larger block (for example, twice the size), copy the data over, and release the old block. If even that is not enough, you can manage a cache yourself (not discussed in this article).

Now we need to consider how to search among the items in each bucket. With a chained linear structure we have to traverse the chain, so the time for each search or insertion is proportional to the number of items in the bucket. This is not economical. An improvement is to use a binary search tree, as follows. Each item in Memory B contains the entry string, a count field, and two pointers, llink and rlink, which point to an item in the same bucket that is smaller than it and to one that is larger than it, respectively (or are null). Apply the hash function to the current entry and find the corresponding bucket in the hash table. If the bucket holds a pointer, compare the entry with the item the pointer refers to (item A). If they are equal, add 1 to the count field of that item. If the entry is larger than item A, compare it with the item that the rlink of item A points to (item B), and so on. If the entry is smaller than item B and item B has no llink, the current entry is placed after all existing items in Memory B and the llink of item B is set to point to it.

The binary search tree has been analyzed mathematically in depth. Readers familiar with data structures and search algorithms will know that its insertion and search times are, on average, proportional to lg n (where n is the number of items in a single tree and lg is the logarithm to base 2). It is a recommended algorithm.
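The tree-structured bucket described above might be sketched like this. The class name `TreeBucketCounter` is hypothetical; `llink` and `rlink` follow the field names in the text, Python lists stand in for Memory B, and indices stand in for pointers, with -1 as null. Entries are compared with ordinary string comparison, which is an assumption about what "smaller" and "larger" mean here.

```python
NIL = -1  # stands in for a null pointer


class TreeBucketCounter:
    """Hash table whose buckets are roots of binary search trees in a pool."""

    def __init__(self, table_size, hash_fn):
        self.hash_fn = hash_fn
        self.roots = [NIL] * table_size  # hash table: root item per bucket
        # Item pool (Memory B): one fixed-size record per distinct entry.
        self.entries = []
        self.counts = []
        self.llink = []  # index of a smaller item in the same bucket
        self.rlink = []  # index of a larger item in the same bucket

    def _new_item(self, entry):
        self.entries.append(entry)
        self.counts.append(1)
        self.llink.append(NIL)
        self.rlink.append(NIL)
        return len(self.entries) - 1

    def add(self, entry):
        bucket = self.hash_fn(entry) % len(self.roots)
        item = self.roots[bucket]
        if item == NIL:
            self.roots[bucket] = self._new_item(entry)
            return
        while True:
            current = self.entries[item]
            if entry == current:
                self.counts[item] += 1
                return
            # Go left for smaller entries, right for larger ones.
            links = self.llink if entry < current else self.rlink
            if links[item] == NIL:
                links[item] = self._new_item(entry)
                return
            item = links[item]

    def count(self, entry):
        item = self.roots[self.hash_fn(entry) % len(self.roots)]
        while item != NIL:
            current = self.entries[item]
            if entry == current:
                return self.counts[item]
            item = self.llink[item] if entry < current else self.rlink[item]
        return 0
```

New items are always appended at the end of the pool, so the records stay in arrival order, and only the link fields encode the tree shape.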
In some extreme cases the binary search tree degenerates (every node has only one child, so it becomes a linear linked list in disguise), but the probability of this is very small. There is an algorithm called the "AVL tree", a balanced tree that prevents this degeneration, but it also lowers the average speed (because the algorithm is more complex); its advantage only shows when a tree has more than 256 nodes. Because a hash table is applied before the binary search tree, a balanced tree is unnecessary here.

In addition, we need another memory block (Memory C) to store the strings. When we add an item to Memory B, we store the entry string in Memory C (sequentially, with "\0" at the end of each string) and record the string pointer in the corresponding item of Memory B. The string is not placed in Memory B itself because entry lengths vary and we will sort Memory B later: with every item the same size, the items are easy to traverse and move. Memory C should also start out large (for example, 10 MB), and it can be grown dynamically just like Memory B; the principle is the same.

Now let's discuss sorting. The "sorting" in this problem differs from sorting in the traditional sense: we only need to find the m most frequent entries. Entries outside the top m need not be sorted, and neither do those within it. A classic sorting algorithm would therefore perform many redundant steps, so we start again from the basic ideas behind sorting algorithms.

The first idea that comes to mind is selection by comparison. We pick all items in Memory B whose count is greater than 1 and move them into a new block (Memory D), noting the maximum count i as we go. If Memory D then holds more than m items, we halve the threshold (i = i/2) and move the items in D whose count is greater than i back into B. If D holds fewer than m items, it is topped up from B.
In this way Memory B and Memory D are used in turn: each pass picks out part of the data and excludes the rest, the range keeps narrowing, and finally the m largest items are left. If the maximum count is 1024, the number of "move" passes will not exceed lg 1024 = 10, and the number of items to be moved shrinks greatly from pass to pass.

A further improvement is selection by distribution. Divide all items into 10 regions by count. Starting from the region with the highest counts, accumulate the number of items region by region until the total is greater than or equal to m. If it equals m, we are done. If it is greater than m, the region that pushed the total past m is divided again in the same way, and so on. The number of rounds of division is therefore bounded by a logarithm to base 10. With this algorithm, a typical file needs about three to four rounds.

Digression: in fact, if we take the topic phrase "searching for high-frequency words" itself as input with m = 5, the top five entries are "Search", "sogao", "High Frequency", "Frequency Word", and "word". The second and fourth are obviously not valid words, so we need a large dictionary. Even then, linguistic knowledge such as syntactic analysis is necessary. For example, if the high-frequency word "Korean Ginseng" appears in a text, we obviously should not extract "beauty" from it. However, as an "intelligence platform" topic, we will not go into that more specialized territory.
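The selection by distribution described earlier can be sketched as follows. This is a minimal illustration under assumptions, not the author's implementation: the function name `top_m` is hypothetical, items are plain `(entry, count)` pairs rather than records in Memory B, the 10 regions are equal-width ranges of the count, and ties at the cut-off are broken arbitrarily.

```python
def top_m(items, m, regions=10):
    """Return m (entry, count) pairs with the largest counts, in no particular order."""
    selected = []
    candidates = list(items)
    while m > 0 and candidates:
        if len(candidates) <= m:
            selected.extend(candidates)
            break
        lo = min(count for _, count in candidates)
        hi = max(count for _, count in candidates)
        if lo == hi:
            # All remaining counts are equal: any m of them will do.
            selected.extend(candidates[:m])
            break
        # Distribute the candidates into equal-width regions by count.
        width = (hi - lo) // regions + 1
        bins = [[] for _ in range(regions)]
        for item in candidates:
            bins[(item[1] - lo) // width].append(item)
        # Accumulate whole regions, starting from the highest counts.
        for region in reversed(bins):
            if len(region) <= m:
                selected.extend(region)
                m -= len(region)
            else:
                # This region overshoots m: subdivide only this region.
                candidates = region
                break
    return selected
```

Each round either finishes or narrows the work to a single region whose range of counts is strictly smaller than before, so no item outside the overflowing region is touched again.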

