About 10% of the queries a search engine processes contain errors (the conclusion of an MSRA paper). For a Chinese search engine, pinyin correction can handle many such queries. For example:
Zhou huajian
Zhou xingchi Zhou Jielun Zhang Shaohan | Zhou xingchi Jay Chou Zhang Shaohan
And so on. How can this be achieved? Generally there are two steps:
1. Annotate the query phonetically, that is, convert it to pinyin.
2. Look the annotation up in a pinyin dictionary. If the query's pinyin is found there, the matching dictionary entry is offered as the correction.

There are generally several difficulties:
1. Building the pinyin dictionary. At present, mining it from query logs is the most realistic option.
2. The phonetic annotation must tolerate regional pronunciation confusions, such as l and n.
3. Handling new words and long strings. A new word cannot be corrected when the pinyin dictionary has not kept up with it. A long string cannot be corrected as a whole either, because the logs do not contain it. Its substrings can still be corrected, but that requires word segmentation of text that contains incorrect words.

Now let's take a quick look at how Google handles this:
Test 1: Nokia norm Nokia
Test 2: Nokia norm 6310 Nokia 6310
Test 3: Nokia norm 6310 Nokia
Test 4: nov pressure 6310 Liu Yifei Nov pressure Nokia 6310 Liu Yifei Nokia
Test 5: Nov pressure 6310 Liu Yifei QQ Nov pressure Nokia 6310 Liu Yifei QQ Nokia Nov pressure 6310 Liu Yifei QQ Nokia
Test 6: nov pressure 6310 Liu Yifei QQ Nov pressure Nokia 6310 Liu Yifei QQ Nokia Nov pressure 6310 Liu Yifei QQ Nokia
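The two-step approach described above (phonetic annotation, then a pinyin dictionary lookup) might look like the following minimal Python sketch. Everything in it is an assumption made for illustration: the tiny CHAR_TO_PINYIN and CORRECT_TERMS tables stand in for a dictionary mined from logs, and the names annotate, fuzzy_key and correct are hypothetical. The l/n confusion is folded with a simple heuristic (an "n" directly before a vowel is treated as "l") applied to both the dictionary keys and the query, so it only makes the lookup key coarser.

```python
import re

# Hypothetical toy tables; a real system would mine the dictionary from query logs.
CHAR_TO_PINYIN = {
    "周": "zhou", "华": "hua", "健": "jian", "键": "jian",
    "诺": "nuo", "基": "ji", "亚": "ya",
}
CORRECT_TERMS = ["周华健", "诺基亚"]


def annotate(query):
    """Step 1: turn a query into a toneless pinyin string."""
    parts = []
    for ch in query.lower():
        if ch.isspace():
            continue
        # Known characters become pinyin; Latin letters pass through unchanged.
        parts.append(CHAR_TO_PINYIN.get(ch, ch))
    return "".join(parts)


def fuzzy_key(pinyin):
    """Fold the l/n confusion: an 'n' directly before a vowel becomes 'l'."""
    return re.sub(r"n(?=[aeiouv])", "l", pinyin)


# The dictionary is keyed by the fuzzy pinyin of each known-good term.
PINYIN_DICT = {fuzzy_key(annotate(term)): term for term in CORRECT_TERMS}


def correct(query):
    """Step 2: look the annotated query up; return None if there is no fix."""
    candidate = PINYIN_DICT.get(fuzzy_key(annotate(query)))
    return candidate if candidate != query else None
```

With these toy tables, correct("zhou huajian") and correct("周华键") both return "周华健", and correct("luojiya") returns "诺基亚" despite the l/n slip. A query that is already correct, or that is missing from the dictionary, returns None.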
We can see that Google makes good use of the cues users supply, such as separation by spaces and by numbers. Second, the query length is also limited: beyond a certain length, error correction may not be triggered.
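The two observations above can be sketched in Python as follows. This is not Google's method, only an illustration under stated assumptions: the query is split on whitespace and wherever a run of digits begins or ends, each non-numeric piece is corrected on its own, and queries longer than a cut-off are left untouched. MAX_QUERY_LEN, the CORRECTIONS table, and the function names are all hypothetical.

```python
import re

# Hypothetical values for illustration only.
MAX_QUERY_LEN = 30  # assumed cut-off; the real limit is not known
CORRECTIONS = {"nuojiya": "诺基亚", "liuyifei": "刘亦菲"}


def split_query(query):
    """Split on whitespace and at digit/non-digit boundaries."""
    return re.findall(r"\d+|[^\d\s]+", query)


def correct_query(query):
    """Correct each non-numeric piece independently, within the length limit."""
    if len(query) > MAX_QUERY_LEN:
        return query  # too long: skip correction entirely
    fixed = []
    for token in split_query(query):
        if token.isdigit():
            fixed.append(token)  # numbers act as separators and stay as they are
        else:
            fixed.append(CORRECTIONS.get(token.lower(), token))
    return " ".join(fixed)
```

For example, correct_query("nuojiya6310 liuyifei QQ") returns "诺基亚 6310 刘亦菲 QQ": the number splits the first piece even without a space, and the unknown token QQ passes through unchanged.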