I recently built a demo web page for automatic error correction: nfabo.cn. When a query contains typos, a search engine tries to correct them by matching similar pinyin.
The search engine converts the mistyped words back to pinyin and replaces them with a known query that has the same pinyin.
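To make the idea concrete, here is a minimal sketch of same-pinyin correction. It is only an illustration, not the code behind the demo: the `PINYIN` table, the `KNOWN` query frequencies, and the function names are all made up for this example, and each character is assumed to have exactly one reading.

```python
from collections import defaultdict

# Hypothetical toy table: one toneless reading per character (an assumption).
PINYIN = {
    "北": "bei", "背": "bei",
    "京": "jing", "景": "jing",
    "上": "shang", "海": "hai",
}

def pinyin_key(text):
    """Convert a string to a pinyin key; unknown characters pass through as-is."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)

def build_index(known_queries):
    """Map each pinyin key to its known queries, most frequent first."""
    index = defaultdict(list)
    for query, freq in known_queries.items():
        index[pinyin_key(query)].append((freq, query))
    for candidates in index.values():
        candidates.sort(reverse=True)
    return index

def correct(query, index):
    """Return the most frequent known query with the same pinyin, else the input."""
    candidates = index.get(pinyin_key(query))
    return candidates[0][1] if candidates else query

# Made-up known queries with made-up frequencies.
KNOWN = {"北京": 1000, "背景": 800, "上海": 900}
INDEX = build_index(KNOWN)
```

With this toy data, `correct("背京", INDEX)` picks `北京`, because both known candidates share the key `bei jing` and `北京` has the higher frequency. A query whose pinyin matches nothing is returned unchanged.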
However, when a mistyped Chinese character is a polyphone (a character with more than one pronunciation), and especially when a query contains many of them, search engines generally either ignore the problem or correct using only the most common reading. That is because considering every possible combination of pinyin readings can, in extreme cases, cause an exponential explosion!
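The explosion is easy to see with a naive enumeration. The sketch below is the brute-force approach, shown only to illustrate the problem; the `READINGS` table is a small hypothetical sample, and a real system would load polyphone data from a dictionary.

```python
from itertools import product

# Hypothetical sample of readings per character (toneless pinyin).
READINGS = {
    "重": ["zhong", "chong"],
    "行": ["xing", "hang"],
    "长": ["chang", "zhang"],
    "乐": ["le", "yue"],
    "庆": ["qing"],
}

def pinyin_variants(text):
    """Enumerate every pinyin reading of text (brute force)."""
    per_char = [READINGS.get(ch, [ch]) for ch in text]
    return [" ".join(combo) for combo in product(*per_char)]

def variant_count(text):
    """Count the readings without enumerating them: a product of per-character counts."""
    count = 1
    for ch in text:
        count *= len(READINGS.get(ch, [ch]))
    return count
```

Each polyphone with two readings doubles the number of keys to look up: `variant_count("重庆")` is 2, while four such characters in a row, as in `variant_count("重行长乐")`, already give 16. With n polyphones of two readings each, the count is 2^n.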
My algorithm solves this exponential explosion problem.
- The demo page currently contains only 8 million phrases with word frequencies, and the data is not very clean
- The algorithm runs entirely in memory and uses 360 MB; a traditional brute-force implementation would need several terabytes of memory to reach the same performance on this amount of data
- The server is a rented single-core virtual cloud host, about 3 times slower than my 2009 laptop
Error correction based on edit distance, that is, finding among the known search terms the one with the smallest edit distance to the user's query, can also be solved efficiently with my algorithm (there is no demo page for this yet).
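For reference, the problem itself can be stated in a few lines of code. The sketch below is the plain brute-force baseline (Levenshtein distance computed against every known term), not the efficient algorithm mentioned above; the function names and the sample terms are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def nearest_term(query, known_terms):
    """Return the known term closest to query; ties are broken alphabetically."""
    return min(known_terms, key=lambda term: (edit_distance(query, term), term))
```

For example, `nearest_term("serch", ["search", "engine", "query"])` returns `"search"`. The cost of this baseline grows linearly with the number of known terms, which is why it does not scale to millions of phrases. Note that `nearest_term` raises `ValueError` when `known_terms` is empty.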