Background:
When using search engines or the big e-commerce sites, you have surely run into this situation: I want to search for Blog Park but carelessly mistype the name. There is no need to worry about missing the results you want, because search engines built on big data will correct the query for you automatically. For this example, Google and Baidu returned, respectively:
"Showing results for: Blog Park" and "Did you mean: Blog Park". Both performed automatic error correction. I wrote an earlier article about automatic error correction, in which I implemented my own N-gram model, but the results were not very good, mainly because the algorithm's accuracy differs from corpus to corpus. I wanted to try a different algorithm. The current mainstream way to compute edit distance (which, seen from the other side, can also be understood as similarity) is Levenshtein. When I set out to implement it, I discovered that Lucene had already done this, so let us stand on the shoulders of giants.
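To make the idea of edit distance concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein computation, plus one common way to turn the distance into a similarity between 0 and 1. The class and method names are my own for illustration, and this is not Lucene's code.

```java
// Minimal sketch of Levenshtein edit distance (illustrative names; not Lucene's code).
final class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalizes the distance to a similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }
}
```

For example, levenshtein("kitten", "sitting") is 3 (two substitutions and one insertion), which gives a similarity of 1 - 3/7, roughly 0.57.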
Referenced packages:
lucene-core-3.1.0.jar + lucene-spellchecker-3.1.0.jar, which you can get here
Usage example:
Add the following code to the main method of class SpellCorrecter (the imports go at the top of the file, and main needs to declare throws IOException):
import java.io.File;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Create the directory that holds the spell-check index
File dict = new File("");
Directory directory = FSDirectory.open(dict);
// Instantiate the spelling checker
SpellChecker sp = new SpellChecker(directory);
// Create the dictionary file
File dictionary = new File(SpellCorrecter.class.getResource("Dictionary.txt").getFile());
// Index the dictionary
sp.indexDictionary(new PlainTextDictionary(dictionary));
// Search string containing a typo
String search = "very undisturbed";
// Number of suggestions; here I only want the closest one, but you can set another number, such as 3
int suggestionNumber = 1;
// Get the suggested keywords
String[] suggestions = sp.suggestSimilar(search, suggestionNumber);
// Display the results
System.out.println("Search: " + search);
for (String word : suggestions) {
    System.out.println("Did you mean: " + word);
}
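To show roughly what "suggest the closest entry" means, here is a naive sketch that ranks every dictionary entry by edit distance to the query and returns the top few. It is only an illustration under my own assumptions: as far as I understand, Lucene's SpellChecker first narrows the candidates through an n-gram index and only then ranks them with a string distance, so it does not scan the whole dictionary like this. The class name, method names, and the misspelled query in the usage note are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Naive "did you mean" sketch: brute-force ranking by edit distance (not Lucene's approach).
final class NaiveSuggester {

    // Standard two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns up to n dictionary entries, closest to the query first.
    // List.sort is stable, so entries with equal distance keep their dictionary order.
    static List<String> suggest(String query, List<String> dictionary, int n) {
        List<String> candidates = new ArrayList<>(dictionary);
        candidates.sort(Comparator.comparingInt((String w) -> distance(query, w)));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

With a dictionary containing "Prairie Wolf Jazz", "Save Private Ryan", and "Xiaoxiang Road", calling suggest("Save Privte Ryan", dictionary, 1) returns a list holding only "Save Private Ryan", because that entry is a single insertion away from the query.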
Note: you need to prepare a corpus beforehand. Mine is a file of correct video names, one name per line, in the following format:
Beauty
of Blood and tears ice fire general passion in the hands of the
enemy
Wind competition boat King second
fishing beetle
Xiaoxiang Road A
play outside the second season
Prairie Wolf Jazz
Save Private Ryan
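Since PlainTextDictionary reads the file one entry per line, it can help to tidy the corpus before indexing it. The sketch below is an optional step of my own, not part of the original code: it trims whitespace, drops empty lines, and removes duplicates while keeping the original order. The class and method names are hypothetical, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Optional corpus clean-up before indexing (illustrative helper, not part of Lucene).
final class CorpusCleaner {

    // Trims each line, skips blank lines, and removes duplicates in first-seen order.
    static List<String> clean(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            String name = line.trim();
            if (!name.isEmpty()) {
                unique.add(name);
            }
        }
        return new ArrayList<>(unique);
    }

    // Rewrites the corpus file in place with the cleaned entries (assumes UTF-8).
    static void rewrite(Path corpus) throws IOException {
        List<String> lines = Files.readAllLines(corpus, StandardCharsets.UTF_8);
        Files.write(corpus, clean(lines), StandardCharsets.UTF_8);
    }
}
```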
OK, let's run it directly; the result is shown in the following figure:
The complete code and a dictionary are here (for work-related reasons, the dictionary keeps only part of the movie names; you can use your own corpus).