Introduction
People who use search engines misspell their keywords for a variety of reasons: a keyboard problem (a broken key), unfamiliarity with a foreign name (such as Sigmund Freud), accidentally typing the wrong letter (sinpsons), or typing an extra letter (Frusciaante). Many users are familiar with the "Did you mean" feature of the Google search engine. This feature offers alternative suggestions when it detects that the search keyword may be misspelled.
Text search is common in all kinds of applications, such as e-commerce websites. These sites usually provide text search so that users can find the products in the catalog that match a keyword. When a user misspells the keyword, the likely result is a lost sale. For example, suppose you run an online store that sells DVDs. A fan of Arnold Schwarzenegger wants to buy every DVD starring the actor from your shop. The first thing he does is type Schwarzenegger's name in the search field. If he misspells it as "Arnold Swuazeneger" and your shop returns no relevant results, he will go to another shop to buy the DVDs.
One solution to this problem is to use built-in domain knowledge to implement a "Did you mean" feature that offers the user a suggestion such as "Did you mean Arnold Schwarzenegger?". This article discusses how to implement this feature in Java.
Edit Distance algorithm
In information theory and computer science, the edit distance between two strings is the number of operations required to transform one string into the other. There are several ways to define edit distance, and a number of algorithms compute a distance value based on these definitions. The main algorithms are Levenshtein, Jaro-Winkler and N-gram. Jaro-Winkler is an extension of the Jaro distance metric and is used primarily in record linkage (duplicate detection). In the Levenshtein algorithm, the distance between two strings is defined as the minimum number of edits required to convert one string into the other, where the allowed edits are the insertion, deletion, or substitution of a single character. The algorithm was presented by Vladimir Levenshtein in 1965 and is named after him. N-gram is a probabilistic model that predicts the next item in a sequence; it is widely used in statistical natural language processing and in gene sequence analysis.
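To make the Levenshtein definition concrete, here is a minimal sketch of the classic dynamic-programming computation. The class name LevenshteinSketch and its method are illustrative assumptions only; this is not the code Lucene uses, just one straightforward way to count insertions, deletions, and substitutions.

```java
public class LevenshteinSketch {

    /**
     * Returns the minimum number of single-character insertions,
     * deletions, and substitutions needed to turn a into b.
     */
    public static int distance(String a, String b) {
        // prev[j] holds the distance between the first i-1 chars of a
        // and the first j chars of b; curr[j] is the row being filled.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning an empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning a[0..i) into an empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The misspellings from the introduction are each one edit away.
        System.out.println(distance("sinpsons", "simpsons"));      // prints 1
        System.out.println(distance("Frusciaante", "Frusciante")); // prints 1
    }
}
```

Keeping only two rows of the table is a design choice that reduces memory from O(n*m) to O(m) while leaving the result unchanged.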
This article is not about how to implement these algorithms from the ground up. Instead, we focus on how to apply them using an existing project: the SpellChecker in Apache Lucene.
In simple terms, the Lucene spell checker implementation consists of the main class SpellChecker, which works with a Directory, a Dictionary, and one of three StringDistance algorithms. The SpellChecker class uses the Strategy pattern (GoF) to select the StringDistance algorithm. The built-in StringDistance implementations are JaroWinklerDistance, LevenshteinDistance and NGramDistance; the default is LevenshteinDistance. You can also adjust the accuracy, which ranges from 0 to 1 and defaults to 0.5. The accuracy setting has a significant effect on the results: you may find that it should be set higher than the default, but if it is set too high the algorithm may not return any results at all. With my dictionary, an accuracy of 0.749 gave the best results. The Dictionary interface has two direct implementations, and you can also write your own.
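The interplay between a pluggable distance strategy and the accuracy threshold can be sketched in plain Java without Lucene. Everything named here (DidYouMeanSketch, Similarity, suggest) is a hypothetical stand-in written for illustration, not Lucene's API; the sketch only assumes that a strategy returns a score between 0 and 1, where 1 means identical. In Lucene itself the corresponding settings are, to my knowledge, SpellChecker.setStringDistance and SpellChecker.setAccuracy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DidYouMeanSketch {

    /** Hypothetical strategy interface: 1.0 means identical, 0.0 means nothing in common. */
    public interface Similarity {
        float score(String a, String b);
    }

    /** One possible strategy: edit distance normalised by the longer string's length. */
    public static final Similarity LEVENSHTEIN = (a, b) -> {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    };

    /** Plain Levenshtein distance (insert, delete, substitute), two-row version. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the dictionary entries whose score against word reaches the accuracy threshold. */
    public static List<String> suggest(String word, List<String> dictionary,
                                       Similarity strategy, float accuracy) {
        List<String> suggestions = new ArrayList<>();
        for (String candidate : dictionary) {
            // Skip exact matches: a correctly spelled word needs no suggestion.
            if (!candidate.equals(word) && strategy.score(word, candidate) >= accuracy) {
                suggestions.add(candidate);
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        // Illustrative dictionary; a real one would come from the product catalog.
        List<String> dictionary = Arrays.asList("schwarzenegger", "stallone", "swayze");

        // "swuazeneger" is 5 edits from "schwarzenegger", a score of about 0.64.
        System.out.println(suggest("swuazeneger", dictionary, LEVENSHTEIN, 0.5f));   // prints [schwarzenegger]
        System.out.println(suggest("swuazeneger", dictionary, LEVENSHTEIN, 0.749f)); // prints []
    }
}
```

The second call shows the effect described above: with this toy dictionary, raising the threshold to 0.749 filters out the only plausible suggestion, so the best accuracy value depends on the dictionary and on how badly users tend to misspell its entries.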
For our "Did you mean" implementation, we search a subset of the keywords in the dictionary, look for keywords that are "close" according to the selected string distance algorithm, and compare their distance against the preset accuracy. Figure 1 is an overview class diagram of the Lucene SpellChecker.