Getting Started with Solr: An Application-Level Attempt at the Spell Checker Feature

Source: Internet
Author: User
Tags: solr

Today I collected some information about Solr's spell checker and tried to use it. I ran into a lot of problems.

Of the four spell check configurations, I have so far gotten only one working, and even that only halfway.


---------------------------------

Spell checking gives users a better search experience, so the major search engines all have it. First, briefly, what is spell checking? It is easy to understand: when you enter a search term, you may have mistyped it, or the term may simply not exist in the search index. In that case the engine can return similar or related terms to help you correct the query.

For example, if you enter "online battery" in Baidu, that phrase may not be in the index, but Baidu may return terms such as "online movie", "online TV", and "watch online". That is the spell check feature at work.

Solr, as an open-source search server, also supports spell checking well. Below I describe the spell check configuration for Solr 4.3. One point first: to improve correction accuracy, the field used for correction is generally not tokenized, so a string type is used. Spell checking is configured mainly in solrconfig.xml.



1. Configure the spelling component, SpellCheckComponent
2. Configure it inside the /select SearchHandler
3. Configure it inside the /spell SearchHandler

With the three items above you can quickly set up spell checking. In fact only items 2 and 3 have to be configured; the remaining settings (my notes counted a fourth step) can keep their default values. They are written out here only so that everyone has an overview.


The SpellCheckComponent is the core piece. One or more spell checkers can be configured inside it, and all of them are loaded at startup. Here I mainly introduce two of them: the default one, which corrects spelling only against the main index, and a custom one that loads a spellings.txt dictionary and keeps its own spell check index. For the other checkers, see the wiki:

https://cwiki.apache.org/confluence/display/solr/Spell+Checking


1. Several ways to configure spell checkers in solrconfig.xml

   <!-- Spell check settings -->
   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
     <!-- Query analyzer; if not specified, the analyzer of the field's type is used by default -->
     <str name="queryAnalyzerFieldType">text_spell</str>

     <lst name="spellchecker">
       <str name="name">direct</str>
       <str name="field">suggest</str>
       <str name="classname">solr.DirectSolrSpellChecker</str>
       <str name="distanceMeasure">internal</str>
       <float name="accuracy">0.5</float>
       <int name="maxEdits">2</int>
       <int name="minPrefix">1</int>
       <int name="maxInspections">5</int>
       <int name="minQueryLength">2</int>
       <float name="maxQueryFrequency">0.01</float>
     </lst>

     <!-- Corrects against an index built from a spelling dictionary file;
          uses the default configuration, just uncomment it -->
     <lst name="spellchecker">
       <str name="classname">solr.FileBasedSpellChecker</str>
       <!-- This checker works by loading a file (can it work when the source is a field rather than a file?) -->
       <str name="name">file</str>
       <str name="sourceLocation">spellings.txt</str>
       <str name="characterEncoding">UTF-8</str>
       <str name="spellcheckIndexDir">spellcheckerFile</str>
     </lst>

     <!-- ================================================================ -->
     <lst name="spellchecker">
       <!-- Optional; required when more than one spellchecker is configured.
            Select a non-default name with spellcheck.dictionary in the request handler.
            (With only one spellchecker the name may be omitted; with several,
            spellcheck.dictionary must be specified in the request handler.) -->
       <str name="name">base</str>
       <!-- The classname is optional, defaults to IndexBasedSpellChecker -->
       <str name="classname">solr.IndexBasedSpellChecker</str>
       <!-- Load tokens from the following field for spell checking; the analyzer
            for the field's type as defined in schema.xml is used. This field is
            the basis of the checker, i.e. the field user input is checked against. -->
       <str name="field">suggest</str>
       <!-- Optional; by default an in-memory index (RAMDirectory) is used.
            Location of the spell check index: ./spellchecker1 refers to
            corex\data\spellchecker1 -->
       <str name="spellcheckIndexDir">./spellchecker-base</str>
       <!-- Set the accuracy (float) to be used for the suggestions. Default is 0.5 -->
       <str name="accuracy">0.7</str>
       <!-- When to build the spelling index: buildOnCommit / buildOnOptimize -->
       <str name="buildOnCommit">true</str>
     </lst>

     <!-- Another spell checker that uses the JaroWinklerDistance distance measure;
          it should be possible to build it from the main index in the same way -->
     <lst name="spellchecker">
       <str name="name">jarowinkler</str>
       <str name="classname">solr.IndexBasedSpellChecker</str>
       <str name="field">suggest</str>
       <!-- Use a different distance measure -->
       <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
       <str name="spellcheckIndexDir">./spellchecker2</str>
       <str name="buildOnCommit">true</str>
     </lst>
     <!-- ================================================================ -->
   </searchComponent>

2. Configure the Solr request handlers: the /select and /spell sections

   <requestHandler name="/select" class="solr.SearchHandler">
     <!-- Default values for query parameters can be specified; these will be
          overridden by parameters in the request -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
     </lst>
     <!-- This block is very important: without it, the spell checker does not work -->
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
   </requestHandler>

   <!-- Spell check handler -->
   <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
     <lst name="defaults">
       <str name="df">suggest</str> <!-- default query field -->
       <str name="spellcheck.dictionary">direct</str> <!-- which checker to use -->
       <str name="spellcheck">on</str>
       <str name="spellcheck.extendedResults">true</str>
       <str name="spellcheck.collate">true</str>
       <str name="spellcheck.collateExtendedResults">true</str>
     </lst>
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
   </requestHandler>


Notes on file mode

With the configuration above you can load several checkers, but a spell check request typically names one specific checker to do the correction. So why configure several? I personally think it is mainly so that the program can switch checkers dynamically at run time.
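
Switching checkers per request can be sketched as follows. This is a minimal sketch, not a client library: the base URL, the core name `collection1`, and the helper name `build_spell_url` are assumptions for illustration, while `spellcheck.dictionary` and the checker names `direct`, `file`, `base`, and `jarowinkler` come from the configuration above. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr core; adjust host, port, and core name.
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_spell_url(query, dictionary="direct", handler="/spell"):
    """Build a request URL that asks one named spell checker for suggestions."""
    params = {
        "q": query,
        "spellcheck": "true",
        # Name of the <lst name="spellchecker"> entry to use for this request.
        "spellcheck.dictionary": dictionary,
        "spellcheck.collate": "true",
        "wt": "json",
    }
    return SOLR_CORE_URL + handler + "?" + urlencode(params)


if __name__ == "__main__":
    # The same query, routed to each configured checker in turn.
    for name in ("direct", "file", "base", "jarowinkler"):
        print(build_spell_url("solar", dictionary=name))
```

Because the checker is chosen by a request parameter, the application can compare checkers or fall back from one to another without touching solrconfig.xml.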

For the custom checker based on spellings.txt, note that the file must be encoded as UTF-8 without a BOM. When the Solr service starts, it automatically creates the spellcheckerFile folder in the data directory of this core and loads the words from the file into it.
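
The encoding requirement can be checked before the file is deployed. The following is a small sketch using only the Python standard library; the helper names are hypothetical, and the one-word-per-line layout is the plain-text format that the file-based checker reads.

```python
import codecs
from pathlib import Path


def write_spellings(path, words):
    """Write one word per line as UTF-8 without a byte order mark."""
    # The plain "utf-8" codec never emits a BOM (unlike "utf-8-sig").
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


def strip_utf8_bom(path):
    """Rewrite the file in place without a leading BOM, if one is present."""
    data = Path(path).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        Path(path).write_bytes(data[len(codecs.BOM_UTF8):])
```

Running `has_utf8_bom("spellings.txt")` on a file saved by a Windows editor is a quick way to find out whether a BOM is the reason the dictionary does not load as expected.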

3. Query attempt in the Solr admin interface


This is the effect I observed when trying the direct method. I suspect the results depend heavily on the configured parameters.


As for the other approaches, I mainly tried the one that loads a dictionary file, and I did not get it working today.

Tomorrow's try:

The following is some background on the underlying principles:

First, error correction, called spellcheck in English. For English text, correction works directly on the edit distance between words. The goal, of course, is this: for any input, find among a large set of correct, reliable query words the one or several whose edit distance meets the requirement.

For this task, the model computes, for a user's mistyped word w, the probability of each correct word c, and picks the best one, that is, argmax_c P(c|w). There are generally two approaches: one is described at http://norvig.com/spell-correct.html, and the other is the SpellChecker method in lucene-suggest.


1. The first approach, described by Norvig, spells out how argmax_c P(c|w) is transformed and solved. This probability is hard to compute directly, but by Bayes' theorem it is equivalent to argmax_c P(w|c) * P(c) / P(w). Because we only compare the values for different c, P(w) can be dropped, which leaves argmax_c P(w|c) * P(c). P(c) is the probability that c appears in the text collection, and P(w|c) is the probability that someone who meant c typed w instead. P(c) can be counted from a reliable corpus, and P(w|c) can be approximated by edit distance: the smaller the edit distance, the larger the probability. In the implementation, for an input word, generate all strings at edit distance 1, which covers four cases: delete a character, swap adjacent characters, replace one character with another, insert one character. The resulting candidate set is fairly large and meets nearly 80% of correction needs. Extending it to edit distance 2 gives a still larger candidate set that covers almost all typos. The original article is more detailed and its modeling is very clear; I recommend reading it carefully and will not elaborate here.
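
The candidate generation and ranking just described can be sketched in Python, the language of Norvig's article. This is a condensed illustration rather than his exact code: the tiny `WORD_COUNTS` table is an assumption standing in for counts from a large corpus, and P(w|c) is modeled only by preferring candidates at a smaller edit distance.

```python
from collections import Counter
import string

# Toy counts standing in for a large, reliable corpus (an assumption).
WORD_COUNTS = Counter({"word": 50, "work": 30, "world": 20, "worth": 10})


def edits1(word):
    """All strings at edit distance 1: delete, transpose, replace, insert."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings at edit distance 2, built on top of edits1."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the candidates that occur in the corpus."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Pick the most frequent known candidate, preferring smaller distances."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    # Within one distance tier, P(c) (the corpus count) decides.
    return max(candidates, key=lambda w: WORD_COUNTS[w])


if __name__ == "__main__":
    print(correct("worg"))
```

With these toy counts, "worg" has two known candidates at distance 1 ("word" and "work"), and the more frequent one wins.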

2. The second approach is Lucene's SpellChecker. It turns the edit-distance check into a dictionary lookup: the dictionary is indexed in advance using n-grams, that is, every 2-character and 3-character fragment of each word is indexed. The word entered by the user is likewise split into 2-character and 3-character fragments, which are searched with OR. Words that match more fragments score higher and are the most likely intended spellings. Because this is an OR query, many words that are only "close" are also hit, so besides ranking by score, the hits are also filtered by an edit-distance threshold against the input. For example, "word" is indexed with the fragments n2:wo / n2:or / n2:rd / n3:wor / n3:ord. When the user enters "worg", the fragments n2:wo / n2:or / n2:rg / n3:wor / n3:org are produced, and searching with them finds many words such as "work" and "worth". There are some refinements in the details, such as giving larger weight to the fragments at both ends of a word.

Comparing the two ways of solving argmax_c P(c|w): Norvig's method does more work online than the Lucene SpellChecker method, so its efficiency is probably somewhat lower, but it offers a very ingenious approach that is worth studying carefully.
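
The n-gram lookup can be illustrated with a small in-memory sketch. It is not Lucene's implementation: the function names are hypothetical, every fragment has the same weight (Lucene boosts fragments at the word boundaries), and a plain dictionary stands in for the index.

```python
from collections import Counter, defaultdict


def ngrams(word, sizes=(2, 3)):
    """Character fragments, e.g. 'word' -> wo, or, rd, wor, ord."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]


def build_index(dictionary):
    """Map each fragment to the set of dictionary words that contain it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in ngrams(word):
            index[gram].add(word)
    return index


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(query, index, max_distance=2, limit=5):
    """OR-search the fragments, then filter hits by an edit-distance threshold."""
    hits = Counter()
    for gram in ngrams(query):
        for word in index.get(gram, ()):
            hits[word] += 1  # more shared fragments means a higher score
    close = [(word, score) for word, score in hits.items()
             if word != query and edit_distance(query, word) <= max_distance]
    close.sort(key=lambda item: (-item[1], edit_distance(query, item[0]), item[0]))
    return [word for word, _ in close[:limit]]


if __name__ == "__main__":
    idx = build_index(["word", "work", "worth", "sword", "banana"])
    print(suggest("worg", idx))
```

The fragment search is cheap and deliberately loose; the edit-distance filter afterwards removes words that merely share a few fragments with the input.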


Second, related search. Academia has studied this a lot under various names: query rewriting, query substitution, query expansion, and so on. The algorithms also vary widely, and many of them add complex computation and data-dependent considerations mostly to make the results look good. Engineering generally needs a general-purpose method first, with special considerations added afterwards to improve the results. I was once lucky enough to come across a seemingly informal paper whose method was very simple, clearly reasoned, and well suited to practical engineering. I no longer remember the title, but I still remember the idea clearly: find the weak links between query terms.

In general, three main factors create associations between queries:

1. Literal association: if one query has more or fewer words than another, the two must be related. The longer query is a specialization and the shorter one a generalization. For example, "notebook memory module 8G" is a refinement of "notebook memory module"; conversely, "notebook memory module" covers not only "8G" but other capacities as well, so it is the more general query.

2. Association through input behavior: queries entered consecutively within one session can be considered related, because together they reflect one person's needs. For example, a user who searched for "keyboard" may also need to buy something else, such as a "mouse". If this happens many times, "keyboard" and "mouse" can be seen as strongly linked.

3. Association through click behavior: a user who cannot find something may keep rephrasing the query, or different people may describe the same thing in different ways. If these queries all lead to the same result, they can be regarded as different routes to the same thing. The paths that end at the same result, that is, the queries, can then also be seen as strongly related.


These three kinds of relationships are generic and straightforward, and the data for them can easily be obtained or recorded in logs. Beyond these, further relevance considerations can be added for specific business situations. Finally, the weights of the different relationships can be adjusted based on experience or statistical analysis. For the implementation, given a query, you need to be able to look up the queries associated with it. This suggests using a retrieval system. A traditional retrieval system tokenizes the document content and indexes the resulting tokens; here the document is a query, and a special "tokenization" emits its associated tokens, which are then indexed.
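
The idea of "tokenizing" a query into its associations and indexing those tokens can be sketched as follows. Everything here is an assumption for illustration: the weights, the token prefixes, the sample data, and the simplification that literal association means sharing terms. It is not the method of the paper mentioned above.

```python
from collections import Counter, defaultdict

# Assumed weight per signal; in practice tuned by experience or statistics.
WEIGHTS = {"literal": 1.0, "session": 2.0, "click": 3.0}


def association_tokens(queries, sessions, clicks):
    """Turn each query into a set of tokens that encode its associations."""
    tokens = defaultdict(set)
    for query in queries:
        for term in query.split():
            tokens[query].add("literal:" + term)       # shared words
    for session_id, session_queries in sessions.items():
        for query in session_queries:
            tokens[query].add("session:" + session_id)  # typed in one session
    for query, doc_id in clicks:
        tokens[query].add("click:" + doc_id)            # led to the same result
    return tokens


def build_index(tokens):
    """Inverted index: association token -> queries that carry it."""
    index = defaultdict(set)
    for query, bag in tokens.items():
        for token in bag:
            index[token].add(query)
    return index


def related(query, tokens, index, limit=5):
    """Rank other queries by the weighted number of shared association tokens."""
    scores = Counter()
    for token in tokens.get(query, ()):
        weight = WEIGHTS[token.split(":", 1)[0]]
        for other in index[token]:
            if other != query:
                scores[other] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:limit]]


if __name__ == "__main__":
    queries = ["laptop memory", "laptop memory 8g", "keyboard", "mouse"]
    sessions = {"s1": ["keyboard", "mouse"], "s2": ["keyboard", "mouse"]}
    clicks = [("laptop memory", "doc1"), ("laptop memory 8g", "doc1")]
    toks = association_tokens(queries, sessions, clicks)
    idx = build_index(toks)
    print(related("keyboard", toks, idx))
    print(related("laptop memory", toks, idx))
```

In a real system the inverted index would live in the search engine itself, with each query stored as a document whose fields hold these association tokens.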


Finally, if you want to combine error correction with related search, there is a lot more to consider. In short, related search is a service beyond retrieval itself that affects the user experience, and it is worth the effort to do it well.

Reprint: http://blog.csdn.net/lgnlgn/article/details/8760785







