Lucene: an easy-to-understand introduction to word segmentation

Source: Internet
Author: User

From the concepts covered earlier, we know that the job of an analyzer is to split a sentence into words according to its meaning. English already has a very mature analyzer, StandardAnalyzer, and in many cases StandardAnalyzer is a good choice. You will even find that StandardAnalyzer can segment Chinese text.
But our focus is Chinese word segmentation. Can StandardAnalyzer really handle Chinese? In practice it can, but the results are poor: a search for 如果 ("if") also returns 牛奶不如果汁好喝 ("milk is not as good as juice"), and the index files are very large. So what else do we have on hand? There is nothing in the Lucene core, but the sandbox contains two candidates: ChineseAnalyzer and CJKAnalyzer. Both of them also segment inaccurately. By comparison, StandardAnalyzer and ChineseAnalyzer take about the same time to index and produce index files of about the same size, while CJKAnalyzer performs worse: its index files are larger and indexing takes longer.
To solve the problem, first look at how these three analyzers segment text. StandardAnalyzer and ChineseAnalyzer both split a sentence into single characters, so 牛奶不如果汁好喝 ("milk is not as good as juice") becomes "牛 奶 不 如 果 汁 好 喝". CJKAnalyzer instead splits it into overlapping two-character pairs: "牛奶 奶不 不如 如果 果汁 汁好 好喝". This explains why a search for 如果 ("if") can match this sentence.
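The two splitting strategies described above can be sketched in plain Java without any Lucene dependency. This is only an illustration of the idea, not the analyzers' actual implementation: the class name `NGramDemo` and its methods are hypothetical, the code assumes the text contains only Chinese characters from the Basic Multilingual Plane, and a query is assumed to match when its tokens appear consecutively in the indexed tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that mimics single-character and two-character splitting.
final class NGramDemo {

    // One token per character, the behavior the article attributes to
    // StandardAnalyzer and ChineseAnalyzer on Chinese text.
    static List<String> unigrams(String text) {
        return ngrams(text, 1);
    }

    // Overlapping two-character tokens, the behavior attributed to CJKAnalyzer.
    static List<String> bigrams(String text) {
        return ngrams(text, 2);
    }

    // Slides a window of n characters over the text, one character at a time.
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    // Simplified phrase match: the query tokens must occur consecutively.
    static boolean matches(List<String> indexed, List<String> query) {
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        String sentence = "牛奶不如果汁好喝";
        System.out.println(unigrams(sentence)); // [牛, 奶, 不, 如, 果, 汁, 好, 喝]
        System.out.println(bigrams(sentence));  // [牛奶, 奶不, 不如, 如果, 果汁, 汁好, 好喝]
        // The unwanted hit: 如果 ("if") matches under both schemes.
        System.out.println(matches(unigrams(sentence), unigrams("如果"))); // true
        System.out.println(matches(bigrams(sentence), bigrams("如果")));   // true
    }
}
```

The bigram scheme emits seven overlapping tokens for these eight characters, each two characters long, which is consistent with the larger index the article reports for CJKAnalyzer.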
The segmentation above has at least two drawbacks: false matches and large index files. Our goal is to split the sentence into "牛奶 不如 果汁 好喝" ("milk / not as good as / juice / tasty"). The key is semantic recognition: how do we recognize that 牛奶 ("milk") is a word and 奶不 is not? The natural answer is dictionary-based segmentation. We obtain a dictionary that lists most words, split the sentence in some way, and consider a split correct when the resulting pieces match entries in the dictionary. Segmentation thus becomes a matching process. The two simplest matching strategies are forward maximum matching and reverse maximum matching: put plainly, one matches forward from the beginning of the sentence and the other matches backward from the end. The dictionary is critical to dictionary-based segmentation, and its coverage directly affects search results. Given the same dictionary, reverse maximum matching is said to perform better than forward maximum matching.
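A minimal sketch of forward and reverse maximum matching follows, again in plain Java. The class `MaxMatchDemo`, the tiny dictionaries, and the `maxLen` parameter (the longest dictionary word, in characters) are all assumptions made for illustration. Characters that no dictionary word covers are emitted as single-character tokens. The second sentence, 研究生命起源 ("research the origin of life"), is not from the article; it is a common toy example where the two directions disagree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary-based segmenter; not part of Lucene.
final class MaxMatchDemo {

    // Forward maximum matching: at each position, take the longest
    // dictionary word starting there, else fall back to one character.
    static List<String> forward(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(maxLen, text.length() - start);
            while (len > 1 && !dict.contains(text.substring(start, start + len))) {
                len--;
            }
            words.add(text.substring(start, start + len));
            start += len;
        }
        return words;
    }

    // Reverse maximum matching: the same idea, scanning from the end
    // of the sentence toward the beginning.
    static List<String> backward(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxLen, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            words.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(words); // words were collected right to left
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("牛奶", "不如", "如果", "果汁", "好喝");
        System.out.println(forward("牛奶不如果汁好喝", dict, 2));  // [牛奶, 不如, 果汁, 好喝]
        System.out.println(backward("牛奶不如果汁好喝", dict, 2)); // [牛奶, 不如, 果汁, 好喝]

        Set<String> dict2 = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(forward("研究生命起源", dict2, 3));  // [研究生, 命, 起源]
        System.out.println(backward("研究生命起源", dict2, 3)); // [研究, 生命, 起源]
    }
}
```

With this segmentation 如果 is never emitted as a token for the milk sentence, so a search for it no longer matches, and four tokens are indexed instead of seven or eight. The second sentence shows one case where scanning from the end gives the more sensible split, though a single example does not establish that reverse matching is better in general.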
Of course, there are other segmentation methods. Word segmentation is a discipline in its own right, and I have not studied it in depth. Back to the practical question: our goal is to find a mature, ready-made segmentation tool and avoid reinventing the wheel. A search online shows that the most widely used options are ICTCLAS from the Chinese Academy of Sciences and JE-Analysis, which is free but not open source. The problem with ICTCLAS is that it is a dynamic link library, so calling it from Java requires native method calls, which is inconvenient and carries security risks, and its reputation is not particularly good. JE-Analysis works reasonably well. It still gets some segmentations wrong, but by comparison it is more convenient and more reassuring to use.
