Stemming and lemmatization

Source: Internet
Author: User
Tags: stem words

Today I was implementing a stemming function and found that the extracted result was not always a meaningful word. It turns out there is also a related technique, lemmatization.

The following content is excerpted from a comparative analysis of lemmatization methods and implementation tools.

Lemmatization reduces a word in any form to its general form (which expresses complete semantics), whereas stemming extracts the stem or root form of a word (which does not necessarily express complete semantics). Lemmatization and stemming are two important kinds of word-form normalization, and both can effectively merge word forms. The two methods are related, but they also differ.

The similarities and connections between the two are summarized as follows:
(1) Consistent objectives. Both stemming and lemmatization aim to reduce the inflected or derived forms of a word to a stem or base form. Each is a process of merging the different forms of a word into one.
(2) Overlapping results. The results of stemming and lemmatization are not mutually exclusive; they partially overlap. For some words, both methods produce the same conversion. For example, the stem of "dogs" is "dog", and its base form is also "dog".
(3) Similar mainstream implementations. Currently, the mainstream implementations of both stemming and lemmatization extract the stem or obtain the base form of a word by using language rules or dictionary mapping.
(4) Similar application fields. Both are mainly used in information retrieval, text processing, and natural language processing, and both are basic steps in these applications.
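
Points (2) and (3) can be illustrated with a minimal sketch. The suffix list and the mini dictionary below are hypothetical and invented purely for illustration; real stemmers use far more rules, and real lemmatizers use full dictionaries. The sketch only shows the two mechanisms (rule-based suffix stripping and dictionary mapping) and one word on which they agree.

```python
# Toy illustration only: the suffix list and dictionary are hypothetical.

# Suffixes the toy stemmer strips, tried in order.
SUFFIXES = ["ing", "ed", "s"]

# Mini dictionary mapping inflected forms to base forms.
LEMMA_DICT = {"dogs": "dog", "mice": "mouse", "drove": "drive"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


if __name__ == "__main__":
    for w in ["dogs", "mice", "drove"]:
        print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

Both functions map "dogs" to "dog", which is the overlap described in point (2). For an irregular form such as "mice", the rule-based sketch leaves the word unchanged, while the dictionary lookup returns "mouse".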

The differences between the two are summarized as follows:
(1) In principle, stemming mainly uses "reduction" to convert a word into its stem, for example reducing "cats" to "cat" and "effective" to "effect". Lemmatization mainly uses "transformation" to convert a word into its base form, for example converting "drove" to "drive" and "driving" to "drive".
(2) In terms of complexity, stemming is relatively simple. Lemmatization must return the base form of a word, so it has to analyze the word form: besides converting affixes, it must also perform part-of-speech recognition to distinguish identical word forms that have different base forms. The accuracy of part-of-speech tagging directly affects the accuracy of lemmatization, so lemmatization is the more complex of the two.
(3) In terms of implementation, although the mainstream methods for stemming and lemmatization are similar, each has its own emphasis. Stemming mainly applies rules that remove or shorten affixes in order to simplify a word. Lemmatization is more complex in principle and involves complicated morphological changes that rules alone cannot handle. It relies more on a dictionary that maps inflected forms to base forms, so that the result is a valid dictionary word.

(4) In terms of results, stemming and lemmatization also differ. The result of stemming may not be a complete, meaningful word, but only part of a word. For example, stemming "revival" yields "reviv", and stemming "airliner" yields "airlin". The result of lemmatization is a complete, meaningful word, generally a valid word in the dictionary.

(5) In terms of application, the two also have different emphases. Both are used in information retrieval and text processing, but with a different focus. Stemming is more widely used in information retrieval, for example in Solr and Lucene, for coarse-grained query expansion. Lemmatization is mainly used in text mining and natural language processing, for finer-grained and more accurate text analysis and representation.
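
The differences above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Porter algorithm or any library's API: the suffix list and the lexicon keyed by (word, part of speech) are hypothetical, and the "leaves" entries are an added example of one word form with two different base forms.

```python
# Toy illustration only; the rule list and lexicon below are hypothetical.

# Suffixes the toy stemmer strips, tried in order ("reduction").
SUFFIXES = ["ive", "al", "er", "s"]

# Toy lexicon keyed by (surface form, part of speech) ("transformation").
LEXICON = {
    ("drove", "VERB"): "drive",
    ("driving", "VERB"): "drive",
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Map (word, part of speech) to a base form; fall back to the word."""
    word = word.lower()
    return LEXICON.get((word, pos), word)


if __name__ == "__main__":
    # Stems need not be real words.
    for w in ["cats", "effective", "revival", "airliner"]:
        print(w, "->", toy_stem(w))
    # The same surface form can have different base forms.
    print(toy_lemmatize("leaves", "NOUN"), toy_lemmatize("leaves", "VERB"))
    print(toy_lemmatize("drove", "VERB"), toy_lemmatize("driving", "VERB"))
```

The stemmer only cuts characters off, so "revival" becomes "reviv" and "airliner" becomes "airlin", neither of which is a word. The lemmatizer needs the part of speech to choose between "leaf" and "leave" for "leaves", which is why tagging accuracy matters, and its result is always an entry from the lexicon or the unchanged input.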

Relatively speaking, stemming is a simple, lightweight way to merge word forms. Its result is a stem, which does not necessarily carry meaning on its own. Lemmatization is relatively complex. Its result is the base form of the word, which does carry meaning. Compared with stemming, it has more research and application value.
