IKAnalyzer Word Segmentation Process Overview

Source: Internet
Author: User

I can't start without a few opening words. What IK is and how to use it needs little explanation here, and I couldn't explain it very well anyway, so let's go straight to the main text:

IK's official architecture:

From top to bottom:


  • The top layer needs little attention: it consists of adapters that Lucene calls.
  • The main class for IK segmentation is IKSegmenter, the core component of IK. It controls the word-splitting process.
  • Lexeme-processing sub-units: several lexeme analyzers based on different algorithms, each recognizing a different type of lexeme.
  • Dictionary: mainly wraps the dictionary that gives CJKSegmenter its ability to recognize Chinese words. The algorithm for matching dictionary entries against lexemes can be studied here.

ISegmenter has three default implementations in IK:


  1. CJKSegmenter: Chinese, Japanese, and Korean character recognition.
  2. CN_QuantifierSegmenter: Chinese quantifier recognition.
  3. LetterSegmenter: English letter recognition.

IK Workflow


  1. IKAnalyzer is the external entry point. IKTokenizer implements Lucene's Tokenizer and serves as the integration point with Lucene.
  2. IKSegmenter is the main class for word segmentation.
  3. IKSegmenter initializes an AnalyzeContext, which gives each ISegmenter the context it needs for recognition, and then calls the ISegmenter implementations to recognize lexemes. This is probably where the finest-grained lexemes come from.
  4. In step 3, the sub-segmenters are independent of one another, so they may recognize identical or overlapping lexemes, and even a single sub-segmenter can produce overlapping lexemes (for example, [China, People's Republic of China]).
  5. Because the lexemes produced in step 3 may overlap and be ambiguous, IKSegmenter calls IKArbitrator to resolve the ambiguity before returning lexemes. In useSmart mode, overlaps are eliminated this way for everything except quantifiers and numerals.
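
The workflow in steps 3 to 5 can be sketched roughly as follows. This is only a minimal model, not IK's code: the names SubSegmenter, Lexeme, collect, and arbitrate are hypothetical stand-ins for ISegmenter, IK's lexeme type, and IKArbitrator, and the "keep the longest lexeme at each position" rule is an assumed simplification of the real disambiguation logic. Latin text is used so the example stays ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the IK pipeline; none of these types are IK classes.
class SegmentPipelineSketch {
    // A candidate lexeme covering the half-open span [begin, end) of the input.
    static final class Lexeme {
        final int begin;
        final int end;
        final String text;

        Lexeme(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return text + "[" + begin + "," + end + ")";
        }
    }

    // Stand-in for ISegmenter: each sub-segmenter works independently.
    interface SubSegmenter {
        List<Lexeme> analyze(String text);
    }

    // Emits every occurrence of every dictionary word, overlaps included.
    static SubSegmenter dictionarySegmenter(List<String> words) {
        return text -> {
            List<Lexeme> out = new ArrayList<>();
            for (String w : words) {
                for (int i = text.indexOf(w); i >= 0; i = text.indexOf(w, i + 1)) {
                    out.add(new Lexeme(i, i + w.length(), w));
                }
            }
            return out;
        };
    }

    // Emits each maximal run of digits as one lexeme.
    static SubSegmenter digitSegmenter() {
        return text -> {
            List<Lexeme> out = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                if (!Character.isDigit(text.charAt(i))) {
                    i++;
                    continue;
                }
                int start = i;
                while (i < text.length() && Character.isDigit(text.charAt(i))) {
                    i++;
                }
                out.add(new Lexeme(start, i, text.substring(start, i)));
            }
            return out;
        };
    }

    // Steps 3 and 4: run all sub-segmenters and pool their (possibly overlapping) output.
    static List<Lexeme> collect(String text, List<SubSegmenter> segmenters) {
        List<Lexeme> all = new ArrayList<>();
        for (SubSegmenter s : segmenters) {
            all.addAll(s.analyze(text));
        }
        return all;
    }

    // Step 5, simplified: sort by start (longest first on ties), then greedily
    // keep each lexeme that does not overlap the ones already kept.
    static List<Lexeme> arbitrate(List<Lexeme> candidates) {
        List<Lexeme> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end));
        List<Lexeme> kept = new ArrayList<>();
        int covered = 0;
        for (Lexeme l : sorted) {
            if (l.begin >= covered) {
                kept.add(l);
                covered = l.end;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SubSegmenter> segmenters = Arrays.asList(
                dictionarySegmenter(Arrays.asList("no", "now", "nowhere", "where", "here")),
                digitSegmenter());
        List<Lexeme> candidates = collect("nowhere42", segmenters);
        System.out.println("candidates: " + candidates);
        System.out.println("arbitrated: " + arbitrate(candidates));
    }
}
```

With this toy dictionary, the pooled candidates for "nowhere42" include the overlapping "no", "now", "nowhere", "where", and "here", and the simplified arbitration keeps only "nowhere" and "42". The unarbitrated list corresponds loosely to the finest-grained output, the arbitrated one to what a smart mode would return.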


IK's Forward-Iteration Finest-Granularity Segmentation Algorithm


Fine Granularity:

This was already outlined in step 3 of the workflow. More concretely, a sub-segmenter identifies lexemes character by character. Take the input "中华人民共和国" (People's Republic of China) and assume that single characters are also words in the dictionary. The process is as follows: "中" is a lexeme and also a prefix (many words begin with it), so the lexeme "中" is added. Moving on to the next character, "华": because "中" is a prefix, "中华" (China) can be recognized, and since "中华" is also a prefix, the lexeme "中华" is added and kept as the prefix for the next step. Continuing, "华人" (Chinese) turns out to be a lexeme and "中华人" a prefix, and so on.

Iteration:

My understanding is that "iteration" refers to the prefix-driven algorithm above, which yields lexemes such as "Chinese", "people", and "People's Republic of China" (examples only, not the full output).

Forward:

The scan described above runs in the forward direction, from the start of the text toward the end.
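
To make the three terms concrete, here is a small sketch of forward, prefix-driven, finest-granularity matching over a trie. It is an illustration under assumptions, not IK's implementation: the class ForwardFinestMatcher and its methods are hypothetical, Latin letters stand in for Chinese characters so the example stays ASCII, and the handling of characters that match nothing in the dictionary is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not IK's dictionary or CJKSegmenter code.
class ForwardFinestMatcher {
    // One trie node: children keyed by character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds one dictionary word to the trie.
    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Forward: start positions move left to right.
    // Iteration: from each start, keep extending while the span is still a prefix.
    // Finest granularity: emit every dictionary word met along the way.
    List<String> segment(String text) {
        List<String> lexemes = new ArrayList<>();
        for (int begin = 0; begin < text.length(); begin++) {
            Node node = root;
            for (int end = begin; end < text.length(); end++) {
                node = node.children.get(text.charAt(end));
                if (node == null) {
                    break; // the span is no longer a prefix of any word
                }
                if (node.isWord) {
                    lexemes.add(text.substring(begin, end + 1));
                }
            }
        }
        return lexemes;
    }

    public static void main(String[] args) {
        ForwardFinestMatcher matcher = new ForwardFinestMatcher();
        for (String w : new String[] {"no", "now", "nowhere", "where", "he", "her", "here"}) {
            matcher.addWord(w);
        }
        System.out.println(matcher.segment("nowhere"));
    }
}
```

For the input "nowhere" this prints [no, now, nowhere, where, he, her, here]: "no" plays the role of a lexeme that is also a prefix, just as "中" and "中华" do in the walkthrough above.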