IKAnalyzer word segmentation process overview

Last Update:2014-11-03 Source: Internet

Author: User

I can't start without a few opening words. What IK is and how to use it needs little explanation from me, and I could not explain it very well anyway. The main text starts below:

IK's official architecture:

From top to bottom:

The top layer needs little attention. It consists of adapters that let Lucene call IK.
The main class for IK segmentation is IKSegmenter, the core component of IK. It controls the segmentation process.
Lexeme-processing sub-units: several lexeme analyzers built on different algorithms, each recognizing a different type of lexeme.
Dictionary: mainly wraps the dictionary that gives CJKSegmenter its ability to recognize Chinese words. The algorithm for matching lexemes against the dictionary is worth studying here.
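The dictionary matching mentioned above can be pictured as a lookup in a prefix tree. The following is only a minimal Python sketch under that assumption, not IK's actual Java code: the names DictNode, Dictionary, and match are hypothetical. For a given string it reports two things: whether the string is a complete dictionary word, and whether it is the prefix of some longer word.

```python
class DictNode:
    """One node of a prefix tree (hypothetical name, not an IK class)."""

    def __init__(self):
        self.children = {}    # next character -> DictNode
        self.is_word = False  # True if the path to this node spells a word


class Dictionary:
    """Illustrative dictionary backed by a prefix tree."""

    def __init__(self, words):
        self.root = DictNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, DictNode())
            node.is_word = True

    def match(self, text):
        """Return (is_match, is_prefix) for text."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return (False, False)  # no dictionary word starts this way
        return (node.is_word, bool(node.children))


d = Dictionary(["中", "中华", "中华人民共和国"])
print(d.match("中"))      # (True, True): a word and also a prefix
print(d.match("中华人"))  # (False, True): only a prefix
```

Returning both flags at once is what lets a segmenter emit a lexeme and still keep extending the same candidate.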

ISegmenter has three default implementations in IK:

CJKSegmenter: Chinese, Japanese, and Korean recognition.
CN_QuantifierSegmenter: Chinese quantifier recognition.
LetterSegmenter: English letter recognition.

IK workflow

[1] IKAnalyzer is the external entry point. IKTokenizer extends Lucene's Tokenizer class and is the integration point with Lucene.
[2] IKSegmenter is the main class for word segmentation.
[3] IKSegmenter initializes an AnalyzeContext, which gives each ISegmenter the context it needs for recognition, and then calls the ISegmenter implementations to recognize lexemes. This is probably where the finest-grained segmentation comes from.
[4] In [3], the segmenters are independent of one another: different segmenters can recognize identical or overlapping lexemes, and even a single segmenter can recognize overlapping lexemes (for example, [China, People's Republic of China]).
[5] The lexemes produced in [3] may overlap and be ambiguous, so IKSegmenter calls IKArbitrator to resolve the ambiguity before returning lexemes. In useSmart mode, this method is used to eliminate overlapping lexemes for everything except quantifiers and numerals.
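The disambiguation step can be illustrated with a deliberately simplified rule. The Python sketch below is not IKArbitrator's algorithm. It swaps in a plain greedy strategy (take the longest lexeme at the earliest position, then skip anything that overlaps it) just to show what "eliminating overlapping lexemes" means. The function name resolve_overlaps and the (begin, end, text) tuples are assumptions made for this example.

```python
def resolve_overlaps(lexemes):
    """Pick a non-overlapping subset of (begin, end, text) lexemes.

    Greedy simplification: earliest start first, longest first on ties.
    IK's real arbitrator uses its own, more elaborate criteria.
    """
    ordered = sorted(lexemes, key=lambda lx: (lx[0], -(lx[1] - lx[0])))
    result = []
    covered_to = 0  # end offset of the last lexeme that was kept
    for begin, end, text in ordered:
        if begin >= covered_to:
            result.append((begin, end, text))
            covered_to = end
    return result


candidates = [
    (0, 1, "中"),
    (0, 2, "中华"),
    (2, 4, "人民"),
    (0, 7, "中华人民共和国"),
    (4, 7, "共和国"),
]
print(resolve_overlaps(candidates))
# [(0, 7, '中华人民共和国')]
```

Without this step, all five candidates would be returned, which corresponds to the finest-grained output.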

IK's forward-iteration, finest-granularity segmentation algorithm

Finest granularity:

This was already covered in step 3 of the workflow above. More specifically, each segmenter recognizes lexemes character by character. Suppose the input is "中华人民共和国" (People's Republic of China), and the single character "中" is also a word in the dictionary. The process is as follows: "中" is a lexeme and also a prefix (because many words begin with it), so the lexeme "中" is added. Moving on to the next character, "华": because "中" is a prefix, "中华" can be recognized, and "中华" is itself also a prefix, so "中华" is added as a lexeme and kept as a prefix to continue with. Next, a longer string is found to be a lexeme, a further one is found to be a prefix, and so on.
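The walk-through above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not IK's source: the function name finest_grained_lexemes is hypothetical, the dictionary is a plain set of words, and the sample words are chosen only for the example. At each character the sketch extends every candidate that is still a dictionary prefix, starts a new candidate at the current character, and emits every candidate that is a complete word.

```python
def finest_grained_lexemes(text, words):
    """Emit every dictionary word found in text, scanning left to right."""
    words = set(words)
    # Every proper prefix of a dictionary word, e.g. "中华人" for "中华人民".
    prefixes = {w[:i] for w in words for i in range(1, len(w))}
    lexemes = []
    active = []  # start offsets of candidates that are still prefixes
    for i in range(len(text)):
        still_active = []
        for start in active + [i]:
            candidate = text[start:i + 1]
            if candidate in words:
                lexemes.append((start, i + 1, candidate))
            if candidate in prefixes:
                still_active.append(start)  # keep extending this one
        active = still_active
    return lexemes


sample_words = {"中", "中华", "人民", "共和国", "中华人民共和国"}
found = finest_grained_lexemes("中华人民共和国", sample_words)
print([lexeme for _, _, lexeme in found])
# ['中', '中华', '人民', '中华人民共和国', '共和国']
```

The output contains overlapping lexemes, which is exactly why a separate disambiguation pass is needed when a single, non-overlapping segmentation is wanted.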

Iteration:

My own understanding is that "iteration" refers to the prefix iteration described above, which produces, for example, "Chinese", "people", and "People's Republic of China" (examples only, not the full output).

Forward:

The process above runs in the forward direction, scanning the text from start to end.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
