Elasticsearch ships with many built-in analyzers (word breakers), but the default analyzers do not handle Chinese well, so a separate plugin is needed. Two commonly used options that give good results are smartcn (derived from the ICTCLAS segmenter from the Chinese Academy of Sciences) and IK Analyzer. At the moment IK Analyzer does not support the latest Elasticsearch 2.2.0, while smartcn is officially supported: it provides an analyzer for Chinese or mixed Chinese-English text and already works with 2.2.0. However, smartcn does not support custom dictionaries, so it is best used as a starting point for testing. The following sections describe how to get both working on the latest version.
Smartcn
Install the analyzer: bin/plugin install analysis-smartcn
Uninstall: bin/plugin remove analysis-smartcn
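Restart Elasticsearch after installing so the plugin is loaded. As a quick sanity check, the plugin script can list what is installed (a minimal sketch, assuming it is run from the Elasticsearch home directory):
# list installed plugins; analysis-smartcn should appear in the output
bin/plugin list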
Test:
Request: POST http://127.0.0.1:9200/_analyze/
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}
(The text means "Lenovo is the world's largest notebook manufacturer.")
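For reference, the same call can be made from the command line with curl (a minimal sketch, assuming Elasticsearch is listening on 127.0.0.1:9200):
curl -XPOST 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}'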
Result:
{
  "tokens": [
    { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 },
    { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3 },
    { "token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4 },
    { "token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5 },
    { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6 },
    { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7 }
  ]
}
As a comparison, let's look at the output of the standard analyzer: in the request above, change smartcn to standard. The modified request is sketched below.
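The request body with the standard analyzer (everything else stays the same):
{
  "analyzer": "standard",
  "text": "联想是全球最大的笔记本厂商"
}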
The result:
{
  "tokens": [
    { "token": "联", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "记", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "厂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 }
  ]
}
As you can see, this output is basically unusable: each Chinese character becomes its own token.
This article is an original piece by secisland (赛克蓝德); when reposting, please credit the author and source.
IK Analyzer support for the 2.2.0 version
Currently, the latest release on GitHub only supports Elasticsearch 2.1.1; the project lives at https://github.com/medcl/elasticsearch-analysis-ik. Since the newest Elasticsearch is now 2.2.0, a small change is needed before the plugin can be used.
1. Download the source code to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.
2. Build the code with mvn package (a command-line sketch of steps 2 to 4 follows this list).
3. After the build completes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Unzip the file into the Elasticsearch plugins directory.
5. Add one line to the configuration file (elasticsearch.yml): index.analysis.analyzer.ik.type: "ik"
6. Restart Elasticsearch.
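A command-line sketch of steps 2 to 4 (the Elasticsearch install path is illustrative; adjust it for your environment):
# step 2: build the plugin from the source directory
cd elasticsearch-analysis-ik-master
mvn package
# step 3 produces target/releases/elasticsearch-analysis-ik-1.7.0.zip
# step 4: unzip it into the Elasticsearch plugins directory
unzip target/releases/elasticsearch-analysis-ik-1.7.0.zip -d /path/to/elasticsearch/plugins/ik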
Test: the same request as above, just replace the analyzer with ik (a sketch of the request follows).
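The request body:
{
  "analyzer": "ik",
  "text": "联想是全球最大的笔记本厂商"
}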
Results returned:
{
  "tokens": [
    { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 },
    { "token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2 },
    { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3 },
    { "token": "笔记", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4 },
    { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5 },
    { "token": "记", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 },
    { "token": "本厂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8 }
  ]
}
From this we can see that the two analyzers produce different segmentations.
To extend the dictionary, add the desired phrases to mydict.dic under config\ik\custom and then restart Elasticsearch. Note that the file must be saved as UTF-8 without a BOM.
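For instance, appending a word from the command line could look like this (a sketch; the dictionary file takes one word per line, and the path is relative to the directory mentioned above):
# append the company name 赛克蓝德 (secisland) as a new dictionary entry; keep the file UTF-8 without a BOM
echo "赛克蓝德" >> config/ik/custom/mydict.dic
# restart Elasticsearch so the new entry is picked up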
For example, after adding the word 赛克蓝德 (secisland), test again:
Request: POST http://127.0.0.1:9200/_analyze/
Parameters:
{
  "analyzer": "ik",
  "text": "赛克蓝德是一家数据安全公司"
}
(The text means "secisland is a data security company.")
Result:
{
  "tokens": [
    { "token": "赛克蓝德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "蓝", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 },
    { "token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5 },
    { "token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6 },
    { "token": "数据", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 },
    { "token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8 },
    { "token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9 }
  ]
}
From the above results we can see that the word 赛克蓝德 (secisland) is now recognized as a single term.
secisland (赛克蓝德) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the secisland WeChat public account.
Elasticsearch 2.2.0 word segmentation series: Chinese analyzers