Elasticsearch 2.2.0 analysis: Chinese word segmentation

Source: Internet
Author: User
Tags: lenovo

Elasticsearch ships with many built-in analyzers, but the default analyzer's support for Chinese is poor, so a separate plugin is needed. Two commonly used options that work well are smartcn (based on the ICTCLAS segmenter from the Chinese Academy of Sciences) and IK Analyzer. IK Analyzer does not currently support the latest Elasticsearch 2.2.0, while the smartcn analyzer is officially supported by default: it provides an analyzer for Chinese or mixed Chinese-English text and already works with version 2.2.0. However, smartcn does not support custom dictionaries, so it is best treated as a starting point for testing. The sections below describe how to get Chinese segmentation working with the latest version.


Smartcn

Install the analyzer plugin: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn


Test:

Request: POST http://127.0.0.1:9200/_analyze/

{
    "analyzer": "smartcn",
    "text": "联想是全球最大的笔记本厂商"
}

(The text means "Lenovo is the world's largest notebook manufacturer.")
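As a sketch, the request above can be sent from Python with only the standard library. This assumes a node listening on 127.0.0.1:9200, and the helper names are illustrative, not part of any official client:

```python
import json
from urllib import request

# Assumed local Elasticsearch 2.2.0 node, as in the article's examples.
ES_URL = "http://127.0.0.1:9200/_analyze"

def build_analyze_request(analyzer, text):
    """Serialize the _analyze request body shown above."""
    return json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")

def analyze(analyzer, text):
    """POST the body to _analyze and return the parsed response."""
    req = request.Request(ES_URL,
                          data=build_analyze_request(analyzer, text),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires a running node):
# print(analyze("smartcn", "联想是全球最大的笔记本厂商"))
```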

Returned result:

{
    "tokens": [
        {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
        {"token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1},
        {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2},
        {"token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3},
        {"token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4},
        {"token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5},
        {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6},
        {"token": "厂商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7}
    ]
}
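The start_offset and end_offset of each token index directly into the original string, which can be checked with a short Python snippet (using an abbreviated copy of the response above; the variable names are illustrative):

```python
import json

# The analyzed sentence and an abbreviated _analyze-style response.
text = "联想是全球最大的笔记本厂商"
response = json.loads("""
{
  "tokens": [
    {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6}
  ]
}
""")

# Each token's offsets slice the original string back out exactly.
for t in response["tokens"]:
    assert text[t["start_offset"]:t["end_offset"]] == t["token"]

tokens = [t["token"] for t in response["tokens"]]
print(tokens)  # ['联想', '笔记本']
```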

For comparison, let's look at the output of the standard analyzer: change smartcn to standard in the request above.

The result becomes:

{
    "tokens": [
        {"token": "联", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
        {"token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
        {"token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
        {"token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
        {"token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
        {"token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
        {"token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6},
        {"token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7},
        {"token": "笔", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8},
        {"token": "记", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9},
        {"token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10},
        {"token": "厂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11},
        {"token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12}
    ]
}

As you can see, this output is essentially unusable: every single Chinese character becomes its own token.

This article is an original work by Secisland (赛克蓝德); when reprinting, please credit the author and source.


IK Analyzer support for version 2.2.0

At the time of writing, the latest release on GitHub only supports Elasticsearch 2.1.1; the repository is https://github.com/medcl/elasticsearch-analysis-ik. Since the newest Elasticsearch has already reached 2.2.0, the plugin needs some adjustments before it can be used.


1. Download the source code to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.

2. Compile the code: mvn package.

3. After compilation, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target/releases.

4. Unzip the file into the Elasticsearch plugins directory.

5. Add one line to the configuration file: index.analysis.analyzer.ik.type: "ik"

6. Restart Elasticsearch.
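For step 5, the line goes into the node's elasticsearch.yml; shown here as a minimal fragment, with the rest of the file unchanged:

```yaml
# elasticsearch.yml: define an analyzer named "ik" whose type is the
# "ik" analyzer type registered by the plugin
index.analysis.analyzer.ik.type: "ik"
```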

Test: the same request as above, just replace the analyzer with ik.

Results returned:

{
    "tokens": [
        {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
        {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1},
        {"token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2},
        {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3},
        {"token": "笔记", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4},
        {"token": "笔", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5},
        {"token": "记", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6},
        {"token": "本厂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7},
        {"token": "厂商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8}
    ]
}

From this, we can see that the two analyzers segment the text differently.

To extend the dictionary, add the desired phrases to mydict.dic under config/ik/custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without BOM.

For example, after adding the word 赛克蓝德 (Secisland) to the dictionary, query again:

Request: POST http://127.0.0.1:9200/_analyze/

Parameters:

{
    "analyzer": "ik",
    "text": "赛克蓝德是一家数据安全公司"
}

(The text means "Secisland is a data security company.")

Returned result:

{
    "tokens": [
        {"token": "赛克蓝德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0},
        {"token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1},
        {"token": "蓝", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2},
        {"token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3},
        {"token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4},
        {"token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5},
        {"token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6},
        {"token": "数据", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7},
        {"token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8},
        {"token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9}
    ]
}

From these results, we can see that the custom word 赛克蓝德 is now recognized as a single token.
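To see why a single dictionary entry changes the output, here is a toy forward maximum-matching segmenter in Python. This is not IK's actual algorithm (IK combines dictionary matching with other heuristics); it is only a sketch of dictionary-driven segmentation, and the word lists are illustrative:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in dictionary:
                tokens.append(word)
                i += length
                break
    return tokens

# A small base dictionary without the brand name.
base_dict = {"一家", "数据", "安全", "公司"}
text = "赛克蓝德是一家数据安全公司"

print(forward_max_match(text, base_dict))
# → ['赛', '克', '蓝', '德', '是', '一家', '数据', '安全', '公司']
# Without the custom word, the brand name falls apart into single characters.

print(forward_max_match(text, base_dict | {"赛克蓝德"}))
# → ['赛克蓝德', '是', '一家', '数据', '安全', '公司']
# With it, the brand name comes out as one token.
```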

Secisland (赛克蓝德) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the Secisland official account.


