Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation
Elasticsearch ships with many built-in analyzers, but the default analyzer handles Chinese poorly, so a dedicated plugin is usually installed. The two common choices are smartcn, based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer. IKAnalyzer does not yet support the latest Elasticsearch 2.2.0, whereas smartcn is officially supported: it provides an analyzer for mixed Chinese and English text and already works with 2.2.0. However, smartcn does not support user-defined dictionaries, so it is mainly suitable for testing. The sections below cover using smartcn and patching IKAnalyzer so that it also runs on the latest version.
Smartcn
Install the plugin (from the Elasticsearch home directory): bin/plugin install analysis-smartcn
Uninstall: bin/plugin remove analysis-smartcn
Test:
Request: POST http://127.0.0.1:9200/_analyze
{
    "analyzer": "smartcn",
    "text": "联想是全球最大的笔记本厂商"
}
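The request above can be built programmatically as well. A minimal sketch using only the Python standard library, assuming a local node at 127.0.0.1:9200 (the helper name is illustrative); the serialized body can then be sent with curl or any HTTP client:

```python
import json

# Assumed local node address; adjust for your deployment.
ES_URL = "http://127.0.0.1:9200/_analyze"

def build_analyze_body(analyzer, text):
    """Serialize an _analyze request body (ES 2.x accepts analyzer + text as JSON)."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

body = build_analyze_body("smartcn", "联想是全球最大的笔记本厂商")
```

For example: curl -XPOST 'http://127.0.0.1:9200/_analyze' -d "$body"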
Returned results:
{
    "tokens": [
        {
            "token": "联想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "全球",
            "start_offset": 3,
            "end_offset": 5,
            "type": "word",
            "position": 2
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "word",
            "position": 3
        },
        {
            "token": "大",
            "start_offset": 6,
            "end_offset": 7,
            "type": "word",
            "position": 4
        },
        {
            "token": "的",
            "start_offset": 7,
            "end_offset": 8,
            "type": "word",
            "position": 5
        },
        {
            "token": "笔记本",
            "start_offset": 8,
            "end_offset": 11,
            "type": "word",
            "position": 6
        },
        {
            "token": "厂商",
            "start_offset": 11,
            "end_offset": 13,
            "type": "word",
            "position": 7
        }
    ]
}
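When consuming responses like the one above, usually only the token strings matter. A small hypothetical helper (the function name and the truncated sample are illustrative, shaped like the smartcn output above):

```python
def extract_tokens(response):
    """Return just the token strings from an _analyze response, in order."""
    return [t["token"] for t in response["tokens"]]

# Truncated sample shaped like the smartcn output above
sample = {
    "tokens": [
        {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
        {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2},
        {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6},
    ]
}
print(extract_tokens(sample))  # → ['联想', '全球', '笔记本']
```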
For comparison, here is the result of the standard analyzer. In the request, simply replace smartcn with standard.
The returned result:
{
    "tokens": [
        {
            "token": "联",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "想",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "全",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "球",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "大",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "的",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "笔",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "记",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "本",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "厂",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "商",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        }
    ]
}
As you can see, the standard analyzer emits each Chinese character as a separate token, which is essentially useless for Chinese search.
This article is an original work by secisland; please credit the author and source when reproducing it.
Making IKAnalyzer support version 2.2.0
At the time of writing, the latest version on GitHub only supports Elasticsearch 2.1.1 (https://github.com/medcl/elasticsearch-analysis-ik), but Elasticsearch is now at 2.2.0, so the plugin needs a small patch before it will work.
1. Download the source code, unpack it to any directory, and edit pom.xml in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> element and change the version number to 2.2.0.
2. Build the code with mvn package.
3. The build generates elasticsearch-analysis-ik-1.7.0.zip under target/releases.
4. Unzip that file into the Elasticsearch plugins directory.
5. Add one line to the configuration file: index.analysis.analyzer.ik.type: "ik"
6. Restart Elasticsearch.
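Step 5 above amounts to a single line in elasticsearch.yml. A sketch of the fragment (ES 2.x index-level setting syntax):

```yaml
# elasticsearch.yml — register an analyzer named "ik" whose type is "ik"
index.analysis.analyzer.ik.type: "ik"
```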
Test: issue the same request as before, replacing the analyzer with ik.
Returned results:
{
    "tokens": [
        {
            "token": "联想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "全球",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "最大",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "笔记本",
            "start_offset": 8,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "笔记",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "笔",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "记",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "本厂",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "厂商",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 8
        }
    ]
}
As you can see, the two analyzers segment the same text quite differently.
To extend the dictionary, add the desired phrases to mydict.dic under config/ik/custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
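The encoding requirement is easy to get wrong when editing the file by hand. A small sketch that writes the dictionary safely (the temp-file path is only for the demo; the real target is config/ik/custom/mydict.dic):

```python
import os
import tempfile

def write_dict(path, words):
    """Write custom words, one per line, as UTF-8 *without* a BOM, as IK expects."""
    # encoding="utf-8" emits no BOM; "utf-8-sig" would prepend one and break IK
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")

# Demo against a temp file; the real target is config/ik/custom/mydict.dic
path = os.path.join(tempfile.gettempdir(), "mydict.dic")
write_dict(path, ["赛克蓝德"])
with open(path, "rb") as f:
    raw = f.read()
assert not raw.startswith(b"\xef\xbb\xbf")  # no byte-order mark
```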
For example, add the new word 赛克蓝德 (secisland), then query again:
Request: POST http://127.0.0.1:9200/_analyze
Parameters:
{
    "analyzer": "ik",
    "text": "赛克蓝德是一家数据安全公司"
}
Returned results:
{
    "tokens": [
        {
            "token": "赛克蓝德",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "克",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "蓝",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "德",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "一家",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "一",
            "start_offset": 5,
            "end_offset": 6,
            "type": "TYPE_CNUM",
            "position": 5
        },
        {
            "token": "家",
            "start_offset": 6,
            "end_offset": 7,
            "type": "COUNT",
            "position": 6
        },
        {
            "token": "数据",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "安全",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "公司",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}
The results show that the newly added custom word is now recognized as a single token.