Elasticsearch 2.2.0 Word Segmentation: Chinese Word Segmentation


Elasticsearch ships with many built-in analyzers, but the default tokenizer handles Chinese poorly, so a separate plugin is needed. The two most common choices are smartcn (based on ICTCLAS from the Chinese Academy of Sciences) and IKAnalyzer. IKAnalyzer does not yet support the latest Elasticsearch 2.2.0, whereas smartcn is an officially supported plugin: it provides an analyzer for mixed Chinese and English text and already works with 2.2.0. However, smartcn does not support custom dictionaries, so it is mainly useful for testing. The sections below cover using smartcn and then patching IKAnalyzer to support the latest version.
 
Smartcn

Install the analyzer plugin: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn
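
To confirm the plugin was picked up (a quick check; this assumes the commands are run from the Elasticsearch home directory, where the plugin script lives under bin):

bin/plugin list

Restart Elasticsearch after installing so the new analyzer is loaded.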

Test:

Request: POST http://127.0.0.1:9200/_analyze

{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}

(The sample text means "Lenovo is the world's largest notebook manufacturer.")
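
The same request as a complete command (a sketch, assuming Elasticsearch is listening locally on the default port 9200):

curl -XPOST 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}'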

Returned results:

{
  "tokens": [
    {
      "token": "联想",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "全球",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "最",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 3
    },
    {
      "token": "大",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 4
    },
    {
      "token": "的",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 5
    },
    {
      "token": "笔记本",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 6
    },
    {
      "token": "厂商",
      "start_offset": 11,
      "end_offset": 13,
      "type": "word",
      "position": 7
    }
  ]
}
 

For comparison, let's look at the output of the standard analyzer. In the request, replace smartcn with standard.
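
The request body becomes:

{
  "analyzer": "standard",
  "text": "联想是全球最大的笔记本厂商"
}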

Then read the returned results:

{
  "tokens": [
    {
      "token": "联",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "想",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "全",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "球",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "最",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "大",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "的",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "笔",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "记",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "本",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "厂",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "商",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    }
  ]
}
 

As you can see, the standard analyzer emits each Chinese character as its own token, which is essentially useless for Chinese search.

This article is original content by secisland; when reposting, please credit the author and source.
 
Making IKAnalyzer support version 2.2.0

Currently, the latest version on GitHub only supports Elasticsearch 2.1.1; the project lives at https://github.com/medcl/elasticsearch-analysis-ik. Since Elasticsearch has now reached 2.2.0, the plugin must be patched before it can be used:

1. Download the source code, extract it to any directory, and edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> element and change the version number to 2.2.0 (see the consolidated sketch after these steps).

2. Build the plugin with mvn package.

3. After the build completes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target/releases.

4. Extract the file into the Elasticsearch plugins directory (a subdirectory such as plugins/ik).

5. Add a line to the Elasticsearch configuration file (elasticsearch.yml): index.analysis.analyzer.ik.type: "ik"

6. Restart Elasticsearch.
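
Putting the steps together, a minimal sketch of the whole procedure (assuming a Linux shell, Maven installed, and ES_HOME pointing at the Elasticsearch installation; the plugin subdirectory name ik is a choice, not mandated by the source):

git clone https://github.com/medcl/elasticsearch-analysis-ik.git
cd elasticsearch-analysis-ik
# edit pom.xml: set <elasticsearch.version>2.2.0</elasticsearch.version>
mvn package
# deploy the freshly built plugin
mkdir -p $ES_HOME/plugins/ik
unzip target/releases/elasticsearch-analysis-ik-1.7.0.zip -d $ES_HOME/plugins/ik
# register the ik analyzer
echo 'index.analysis.analyzer.ik.type: "ik"' >> $ES_HOME/config/elasticsearch.yml
# restart Elasticsearch to load the plugin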

Test: use the same request as before, replacing the analyzer with ik.
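
The request body becomes:

{
  "analyzer": "ik",
  "text": "联想是全球最大的笔记本厂商"
}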

Returned results:

{
  "tokens": [
    {
      "token": "联想",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "全球",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "最大",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "笔记本",
      "start_offset": 8,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "笔记",
      "start_offset": 8,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "笔",
      "start_offset": 8,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "记",
      "start_offset": 9,
      "end_offset": 10,
      "type": "CN_CHAR",
      "position": 6
    },
    {
      "token": "本厂",
      "start_offset": 10,
      "end_offset": 12,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "厂商",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}
 

As you can see, the two analyzers still segment the text quite differently.

To extend the dictionary, add the desired phrases to mydict.dic under config/ik/custom and restart Elasticsearch. Note that the file must be saved as UTF-8 without a BOM.
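
The dictionary file is a plain-text word list, one entry per line. A sketch of the file used for the test below (the custom/mydict.dic path assumes the default IKAnalyzer.cfg.xml shipped with the plugin):

config/ik/custom/mydict.dic:

赛克蓝德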

For example, after adding the new word 赛克蓝德 (secisland), query again:

Request: POST http://127.0.0.1:9200/_analyze

Parameters:

{
  "analyzer": "ik",
  "text": "赛克蓝德是一家数据安全公司"
}

(The sample text means "secisland is a data security company.")
 

Returned results:

{
  "tokens": [
    {
      "token": "赛克蓝德",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "克",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "蓝",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "德",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "一家",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "一",
      "start_offset": 5,
      "end_offset": 6,
      "type": "TYPE_CNUM",
      "position": 5
    },
    {
      "token": "家",
      "start_offset": 6,
      "end_offset": 7,
      "type": "COUNT",
      "position": 6
    },
    {
      "token": "数据",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "安全",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 8
    },
    {
      "token": "公司",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}

From the results above, we can see that 赛克蓝德 is now recognized as a single word.
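
To use the analyzer for search rather than just testing it with _analyze, assign it to a field in an index mapping. A minimal sketch (the index, type, and field names secilog, log, and message are hypothetical, as the source does not show this step):

curl -XPUT 'http://127.0.0.1:9200/secilog' -d '
{
  "mappings": {
    "log": {
      "properties": {
        "message": { "type": "string", "analyzer": "ik" }
      }
    }
  }
}'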

