Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation
Elasticsearch ships with many built-in analyzers, but the default analyzer handles Chinese poorly, so a dedicated plugin is usually installed. The two common choices are smartcn, based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer. IKAnalyzer does not yet support the latest Elasticsearch 2.2.0, whereas smartcn is officially supported: it provides an analyzer for mixed Chinese and English text and already works with 2.2.0. However, smartcn does not support user-defined dictionaries, so it is mainly suitable for testing. The sections below cover using smartcn and patching IKAnalyzer so that it also runs on the latest version.
Smartcn
Install the plugin (from the Elasticsearch home directory): bin/plugin install analysis-smartcn
Uninstall: bin/plugin remove analysis-smartcn
Test:
Request: POST http://127.0.0.1:9200/_analyze
{
    "analyzer": "smartcn",
    "text": "联想是全球最大的笔记本厂商"
}
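The request above can be built programmatically as well. A minimal sketch using only the Python standard library, assuming a local node at 127.0.0.1:9200 (the helper name is illustrative); the serialized body can then be sent with curl or any HTTP client:

```python
import json

# Assumed local node address; adjust for your deployment.
ES_URL = "http://127.0.0.1:9200/_analyze"

def build_analyze_body(analyzer, text):
    """Serialize an _analyze request body (ES 2.x accepts analyzer + text as JSON)."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

body = build_analyze_body("smartcn", "联想是全球最大的笔记本厂商")
```

For example: curl -XPOST 'http://127.0.0.1:9200/_analyze' -d "$body"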
Returned results:
{
    "tokens": [
        {
            "token": "联想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "全球",
            "start_offset": 3,
            "end_offset": 5,
            "type": "word",
            "position": 2
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "word",
            "position": 3
        },
        {
            "token": "大",
            "start_offset": 6,
            "end_offset": 7,
            "type": "word",
            "position": 4
        },
        {
            "token": "的",
            "start_offset": 7,
            "end_offset": 8,
            "type": "word",
            "position": 5
        },
        {
            "token": "笔记本",
            "start_offset": 8,
            "end_offset": 11,
            "type": "word",
            "position": 6
        },
        {
            "token": "厂商",
            "start_offset": 11,
            "end_offset": 13,
            "type": "word",
            "position": 7
        }
    ]
}
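When consuming responses like the one above, usually only the token strings matter. A small hypothetical helper (the function name and the truncated sample are illustrative, shaped like the smartcn output above):

```python
def extract_tokens(response):
    """Return just the token strings from an _analyze response, in order."""
    return [t["token"] for t in response["tokens"]]

# Truncated sample shaped like the smartcn output above
sample = {
    "tokens": [
        {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
        {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2},
        {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6},
    ]
}
print(extract_tokens(sample))  # → ['联想', '全球', '笔记本']
```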
For comparison, here is the result of the standard analyzer. In the request, simply replace smartcn with standard.
The returned result:
{
    "tokens": [
        {
            "token": "联",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "想",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "全",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "球",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "大",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "的",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "笔",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "记",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "本",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "厂",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "商",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        }
    ]
}
As you can see, the standard analyzer emits each Chinese character as a separate token, which is essentially useless for Chinese search.
This article is an original work by secisland; please credit the author and source when reproducing it.
Making IKAnalyzer support version 2.2.0
At the time of writing, the latest version on GitHub only supports Elasticsearch 2.1.1 (https://github.com/medcl/elasticsearch-analysis-ik), but Elasticsearch is now at 2.2.0, so the plugin needs a small patch before it will work.
1. Download the source code, unpack it to any directory, and edit pom.xml in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> element and change the version number to 2.2.0.
2. Build the code with mvn package.
3. The build generates elasticsearch-analysis-ik-1.7.0.zip under target/releases.
4. Unzip that file into the Elasticsearch plugins directory.
5. Add one line to the configuration file: index.analysis.analyzer.ik.type: "ik"
6. Restart Elasticsearch.
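Step 5 above amounts to a single line in elasticsearch.yml. A sketch of the fragment (ES 2.x index-level setting syntax):

```yaml
# elasticsearch.yml — register an analyzer named "ik" whose type is "ik"
index.analysis.analyzer.ik.type: "ik"
```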
Test: issue the same request as before, replacing the analyzer with ik.
Returned results:
{
    "tokens": [
        {
            "token": "联想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "全球",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "最大",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "笔记本",
            "start_offset": 8,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "笔记",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "笔",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "记",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_CHAR",
            "position": 6
        },
        {
            "token": "本厂",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "厂商",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 8
        }
    ]
}
As you can see, the two analyzers segment the same text quite differently.
To extend the dictionary, add the desired phrases to mydict.dic under config/ik/custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
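The encoding requirement is easy to get wrong when editing the file by hand. A small sketch that writes the dictionary safely (the temp-file path is only for the demo; the real target is config/ik/custom/mydict.dic):

```python
import os
import tempfile

def write_dict(path, words):
    """Write custom words, one per line, as UTF-8 *without* a BOM, as IK expects."""
    # encoding="utf-8" emits no BOM; "utf-8-sig" would prepend one and break IK
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")

# Demo against a temp file; the real target is config/ik/custom/mydict.dic
path = os.path.join(tempfile.gettempdir(), "mydict.dic")
write_dict(path, ["赛克蓝德"])
with open(path, "rb") as f:
    raw = f.read()
assert not raw.startswith(b"\xef\xbb\xbf")  # no byte-order mark
```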
For example, add the new word 赛克蓝德 (secisland), then query again:
Request: POST http://127.0.0.1:9200/_analyze
Parameters:
{
    "analyzer": "ik",
    "text": "赛克蓝德是一家数据安全公司"
}
Returned results:
{
    "tokens": [
        {
            "token": "赛克蓝德",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "克",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "蓝",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "德",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "一家",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "一",
            "start_offset": 5,
            "end_offset": 6,
            "type": "TYPE_CNUM",
            "position": 5
        },
        {
            "token": "家",
            "start_offset": 6,
            "end_offset": 7,
            "type": "COUNT",
            "position": 6
        },
        {
            "token": "数据",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "安全",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "公司",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}
The results show that the newly added custom word is now recognized as a single token.