Elasticsearch ships with many built-in analyzers (word breakers), but the default analyzers do not handle Chinese well, so a separate plugin is needed. Two commonly used options that give good results are smartcn (derived from the ICTCLAS segmenter from the Chinese Academy of Sciences) and IK Analyzer. At the moment IK Analyzer does not support the latest Elasticsearch 2.2.0, while smartcn is officially supported: it provides an analyzer for Chinese or mixed Chinese-English text and already works with 2.2.0. However, smartcn does not support custom dictionaries, so it is best used as a starting point for testing. The following sections describe how to get both working on the latest version.
Smartcn
Install the analyzer: bin/plugin install analysis-smartcn
Uninstall: bin/plugin remove analysis-smartcn
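Restart Elasticsearch after installing so the plugin is loaded. As a quick sanity check, the plugin script can list what is installed (a minimal sketch, assuming it is run from the Elasticsearch home directory):
# list installed plugins; analysis-smartcn should appear in the output
bin/plugin list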
Test:
Request: POST http://127.0.0.1:9200/_analyze/
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}
(The text means "Lenovo is the world's largest notebook manufacturer.")
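For reference, the same call can be made from the command line with curl (a minimal sketch, assuming Elasticsearch is listening on 127.0.0.1:9200):
curl -XPOST 'http://127.0.0.1:9200/_analyze' -d '
{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}'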
Result:
{
  "tokens": [
    { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 },
    { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3 },
    { "token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4 },
    { "token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5 },
    { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6 },
    { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7 }
  ]
}
As a comparison, let's look at the output of the standard analyzer: in the request above, change smartcn to standard. The modified request is sketched below.
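The request body with the standard analyzer (everything else stays the same):
{
  "analyzer": "standard",
  "text": "联想是全球最大的笔记本厂商"
}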
The result:
{
  "tokens": [
    { "token": "联", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "记", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "厂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 }
  ]
}
As you can see, this output is basically unusable: each Chinese character becomes its own token.
This article is an original piece by secisland (赛克蓝德); when reposting, please credit the author and source.
IK Analyzer support for the 2.2.0 version
Currently, the latest release on GitHub only supports Elasticsearch 2.1.1; the project lives at https://github.com/medcl/elasticsearch-analysis-ik. Since the newest Elasticsearch is now 2.2.0, a small change is needed before the plugin can be used.
1. Download the source code to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.
2. Build the code with mvn package (a command-line sketch of steps 2 to 4 follows this list).
3. After the build completes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Unzip the file into the Elasticsearch plugins directory.
5. Add one line to the configuration file (elasticsearch.yml): index.analysis.analyzer.ik.type: "ik"
6. Restart Elasticsearch.
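A command-line sketch of steps 2 to 4 (the Elasticsearch install path is illustrative; adjust it for your environment):
# step 2: build the plugin from the source directory
cd elasticsearch-analysis-ik-master
mvn package
# step 3 produces target/releases/elasticsearch-analysis-ik-1.7.0.zip
# step 4: unzip it into the Elasticsearch plugins directory
unzip target/releases/elasticsearch-analysis-ik-1.7.0.zip -d /path/to/elasticsearch/plugins/ik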
Test: the same request as above, just replace the analyzer with ik (a sketch of the request follows).
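The request body:
{
  "analyzer": "ik",
  "text": "联想是全球最大的笔记本厂商"
}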
Results returned:
{
  "tokens": [
    { "token": "联想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 },
    { "token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2 },
    { "token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3 },
    { "token": "笔记", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4 },
    { "token": "笔", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5 },
    { "token": "记", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 },
    { "token": "本厂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "厂商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8 }
  ]
}
From this we can see that the two analyzers produce different segmentations.
To extend the dictionary, add the desired phrases to mydict.dic under config\ik\custom and then restart Elasticsearch. Note that the file must be saved as UTF-8 without a BOM.
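For instance, appending a word from the command line could look like this (a sketch; the dictionary file takes one word per line, and the path is relative to the directory mentioned above):
# append the company name 赛克蓝德 (secisland) as a new dictionary entry; keep the file UTF-8 without a BOM
echo "赛克蓝德" >> config/ik/custom/mydict.dic
# restart Elasticsearch so the new entry is picked up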
For example, after adding the word 赛克蓝德 (secisland), test again:
Request: POST http://127.0.0.1:9200/_analyze/
Parameters:
{
  "analyzer": "ik",
  "text": "赛克蓝德是一家数据安全公司"
}
(The text means "secisland is a data security company.")
Result:
{
  "tokens": [
    { "token": "赛克蓝德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "蓝", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 },
    { "token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5 },
    { "token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6 },
    { "token": "数据", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 },
    { "token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8 },
    { "token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9 }
  ]
}
From the above results we can see that the word 赛克蓝德 (secisland) is now recognized as a single term.
secisland (赛克蓝德) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the secisland WeChat public account.
Elasticsearch 2.2.0 word segmentation series: Chinese analyzers