Solr's built-in handling of Chinese text is weak, so applications that index Chinese usually add a dedicated Chinese tokenizer. IK Analyzer is one of the better Chinese tokenizers.
First, version information
Solr version: 4.7.0
Required IK Analyzer version: IK Analyzer 2012FF_hf1
IK Analyzer download address: http://code.google.com/p/ik-analyzer/downloads/list
Second, the configuration steps
Download the archive and extract it. The extracted folder contains, among other files, IKAnalyzer2012FF_u1.jar, IKAnalyzer.cfg.xml and stopword.dic.
Copy IKAnalyzer2012FF_u1.jar to solr\WEB-INF\lib under the Solr service.
Copy IKAnalyzer.cfg.xml and stopword.dic to the conf directory of the core that needs the tokenizer, that is, the directory that holds the core's schema.xml file.
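The two copy steps can also be scripted. The following is a minimal sketch, not part of the IK Analyzer distribution: the function name install_ik and its three directory arguments are hypothetical, and the exact casing of the file names is assumed from the archive contents described above.

```python
import shutil
from pathlib import Path


def install_ik(extracted: Path, webapp_lib: Path, core_conf: Path) -> list:
    """Copy the IK Analyzer jar and its two config files into a Solr install.

    All three arguments are directories chosen by the caller (hypothetical
    layout); nothing here is discovered automatically.
    """
    # The jar goes on the Solr webapp classpath (WEB-INF/lib).
    # The two config files go next to the core's schema.xml (conf).
    plan = [
        ("IKAnalyzer2012FF_u1.jar", webapp_lib),
        ("IKAnalyzer.cfg.xml", core_conf),
        ("stopword.dic", core_conf),
    ]
    copied = []
    for name, target_dir in plan:
        source = extracted / name
        if not source.is_file():
            raise FileNotFoundError(f"missing in extracted archive: {source}")
        if not target_dir.is_dir():
            # A missing target usually means a wrong path, so fail loudly.
            raise NotADirectoryError(f"target directory not found: {target_dir}")
        copied.append(Path(shutil.copy2(source, target_dir / name)))
    return copied
```

A call would look like install_ik(Path("ik"), Path("solr/WEB-INF/lib"), Path("mycore/conf")), where all three paths are placeholders for the real locations. Solr has to be restarted afterwards so that the new jar is loaded.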
In the core's schema.xml, add the following configuration between the <types> and </types> tags:
<fieldType name="text_ik" class="solr.TextField">
<analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
This defines a text_ik field type whose analyzer is IK Analyzer.
Any field in this core's schema.xml can now use text_ik as its type:
<field name="name" type="text_ik" indexed="true" stored="true"/>
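Class and type names in schema.xml are case-sensitive, so a typo in either place leaves the field without the intended analyzer. A small sanity check along the following lines can list which fields actually resolve to an IK-analyzed type. This is only a sketch: the helper fields_using_ik and the SAMPLE schema are illustrative, and the script inspects the XML text only, without contacting Solr.

```python
import xml.etree.ElementTree as ET

IK_CLASS = "org.wltea.analyzer.lucene.IKAnalyzer"

# Illustrative schema fragment mirroring the configuration described above.
SAMPLE = """<schema name="example" version="1.5">
  <types>
    <fieldType name="text_ik" class="solr.TextField">
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    </fieldType>
  </types>
  <fields>
    <field name="name" type="text_ik" indexed="true" stored="true"/>
  </fields>
</schema>"""


def fields_using_ik(schema_xml: str) -> list:
    """Return the names of <field> elements whose type uses IKAnalyzer."""
    root = ET.fromstring(schema_xml)
    ik_types = set()
    # Match either spelling of the element name found in schema files.
    for tag in ("fieldType", "fieldtype"):
        for field_type in root.iter(tag):
            analyzer = field_type.find("analyzer")
            if analyzer is not None and analyzer.get("class") == IK_CLASS:
                ik_types.add(field_type.get("name"))
    return [f.get("name") for f in root.iter("field") if f.get("type") in ik_types]


if __name__ == "__main__":
    print(fields_using_ik(SAMPLE))  # prints ['name']
```

To check a real core, read its schema.xml from disk and pass the contents to the same function.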
Third, Chinese word segmentation test
IKT
text | raw_bytes | start | end | type | position
中华人民共和国 (People's Republic of China) | [e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd] | 0 | 7 | CN_WORD | 1
中华人民 (Chinese people) | [e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91] | 0 | 4 | CN_WORD | 2
中华 (China) | [e4 b8 ad e5 8d 8e] | 0 | 2 | CN_WORD | 3
华人 (ethnic Chinese) | [e5 8d 8e e4 ba ba] | 1 | 3 | CN_WORD | 4
人民共和国 (People's Republic) | [e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd] | 2 | 7 | CN_WORD | 5
人民 (people) | [e4 ba ba e6 b0 91] | 2 | 4 | CN_WORD | 6
共和国 (republic) | [e5 85 b1 e5 92 8c e5 9b bd] | 4 | 7 | CN_WORD | 7
共和 (republican) | [e5 85 b1 e5 92 8c] | 4 | 6 | CN_WORD | 8
国 (country) | [e5 9b bd] | 6 | 7 | CN_CHAR | 9
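The start and end columns are character offsets into the input, and raw_bytes is the UTF-8 encoding of each token, so the two can be checked against each other. The sketch below assumes the test input was 中华人民共和国 (People's Republic of China), which is the string the first token covers. The names TEXT, TOKENS and decode_raw_bytes are illustrative only.

```python
# Assumed analysis input: the 7-character string covered by the first token.
TEXT = "中华人民共和国"

# (start, end, raw_bytes) for each token, in position order 1..9.
TOKENS = [
    (0, 7, "e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd"),
    (0, 4, "e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91"),
    (0, 2, "e4 b8 ad e5 8d 8e"),
    (1, 3, "e5 8d 8e e4 ba ba"),
    (2, 7, "e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd"),
    (2, 4, "e4 ba ba e6 b0 91"),
    (4, 7, "e5 85 b1 e5 92 8c e5 9b bd"),
    (4, 6, "e5 85 b1 e5 92 8c"),
    (6, 7, "e5 9b bd"),
]


def decode_raw_bytes(raw: str) -> str:
    """Turn a space-separated hex dump back into the token text."""
    return bytes.fromhex(raw).decode("utf-8")


def check_tokens(text: str, tokens: list) -> list:
    """Return the token strings, verifying offsets and bytes agree."""
    result = []
    for start, end, raw in tokens:
        # For these characters one offset unit equals one Python str index.
        by_offset = text[start:end]
        if by_offset != decode_raw_bytes(raw):
            raise ValueError(f"offsets {start}-{end} do not match bytes {raw}")
        result.append(by_offset)
    return result


if __name__ == "__main__":
    for position, token in enumerate(check_tokens(TEXT, TOKENS), start=1):
        print(position, token)
```

The overlapping offsets (for example 0-2, 1-3 and 2-4) show that the analyzer emits every dictionary word it finds in the input rather than a single non-overlapping segmentation.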