Solr's built-in handling of Chinese text is not very good, so applications that index Chinese usually need to add a Chinese word breaker. IK Analyzer is one of the better Chinese word breakers.
First, version information
Solr version: 4.7.0
Required IK Analyzer version: IK Analyzer 2012FF_hf1
IK Analyzer download: http://code.google.com/p/ik-analyzer/downloads/list
Second, the configuration steps
Download the archive and extract it. The extracted folder contains the jar, the configuration file, and the stop word dictionary used in the steps below.
Copy IKAnalyzer2012FF_u1.jar to the solr\WEB-INF\lib directory of the Solr service.
Copy IKAnalyzer.cfg.xml and stopword.dic to the conf directory of the core that needs the word breaker, that is, the same directory as that core's schema.xml file.
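The two copy steps above can be scripted. The following is a minimal sketch, not part of the IK Analyzer distribution: the function name `install_ik` and all three directory arguments are hypothetical and must be supplied by the caller, and the file names are assumed to match the ones named in the steps above.

```python
import shutil
from pathlib import Path


def install_ik(ik_dir, webapp_lib, core_conf):
    """Copy the IK Analyzer jar and config files into a Solr installation.

    ik_dir     -- folder extracted from the IK Analyzer archive
    webapp_lib -- the solr/WEB-INF/lib directory of the Solr web application
    core_conf  -- the conf directory of the core (where schema.xml lives)
    All three paths come from the caller; nothing is discovered automatically.
    """
    ik_dir, webapp_lib, core_conf = Path(ik_dir), Path(webapp_lib), Path(core_conf)

    # The jar goes next to Solr's other web application libraries.
    copied = [shutil.copy2(ik_dir / "IKAnalyzer2012FF_u1.jar", webapp_lib)]

    # The config file and stop word dictionary go beside the core's schema.xml.
    for name in ("IKAnalyzer.cfg.xml", "stopword.dic"):
        copied.append(shutil.copy2(ik_dir / name, core_conf))

    return [Path(p) for p in copied]
```

Calling `install_ik(extracted_folder, solr_webapp_lib, core_conf_dir)` with the paths of a local installation returns the three destination paths. Solr typically has to be restarted before it picks up a newly added jar.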
Edit the core's schema.xml and add the following configuration between the <types></types> elements:
    <fieldType name="text_ik" class="solr.TextField">
        <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    </fieldType>
This defines a text_ik field type whose word breaker is IK Analyzer.
Any field in this core's schema.xml can now use text_ik as its type, for example:
    <field name="name" type="text_ik" indexed="true" stored="true" multiValued="false" />
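Because Solr treats field type names as case-sensitive, a field whose `type` does not exactly match a declared `<fieldType>` name will not load. The sketch below is one way to check that before restarting Solr. It is only an illustration: `SCHEMA_FRAGMENT` is a made-up, cut-down schema shaped like the configuration above, and `undeclared_field_types` is a hypothetical helper, not a Solr API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down schema used only for this illustration.
SCHEMA_FRAGMENT = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="text_ik" class="solr.TextField">
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    </fieldType>
  </types>
  <fields>
    <field name="name" type="text_ik" indexed="true" stored="true" multiValued="false"/>
  </fields>
</schema>
"""


def undeclared_field_types(schema_xml):
    """Return {field name: type} for fields whose type has no <fieldType>."""
    root = ET.fromstring(schema_xml)
    # Names are compared exactly, so "Text_ik" and "text_ik" are different.
    declared = {ft.get("name") for ft in root.iter("fieldType")}
    return {
        f.get("name"): f.get("type")
        for f in root.iter("field")
        if f.get("type") not in declared
    }


print(undeclared_field_types(SCHEMA_FRAGMENT))
```

An empty result means every field refers to a declared type. To check a real core, the contents of its schema.xml can be passed in place of the sample string.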
Third, Chinese word segmentation test
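Analyzing the text 中华人民共和国 (People's Republic of China) with the new field type yields nine tokens. The English glosses in parentheses are translations and are not part of the output.

IKT
| text | raw_bytes | start | end | type | position |
|------|-----------|-------|-----|------|----------|
| 中华人民共和国 (People's Republic of China) | [e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd] | 0 | 7 | CN_WORD | 1 |
| 中华人民 (Chinese people) | [e4 b8 ad e5 8d 8e e4 ba ba e6 b0 91] | 0 | 4 | CN_WORD | 2 |
| 中华 (China) | [e4 b8 ad e5 8d 8e] | 0 | 2 | CN_WORD | 3 |
| 华人 (ethnic Chinese) | [e5 8d 8e e4 ba ba] | 1 | 3 | CN_WORD | 4 |
| 人民共和国 (People's Republic) | [e4 ba ba e6 b0 91 e5 85 b1 e5 92 8c e5 9b bd] | 2 | 7 | CN_WORD | 5 |
| 人民 (people) | [e4 ba ba e6 b0 91] | 2 | 4 | CN_WORD | 6 |
| 共和国 (republic) | [e5 85 b1 e5 92 8c e5 9b bd] | 4 | 7 | CN_WORD | 7 |
| 共和 (republican) | [e5 85 b1 e5 92 8c] | 4 | 6 | CN_WORD | 8 |
| 国 (country) | [e5 9b bd] | 6 | 7 | CN_CHAR | 9 |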
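The raw_bytes, start, and end columns of the test output can be cross-checked without Solr: raw_bytes is the UTF-8 encoding of the token, and start/end are character offsets into the input text. The sketch below recomputes them. It assumes the input text was 中华人民共和国 and that each token is located by its first occurrence in that text; `describe` is a hypothetical helper, and IK Analyzer itself is not involved.

```python
TEXT = "中华人民共和国"

# Tokens in the order of the position column in the test output.
TOKENS = ["中华人民共和国", "中华人民", "中华", "华人",
          "人民共和国", "人民", "共和国", "共和", "国"]


def describe(token, text=TEXT):
    """Return (raw_bytes, start, end) for the first occurrence of token."""
    start = text.index(token)  # character offset, not byte offset
    raw = " ".join(f"{b:02x}" for b in token.encode("utf-8"))
    return f"[{raw}]", start, start + len(token)


for position, token in enumerate(TOKENS, start=1):
    raw_bytes, start, end = describe(token)
    print(position, token, raw_bytes, start, end)
```

Each of these characters takes three bytes in UTF-8, so a token spanning n characters should show 3n bytes in raw_bytes. A shorter byte list usually means bytes were lost when the output was copied.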
Solr 4.7 Chinese word breaker (IK Analyzer) configuration