Example of Using mmseg4j Chinese Word Segmentation in Solr
Copyright notice: this article may be reproduced freely. Reprints must credit the original source with a hyperlink, as in the following statement.
Source: http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html
The first version of mmseg4j could already be integrated with Solr easily. There is a brief description on Google Code, and the blog post announcing the first release, "Chinese Word Segmentation mmseg4j", also includes short usage notes. This post explains in more detail how to use mmseg4j Chinese word segmentation in Solr.
There are currently two versions of mmseg4j. Version 1.7 is relatively memory-hungry (one dictionary directory takes about 50 MB), so the default JVM memory size leads to an OutOfMemoryError. Because this example uses two dictionary directories, I use version 1.6.2 rather than the latest version, 1.7.2. Download mmseg4j-1.6.2 and the dictionary, or download the source package (which includes the dictionary; for building from source, see "Chinese Word Segmentation mmseg4j 1.7.2 release"), and put mmseg4j-all-1.6.2.jar in solr.home/lib.
mmseg4j supports two main parameters in Solr: mode and dicPath. mode selects the segmentation mode; valid values are simple, complex, and max-word, and an invalid value falls back to max-word. dicPath is the dictionary directory, given as an absolute or relative path. A relative path is resolved against the solr.home directory, so dic means the dictionary files are looked up in solr.home/dic. If dicPath is not specified, the dictionary is looked up in CWD/data by default (the data subdirectory of the directory the program runs from).
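The rules for the two parameters can be summarized in a few lines of code. The following is only a sketch of the behavior described in this post, not mmseg4j's actual implementation: the function names are made up for illustration, the mode names are the ones used in the schema.xml configuration in this post, and POSIX-style paths are assumed to keep the example short.

```python
from pathlib import PurePosixPath

# Mode names as they appear in the schema.xml configuration of this post
VALID_MODES = {"simple", "complex", "max-word"}


def effective_mode(mode):
    """Hypothetical helper: an unrecognized mode falls back to max-word."""
    return mode if mode in VALID_MODES else "max-word"


def effective_dic_dir(dic_path, solr_home, cwd):
    """Hypothetical helper: apply the dicPath lookup rules described above."""
    if not dic_path:
        # dicPath not specified: look in the data subdirectory of the CWD
        return PurePosixPath(cwd) / "data"
    path = PurePosixPath(dic_path)
    if path.is_absolute():
        # absolute path: used as given
        return path
    # relative path: resolved against solr.home
    return PurePosixPath(solr_home) / path
```

For example, effective_dic_dir("dic", "/opt/solr", "/tmp") yields /opt/solr/dic, which mirrors the solr.home/dic case mentioned above.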
Edit the Solr configuration file schema.xml. I add three field types as follows:
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Note: each distinct dictionary directory gets its own dictionary array instance. The configuration above creates two instances, so watch memory usage (version 1.7.2 may run out of memory).
Define several fields:
<field name="simple" type="textSimple" indexed="true" stored="true"/>
<field name="complex" type="textComplex" indexed="true" stored="true"/>
<field name="text" type="textMaxWord" indexed="true" stored="true"/>
Then add the copyField directives (at the end):
<copyField source="text" dest="simple"/>
<copyField source="text" dest="complex"/>
mmseg4j is now configured in Solr. Next, install Solr in Tomcat.
Solr 1.3 has been out for a long time, so I use it for this example. Download solr-1.3.0 and extract it, for example to N:/OpenSource/apache-solr-1.3.0. For details on installing Solr in Tomcat, see "Introduction to Solr usage". You can also search for posts on the topic with keywords such as solr install, solr tomcat, and solr on tomcat.
I install it through the context file TOMCAT_HOME/conf/Catalina/localhost/solr.xml, which points to n:/OpenSource/apache-solr-1.3.0/example/solr. Tomcat 6 may not have the conf/Catalina/localhost directory; if so, create it manually.
Start Tomcat and check the log for messages related to mmseg4j. You can then view the mmseg4j segmentation results at http://localhost:8080/solr/admin/analysis.jsp. Select name in the Field drop-down menu and enter complex as the field name. The segmentation result looks like this, for example:
(Screenshot: mmseg4j analysis debugging in Solr)
Now that it runs, add some documents to try it out. Create a file named mmseg4j-solr-demo-doc.xml under n:/OpenSource/apache-solr-1.3.0/example/exampledocs (the field values below are English renderings of the original Chinese text):
<add>
  <doc>
    <field name="id">1</field>
    <field name="text">Jinghua Times reported on January 23, 2009: yesterday, affected by a strong cold air mass from central Siberia, the city experienced strong winds and cooling. The maximum daytime temperature was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="text">Kim Jong Il arrived in Changchun yesterday to conduct a two-day telephone system inspection in Changchun.</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="text">Professor Chen is studying the origin of life. His graduate students are playing.</field>
  </doc>
  <doc>
    <field name="id">4</field>
    <field name="text">The People's Bank of China is the central bank of the People's Republic of China.</field>
  </doc>
</add>
Then submit it to Solr by running post.jar from cmd, as shown below:
N:\OpenSource\apache-solr-1.3.0\example\exampledocs>java -Durl=http://localhost:8080/solr/update -Dcommit=yes -jar post.jar mmseg4j-solr-demo-doc.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/solr/update..
SimplePostTool: POSTing file mmseg4j-solr-demo-doc.xml
SimplePostTool: COMMITting Solr index changes..
NOTE: mmseg4j-solr-demo-doc.xml must be encoded in UTF-8; otherwise the text will be garbled after submission.
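One way to avoid encoding mistakes is to generate the file programmatically instead of saving it from an editor. The following is a minimal sketch using only the Python 3 standard library; the function names and the sample text are placeholders for illustration and are not part of the original demo.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Hypothetical helper: serialize (id, text) pairs as a Solr <add> message."""
    add = ET.Element("add")
    for doc_id, text in docs:
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="id").text = str(doc_id)
        ET.SubElement(doc, "field", name="text").text = text
    # With an explicit encoding, tostring() returns UTF-8 encoded bytes
    return ET.tostring(add, encoding="utf-8")


def write_add_file(path, docs):
    """Write the message in binary mode so no platform re-encoding happens."""
    with open(path, "wb") as f:
        f.write(build_add_xml(docs))
```

Calling write_add_file("mmseg4j-solr-demo-doc.xml", [(1, "西伯利亚")]) would produce a UTF-8 file that can be submitted with post.jar as shown above.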
Check whether the data is there: http://localhost:8080/solr/select/?q=*:*. If documents are returned, everything is working.
Then search for "西伯利亚" (Siberia).
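The search term in the query URLs that follow is the Chinese word 西伯利亚 (Siberia), percent-encoded as UTF-8. As a minimal sketch (Python 3 standard library only; build_query_url is a made-up helper name, and the base URL is the one used in this post), such a URL can be built like this:

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/select"  # URL used in this post


def build_query_url(field, term):
    """Hypothetical helper: build a highlighted query URL for one field."""
    params = [
        ("indent", "on"),
        ("q", f"{field}:{term}"),
        ("hl", "on"),
        ("hl.fl", "simple,complex,text"),
        ("fl", "id"),
    ]
    # urlencode percent-encodes non-ASCII text as UTF-8 by default;
    # safe=":" keeps the field:term separator readable
    return SOLR_SELECT + "?" + urlencode(params, safe=":")


if __name__ == "__main__":
    print(build_query_url("simple", "西伯利亚"))
```

Passing "complex" or "text" as the field gives the other two query URLs.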
Simple: http://localhost:8080/solr/select?indent=on&q=simple:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id. The result is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="indent">on</str>
      <str name="q">simple:西伯利亚</str>
      <str name="hl.fl">simple,complex,text</str>
      <str name="hl">on</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="highlighting"/>
</response>
Complex: http://localhost:8080/solr/select?indent=on&q=complex:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id. The result is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="indent">on</str>
      <str name="q">complex:西伯利亚</str>
      <str name="hl.fl">simple,complex,text</str>
      <str name="hl">on</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1">
      <arr name="complex">
        <str>Jinghua Times reported on January 23, 2009: yesterday, affected by a strong cold air mass from central <em>Siberia</em>, the city experienced strong winds and cooling. The maximum daytime temperature was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</str>
      </arr>
    </lst>
  </lst>
</response>
Text (actually max-word): http://localhost:8080/solr/select?indent=on&q=text:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id. The result is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">15</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="indent">on</str>
      <str name="q">text:西伯利亚</str>
      <str name="hl.fl">simple,complex,text</str>
      <str name="hl">on</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1">
      <arr name="text">
        <str>Jinghua Times reported that yesterday, due to a strong cold air from <em>West</em> <em>middle</em>, the city experienced strong winds and cooling, and the maximum temperature during the day was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</str>
      </arr>
    </lst>
  </lst>
</response>
The following describes NGramTokenizerFactory word segmentation settings.
NGramTokenizerFactory
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
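With minGramSize and maxGramSize both set to 2, the tokenizer produces overlapping two-character grams rather than dictionary words. The Python sketch below only imitates that idea for illustration; it is not Solr's implementation, and the function name is made up.

```python
def char_ngrams(text, min_size=2, max_size=2):
    """Return all character n-grams of text with min_size <= n <= max_size."""
    grams = []
    for n in range(min_size, max_size + 1):
        # slide a window of n characters across the text
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


if __name__ == "__main__":
    print(char_ngrams("西伯利亚"))  # ['西伯', '伯利', '利亚']
```

Because the same analyzer is configured for both index and query, a query string is split into the same kind of bigrams as the indexed text.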