Solr Chinese Word Segmentation mmseg4j example, NGramTokenizerFactory

An example of using mmseg4j Chinese word segmentation in Solr

Copyright notice: this article may be freely reproduced. When reprinting, you must credit the original source with a hyperlink, that is, include the following statement.

Source: http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html

The first version of mmseg4j was already easy to integrate with Solr. There is a brief description on Google Code, and my blog post on the first release (Chinese Word Segmentation mmseg4j) also includes short usage notes. This post explains in more detail how to use mmseg4j Chinese word segmentation in Solr.

There are currently two versions of mmseg4j. Version 1.7 is relatively memory-hungry (one dictionary directory takes about 50 MB), so with the default JVM heap size it can throw OutOfMemoryError. Because I use two dictionary directories here, I use version 1.6.2 rather than the latest 1.7.2. Download mmseg4j-1.6.2 and the dictionaries, or download a source package (which includes the dictionaries; to build from source, see: Chinese Word Segmentation mmseg4j 1.7.2 release), then put mmseg4j-all-1.6.2.jar into solr.home/lib.

mmseg4j supports two main parameters in Solr: mode and dicPath. mode selects the segmentation mode (valid values: simple, complex, and max-word; if the value is invalid, max-word is used by default). dicPath is the dictionary directory and can be absolute or relative. A relative path is resolved against solr.home; for example, dicPath="dic" looks for dictionary files in solr.home/dic. If dicPath is not specified, the dictionary is looked up in CWD/data by default (the data subdirectory of the directory the program runs from).
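The parameter rules above can be summarized in a few lines of Python. This is only an illustrative sketch of the behavior as this article describes it, not mmseg4j's actual code. The helper names resolve_mode and resolve_dic_path are hypothetical, the mode names follow the values used in the configuration (simple, complex, max-word), and POSIX-style example paths are assumed so the sketch behaves the same on any platform.

```python
from pathlib import PurePosixPath

# Modes named in this article; anything else falls back to max-word.
VALID_MODES = ("simple", "complex", "max-word")


def resolve_mode(mode):
    """Return the effective segmentation mode (hypothetical helper)."""
    return mode if mode in VALID_MODES else "max-word"


def resolve_dic_path(dic_path, solr_home, cwd):
    """Return the directory where dictionary files would be looked up.

    Hypothetical helper that models the lookup rules described in the text.
    """
    if not dic_path:
        # No dicPath given: default to the data subdirectory of the CWD.
        return PurePosixPath(cwd) / "data"
    path = PurePosixPath(dic_path)
    # Absolute paths are used as-is; relative ones resolve against solr.home.
    return path if path.is_absolute() else PurePosixPath(solr_home) / path


# Example paths below are assumptions, not taken from the article.
print(resolve_mode("complex"))
print(resolve_dic_path("dic", "/opt/solr/home", "/opt/tomcat/bin"))
print(resolve_dic_path(None, "/opt/solr/home", "/opt/tomcat/bin"))
```

The third call shows why the default can be surprising: without dicPath, the lookup depends on the directory the servlet container was started from, not on solr.home.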

Edit Solr's schema.xml. I add three field types as follows:

    <fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Note: one dictionary instance is loaded for each distinct dictionary directory. The configuration above therefore creates two instances. With version 1.7.2, this may cause an out-of-memory error.
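To check how many dictionary instances a schema would load, one can count the distinct dicPath values. The following Python sketch does this for a trimmed copy of the three field types from this article; the function name distinct_dic_paths and the wrapping types element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the three field types from this article, wrapped in a
# root element so the fragment parses as a single XML document.
SCHEMA_FRAGMENT = """
<types>
  <fieldType name="textComplex" class="solr.TextField">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                 mode="complex" dicPath="dic"/>
    </analyzer>
  </fieldType>
  <fieldType name="textMaxWord" class="solr.TextField">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                 mode="max-word" dicPath="dic"/>
    </analyzer>
  </fieldType>
  <fieldType name="textSimple" class="solr.TextField">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                 mode="simple"
                 dicPath="n:/OpenSource/apache-solr-1.3.0/example/solr/my_dic"/>
    </analyzer>
  </fieldType>
</types>
"""


def distinct_dic_paths(schema_xml):
    """Collect the distinct dicPath values of all mmseg4j tokenizers."""
    root = ET.fromstring(schema_xml)
    return {
        tok.get("dicPath")
        for tok in root.iter("tokenizer")
        if tok.get("class", "").endswith("MMSegTokenizerFactory")
    }


paths = distinct_dic_paths(SCHEMA_FRAGMENT)
print(len(paths))  # 2: textComplex and textMaxWord share the "dic" directory
```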

Define several fields:

    <field name="simple" type="textSimple" indexed="true" stored="true"/>
    <field name="complex" type="textComplex" indexed="true" stored="true"/>
    <field name="text" type="textMaxWord" indexed="true" stored="true"/>

Then add the copyField entries (at the end):

    <copyField source="text" dest="simple"/>
    <copyField source="text" dest="complex"/>

mmseg4j is now configured in Solr. Next, install Solr in Tomcat.

Solr 1.3 was released a while ago, so I use it for this example. Download solr-1.3.0 and extract it, for example to n:/OpenSource/apache-solr-1.3.0. For details on installing Solr in Tomcat, see: Introduction to Solr usage, or search forum posts, for example for solr install, solr tomcat, or solr on tomcat.

I install it using the context file TOMCAT_HOME/conf/Catalina/localhost/solr.xml, pointing to n:/OpenSource/apache-solr-1.3.0/example/solr. Tomcat 6 may not have this directory; if so, create it manually.

Start Tomcat and check the log for mmseg4j-related messages. You can then view mmseg4j's segmentation results at http://localhost:8080/solr/admin/analysis.jsp. In the Field drop-down, select name and enter complex in the input box. The segmentation result looks like this, for example:

Figure: mmseg4j analysis debugging in Solr.

Now that it runs, add some documents to try it out. Create mmseg4j-solr-demo-doc.xml under n:/OpenSource/apache-solr-1.3.0/example/exampledocs:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="text">Jinghua Times reported that yesterday, January 23, 2009, affected by a strong cold air mass from China and Siberia, the city experienced strong winds and cooling. The highest temperature during the day was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="text">Kim Jong Il arrived in Changchun yesterday to conduct a two-day inspection of the telephone system in Changchun.</field>
      </doc>
      <doc>
        <field name="id">3</field>
        <field name="text">Professor Chen is studying the origin of life. His graduate students are playing.</field>
      </doc>
      <doc>
        <field name="id">4</field>
        <field name="text">The People's Bank of China is the central bank of the People's Republic of China.</field>
      </doc>
    </add>

Then submit it to Solr by running post.jar from cmd, as shown below:

    N:\OpenSource\apache-solr-1.3.0\example\exampledocs>java -Durl=http://localhost:8080/solr/update -Dcommit=yes -jar post.jar mmseg4j-solr-demo-doc.xml
    SimplePostTool: version 1.2
    SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
    SimplePostTool: POSTing files to http://localhost:8080/solr/update..
    SimplePostTool: POSTing file mmseg4j-solr-demo-doc.xml
    SimplePostTool: COMMITting Solr index changes..

NOTE: mmseg4j-solr-demo-doc.xml must be encoded in UTF-8; otherwise the text will be garbled after submission.
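One way to avoid encoding problems is to generate the file from a script that writes UTF-8 bytes explicitly instead of relying on an editor's default encoding. The following Python sketch builds a Solr add message with the standard library; the function name build_add_xml and the sample document are assumptions for illustration, not part of the original setup.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add>
    message and return it as UTF-8 encoded bytes (hypothetical helper)."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = value
    # encoding="utf-8" makes tostring() return bytes rather than str.
    return ET.tostring(add, encoding="utf-8")


# Sample document (an assumption); non-ASCII text survives as UTF-8 bytes.
payload = build_add_xml([{"id": "1", "text": "西伯利亚"}])

# Writing in binary mode keeps the bytes exactly as encoded above:
# with open("mmseg4j-solr-demo-doc.xml", "wb") as f:
#     f.write(payload)
```

Opening the file in binary mode matters on Windows, where text mode would otherwise apply the system code page.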

Check whether the data is there: http://localhost:8080/solr/select/?q=*:*. If documents are returned, the import worked.

Then search for "Siberia" (西伯利亚, which is what the URL-encoded q parameter in the queries below decodes to).
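The q parameter in the query URLs in this section is the search term percent-encoded as UTF-8, and the hl.fl value encodes the commas in simple,complex,text as %2C. A minimal Python sketch of how such a URL can be assembled with the standard library follows; the helper name select_url is hypothetical, and the base URL is the one used throughout this article.

```python
from urllib.parse import quote, unquote

BASE = "http://localhost:8080/solr/select"


def select_url(field, term):
    """Build a highlighting query URL for one field (hypothetical helper)."""
    return (
        f"{BASE}?indent=on&q={field}:{quote(term)}"
        f"&hl=on&hl.fl={quote('simple,complex,text')}&fl=id"
    )


# Decoding the encoded term from the article's URLs recovers the query text.
term = unquote("%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A")
print(term)  # 西伯利亚

for field in ("simple", "complex", "text"):
    print(select_url(field, term))
```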

Simple: http://localhost:8080/solr/select?indent=on&q=simple:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns the following result:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="fl">id</str>
          <str name="indent">on</str>
          <str name="q">simple:Siberia</str>
          <str name="hl.fl">simple,complex,text</str>
          <str name="hl">on</str>
        </lst>
      </lst>
      <result name="response" numFound="0" start="0"/>
      <lst name="highlighting"/>
    </response>

Complex: http://localhost:8080/solr/select?indent=on&q=complex:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns the following result:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="fl">id</str>
          <str name="indent">on</str>
          <str name="q">complex:Siberia</str>
          <str name="hl.fl">simple,complex,text</str>
          <str name="hl">on</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">1</str>
        </doc>
      </result>
      <lst name="highlighting">
        <lst name="1">
          <arr name="complex">
            <str>Jinghua Times reported that yesterday, January 23, 2009, affected by a strong cold air mass from China and <em>Siberia</em>, the city experienced strong winds and cooling. The highest temperature during the day was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</str>
          </arr>
        </lst>
      </lst>
    </response>

Text (actually max-word): http://localhost:8080/solr/select?indent=on&q=text:%E8%A5%BF%E4%BC%AF%E5%88%A9%E4%BA%9A&hl=on&hl.fl=simple%2Ccomplex%2Ctext&fl=id returns the following result:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">15</int>
        <lst name="params">
          <str name="fl">id</str>
          <str name="indent">on</str>
          <str name="q">text:西伯利亚</str>
          <str name="hl.fl">simple,complex,text</str>
          <str name="hl">on</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">1</str>
        </doc>
      </result>
      <lst name="highlighting">
        <lst name="1">
          <arr name="text">
            <str>Jinghua Times reported that yesterday, affected by a strong cold air mass from <em>西伯</em><em>利亚</em>, the city experienced strong winds and cooling. The highest temperature during the day was only minus 7 degrees Celsius, accompanied by northerly winds of force 6 to 7.</str>
          </arr>
        </lst>
      </lst>
    </response>
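To compare the three fields without reading each response by eye, the hit count and the highlighted field names can be extracted from the response XML. The Python sketch below parses a trimmed response of the shape shown in this section; the function name summarize and the shortened sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed response in the shape shown above (XML declaration omitted).
SAMPLE_RESPONSE = """
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1">
      <arr name="complex">
        <str>cold air from <em>Siberia</em></str>
      </arr>
    </lst>
  </lst>
</response>
"""


def summarize(response_xml):
    """Return (numFound, matching ids, {doc id: [highlighted fields]})."""
    root = ET.fromstring(response_xml)
    result = root.find("result[@name='response']")
    num_found = int(result.get("numFound"))
    ids = [s.text for s in result.findall("doc/str[@name='id']")]
    highlighted = {
        per_doc.get("name"): [arr.get("name") for arr in per_doc.findall("arr")]
        for per_doc in root.findall("lst[@name='highlighting']/lst")
    }
    return num_found, ids, highlighted


print(summarize(SAMPLE_RESPONSE))  # (1, ['1'], {'1': ['complex']})
```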

The following describes NGramTokenizerFactory word segmentation settings.

NGramTokenizerFactory

 

 <fieldType name="text" class="solr.TextField"  positionIncrementGap="100">
      <analyzer type="index" >
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query"  >
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>    
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
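With minGramSize="2" and maxGramSize="2", the tokenizer emits overlapping two-character grams, so no dictionary is needed. The Python sketch below only illustrates that idea; it is not Lucene's implementation, the function name char_ngrams is hypothetical, and the order in which a real tokenizer emits grams of different sizes may differ.

```python
def char_ngrams(text, min_gram=2, max_gram=2):
    """Return all character n-grams of text with sizes in
    [min_gram, max_gram], ordered by start position (illustrative only)."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


# Lowercasing mirrors the LowerCaseFilterFactory in the field type above.
print(char_ngrams("西伯利亚"))       # ['西伯', '伯利', '利亚']
print(char_ngrams("Solr".lower()))  # ['so', 'ol', 'lr']
```

Because the index and query analyzers use the same settings, a query is split into the same bigrams as the indexed text, which is what lets the two sides match.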

 
