IK and pinyin analyzers in Elasticsearch

Source: Internet
Author: User

1. Applications of pinyin analysis

Pinyin-based search is very common in everyday life, and you probably use it every day. Open Taobao and type the pinyin "zhonghua": the suggestion list shows products whose names contain the matching Chinese characters (中华).

Pinyin analysis suggests the matching Chinese text for pinyin that the user types, which improves the search experience and makes searching faster. The following describes how to configure pinyin + IK analysis in Elasticsearch 5.1.1.

2. Downloading and installing the IK analyzer

The IK analyzer needs little introduction. In short, it is a very widely used Chinese analyzer with good segmentation quality, and most ES projects that need Chinese segmentation use it.

Download address: https://github.com/medcl/elasticsearch-analysis-ik
Stop Elasticsearch before installing the plugin and restart it once the installation is complete.
The IK version must match your ES version, as described in the README. I am using ES 5.1.1, so the IK version is also 5.1.1. (You may wonder why IK jumped from 1.x straight to 5.x. Elastic unified its version numbers: before 5.0, ES was at 2.x, Logstash at 2.x, Kibana at 4.x, and IK at 1.x, which was confusing. From 5.0 on the version numbers are aligned, so if you run ES 5.1.1, use version 5.1.1 of the other components as well.)

After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you do not have it). Run the command:

    mvn package

After a successful build, a target folder is generated. In the elasticsearch-analysis-ik-master/target/releases directory, find elasticsearch-analysis-ik-5.1.1.zip; this is the installation file we need. Extract elasticsearch-analysis-ik-5.1.1.zip to get the following content:

commons-codec-1.9.jar
commons-logging-1.2.jar
config
elasticsearch-analysis-ik-5.1.1.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
plugin-descriptor.properties
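
The version requirement described earlier can be checked before anything is copied. This is a minimal sketch, assuming the extracted plugin-descriptor.properties carries an `elasticsearch.version` entry. The sample descriptor text is illustrative rather than copied from the real file, and the function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def plugin_matches_es(descriptor_text, es_version):
    """Return True when the descriptor targets exactly this ES version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version


# Illustrative descriptor content (an assumption, not the real file).
SAMPLE_DESCRIPTOR = """\
# sample only
name=analysis-ik
version=5.1.1
elasticsearch.version=5.1.1
"""

print(plugin_matches_es(SAMPLE_DESCRIPTOR, "5.1.1"))
print(plugin_matches_es(SAMPLE_DESCRIPTOR, "5.2.0"))
```

In practice the descriptor text would be read from the extracted file rather than from a constant.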

Then create a new folder named ik under elasticsearch-5.1.1/plugins and copy the files extracted from elasticsearch-analysis-ik-5.1.1.zip into the elasticsearch-5.1.1/plugins/ik directory.
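
The copy step can also be scripted. The sketch below assumes the packaged zip holds its files at the top level, as the listing of extracted files suggests. `install_plugin` is a hypothetical helper and not part of Elasticsearch or the plugin.

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path, es_home, plugin_dir_name):
    """Extract a packaged plugin zip into <es_home>/plugins/<plugin_dir_name>.

    Returns the sorted names of the entries now present in the plugin folder.
    """
    target = Path(es_home) / "plugins" / plugin_dir_name
    # Create plugins/<name> if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())
```

A call such as `install_plugin("elasticsearch-analysis-ik-5.1.1.zip", "elasticsearch-5.1.1", "ik")` would correspond to the manual steps; Elasticsearch should be stopped while it runs.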

3. Downloading and installing the pinyin analyzer

Pinyin analyzer download address:
https://github.com/medcl/elasticsearch-analysis-pinyin

Installation follows the same steps as for IK: download, package, and add to ES. The steps are not repeated here.

4. Analyzer tests

After IK and pinyin are installed, restart ES. If ES reports errors during startup, something went wrong with the installation; a clean startup means the plugins were installed successfully.

4.1 IK analyzer test

Create an index:

curl -XPUT "http://localhost:9200/index"

Test the analyzer on the text 中华人民共和国国歌 ("national anthem of the People's Republic of China"):

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国国歌"

Segmentation result:

{"tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
    {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
    {"token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
    {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
    {"token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
    {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
    {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8},
    {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9}
]}
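
Text placed in a query string should be percent-encoded, especially spaces and non-ASCII characters. The sketch below builds the same `_analyze` URL with Python's standard library. `analyze_url` is a hypothetical helper, and the sample text is the Chinese phrase for "national anthem of the People's Republic of China".

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def analyze_url(index, analyzer, text, host="http://localhost:9200"):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{host}/{index}/_analyze?{query}"


url = analyze_url("index", "ik_max_word", "中华人民共和国国歌")
print(url)

# Decoding the query string gives back the original parameters.
print(parse_qs(urlsplit(url).query))
```

The resulting URL contains only ASCII characters, so it can be pasted into a curl command without depending on the shell's encoding.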

Using ik_smart:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中华人民共和国国歌"

Segmentation result:

{
    "tokens": [{
        "token": "中华人民共和国",
        "start_offset": 0,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "国歌",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 1
    }]
}
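
The offsets in the result are character positions in the input text. The sketch below checks this, assuming the analyzed text was the Chinese phrase 中华人民共和国国歌, which is what the offsets 0-7 and 7-9 imply. The response is hard-coded for illustration, and `token_spans` is a hypothetical helper.

```python
import json

TEXT = "中华人民共和国国歌"

# ik_smart response for TEXT, hard-coded here instead of fetched from ES.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1}
]}
""")


def token_spans(response):
    """Return (token, start_offset, end_offset) triples from an _analyze response."""
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in response["tokens"]]


for token, start, end in token_spans(RESPONSE):
    # Offsets are character positions, so slicing the input recovers the token.
    print(token, TEXT[start:end] == token)
```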

4.2 Pinyin analyzer test

Test the pinyin analyzer on the name 张学友 (Jacky Cheung):

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=张学友"

Segmentation result:

{
    "tokens": [{
        "token": "zhang",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 0
    }, {
        "token": "xue",
        "start_offset": 1,
        "end_offset": 2,
        "type": "word",
        "position": 1
    }, {
        "token": "you",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 2
    }, {
        "token": "zxy",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 3
    }]
}
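
The result has one full-pinyin token per character plus one token made of the first letters. The toy sketch below reproduces that shape for this single example. It is not the plugin's implementation: the three-entry lookup table and the function name are assumptions made for illustration.

```python
# Toy lookup table covering only the three characters of the example;
# the real plugin ships its own dictionary.
PINYIN = {"张": "zhang", "学": "xue", "友": "you"}


def toy_pinyin_tokens(text):
    """Return one full-pinyin token per character plus a first-letter token."""
    syllables = [PINYIN[ch] for ch in text]
    initials = "".join(s[0] for s in syllables)
    return syllables + [initials]


print(toy_pinyin_tokens("张学友"))  # ['zhang', 'xue', 'you', 'zxy']
```

The first-letter token is what lets an abbreviation such as "zxy" match the full name.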

5. IK + pinyin analyzer configuration

5.1 Create the index and analyzer settings

Create an index and configure its analysis settings:

curl -XPUT "http://localhost:9200/medcl/" -d '
{
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " "
                }
            }
        }
    }
}'
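
Hand-quoted JSON inside a shell string is easy to get wrong. As a sketch, the same settings can be built as a Python dict and serialized, with a small check that every filter the analyzer references is either defined in the settings or built in. Both function names are hypothetical, and treating `word_delimiter` as the only built-in filter is a simplification for this example.

```python
import json


def ik_pinyin_settings():
    """Build the index settings body as a dict instead of a hand-quoted string."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": ["my_pinyin", "word_delimiter"],
                    }
                },
                "filter": {
                    "my_pinyin": {
                        "type": "pinyin",
                        "first_letter": "prefix",
                        "padding_char": " ",
                    }
                },
            }
        }
    }


def undefined_filters(settings, builtin=("word_delimiter",)):
    """List filters an analyzer references that are neither defined nor built in."""
    analysis = settings["index"]["analysis"]
    defined = set(analysis.get("filter", {})) | set(builtin)
    missing = []
    for analyzer in analysis["analyzer"].values():
        missing += [f for f in analyzer.get("filter", []) if f not in defined]
    return missing


body = json.dumps(ik_pinyin_settings(), indent=2)
print(body)
print(undefined_filters(ik_pinyin_settings()))  # []
```

The printed body can be saved to a file and sent with curl's `-d @file` form.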

Create a type and set its mapping (the value of "boost" is missing in the source and must be supplied):

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d '
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_pinyin_analyzer",
                        "boost":
                    }
                }
            }
        }
    }
}'

5.2 Index test documents

Index two test documents.
Document 1 (刘德华, Andy Lau):

curl -XPOST http://localhost:9200/medcl/folks/andy -d '{"name": "刘德华"}'

Document 2 (中华人民共和国国歌, "national anthem of the People's Republic of China"):

curl -XPOST http://localhost:9200/medcl/folks/tina -d '{"name": "中华人民共和国国歌"}'

5.3 Test (1): pinyin analysis

Each of the following four commands matches "刘德华" (Andy Lau):

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"

5.4 Test (2): IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d '
{
  "query": {
    "match": {
      "name.pinyin": "国歌"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}'
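
The query and the highlight section must name the same field, here name.pinyin. As a sketch, a small helper can build both parts from one field name so they cannot drift apart. `match_with_highlight` is a hypothetical name, and the search term is the Chinese word for "national anthem".

```python
import json


def match_with_highlight(field, text):
    """Build a match query on `field` that also highlights the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


print(json.dumps(match_with_highlight("name.pinyin", "国歌"), ensure_ascii=False, indent=2))
```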

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 16.698704,
    "hits": [
      {
        "_index": "medcl",
        "_type": "folks",
        "_id": "tina",
        "_score": 16.698704,
        "_source": {
          "name": "中华人民共和国国歌"
        },
        "highlight": {
          "name.pinyin": [
            "<em>中华人民共和国</em><em>国歌</em>"
          ]
        }
      }
    ]
  }
}

This shows that the IK analyzer is working.

5.5 Test (3): pinyin + IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d '
{
  "query": {
    "match": {
      "name.pinyin": "zhonghua"
    }
  },
  "highlight": {
    "fields": {
      "name.pinyin": {}
    }
  }
}'

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 5.9814634,
    "hits": [
      {
        "_index": "medcl",
        "_type": "folks",
        "_id": "tina",
        "_score": 5.9814634,
        "_source": {
          "name": "中华人民共和国国歌"
        },
        "highlight": {
          "name.pinyin": [
            "<em>中华人民共和国</em>国歌"
          ]
        }
      },
      {
        "_index": "medcl",
        "_type": "folks",
        "_id": "andy",
        "_score": 2.2534127,
        "_source": {
          "name": "刘德华"
        },
        "highlight": {
          "name.pinyin": [
            "<em>刘德华</em>"
          ]
        }
      }
    ]
  }
}

With the pinyin analyzer in place, searches must target the original field name plus the .pinyin suffix (name.pinyin); searching the original field itself returns no results.

6. References

https://github.com/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-pinyin
https://my.oschina.net/xiaohui249/blog/214505
