Using the IK analyzer with the Java API in Elasticsearch

Source: Internet
Author: User
First, word segmentation in Elasticsearch

Elasticsearch can index Chinese text, but its built-in analyzers split Chinese one character at a time. Taking the standard analyzer as an example, you can check how a piece of text is segmented with a request like this:

http://localhost:9200/iktest/_analyze?pretty&analyzer=standard&text=中华人民共和国

The example above segments 中华人民共和国 (People's Republic of China) with the standard analyzer. The result is as follows:

{
  "tokens": [
    {"token": "中", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "华", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "人", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "民", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "共", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
    {"token": "和", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "国", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6}
  ]
}

As the result shows, the standard analyzer splits the text into single characters. Searching on single characters returns a large number of loosely related hits, which hurts the quality of the results, so a smarter segmentation method is needed. An analyzer that works well with ES is the IK analyzer, which can be downloaded and used directly.
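
The same check can be scripted instead of typed into a browser. The following is a minimal sketch, assuming the local node at http://localhost:9200 and the iktest index used in this article; the class name AnalyzeUrl and its build method are made up for illustration and are not part of Elasticsearch. It only assembles the _analyze URL and percent-encodes the text as UTF-8, which matters for Chinese input. Sending the request is left to whatever HTTP client you prefer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of Elasticsearch): builds the _analyze URL
// used in this article for a given index, analyzer and text.
class AnalyzeUrl {

    static String build(String host, String index, String analyzer, String text) {
        try {
            // Percent-encode the text as UTF-8 so that Chinese characters
            // survive inside the query string.
            String encoded = URLEncoder.encode(text, "UTF-8");
            return host + "/" + index + "/_analyze?pretty&analyzer=" + analyzer
                    + "&text=" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The test phrase of this article (People's Republic of China in
        // Chinese), written as Unicode escapes to keep the source ASCII-only.
        String text = "\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd";
        System.out.println(build("http://localhost:9200", "iktest", "standard", text));
    }
}
```

Passing a different analyzer name as the third argument gives the URL for that analyzer on the same text.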

Second, installing the IK analyzer

IK is an analyzer commonly used together with ES. It can be downloaded from GitHub at the following address:

https://github.com/medcl/elasticsearch-analysis-ik/releases
Download the release that matches the ES version you are using:

IK version	ES version
master	2.1.1 -> master
1.7.0	2.1.1
1.6.1	2.1.0
1.5.0	2.0.0
1.4.1	1.7.2
1.4.0	1.6.0
1.3.0	1.5.0
1.2.9	1.4.0
1.2.8	1.3.2
1.2.7	1.2.1
1.2.6	1.0.0
1.2.5	0.90.x
1.1.3	0.20.x
1.1.2	0.19.x
1.0.0	0.16.2 -> 0.19.0
Judging from the table above, I personally believe that a higher version is compatible with lower versions.

After the download is complete, unpack the archive and build it with mvn package. This requires Maven; if you do not have it, look up how to install it.

When the build finishes, target/releases/elasticsearch-analysis-ik-{version}.zip is produced.

Copy the zip file to the plugins/ik directory under the ES directory and unzip it. After unzipping, edit the plugin-descriptor.properties file and change the Java version and the ES version number to the versions you are using. This completes the installation of the IK analyzer.

Third, checking the effect of the IK analyzer

Once the installation is complete, repeat the earlier check. IK offers two segmentation modes, named ik_smart and ik_max_word. ik_smart produces the coarsest segmentation, which is closer to everyday usage, while ik_max_word produces the finest, exhaustive segmentation, as described on the GitHub page. The test below uses the sentence 中华人民共和国 (People's Republic of China).

ik_smart segmentation result:

http://localhost:9200/iktest/_analyze?pretty&analyzer=ik_smart&text=中华人民共和国
{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0}
  ]
}
ik_smart keeps 中华人民共和国 as one complete word.

ik_max_word segmentation result:

http://localhost:9200/iktest/_analyze?pretty&analyzer=ik_max_word&text=中华人民共和国
{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
    {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
    {"token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
    {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
    {"token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
    {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
    {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8}
  ]
}
The result above shows that ik_max_word segments the text in much finer detail.

Fourth, usage and differences of the two segmentation modes

1. Different uses

When indexing, we want every sentence split as finely as possible so that more queries can match it, so ik_max_word is mostly used at index time. At search time, we usually want more precise results for the query the user typed. For example, when searching for 无花果 (fig), we would rather query it as a single word than split it into the three characters 无, 花 and 果 ("no", "flower", "fruit") and recall results for each of them. For that reason ik_smart is more often used to analyze the query input.

2. Difference in efficiency

ik_max_word is relatively fast, while ik_smart is less efficient than ik_max_word. (This is what I observed when I tried both analyzers while building an index; if it is wrong, corrections are welcome.)

Fifth, specifying the analyzer through the Java API

In real applications we usually need to specify the analyzer from within a program, whereas everything above only shows the results in a browser. So how do we specify the analyzer in Java code?

After some searching, I found three ways to specify the analyzer.

(1) Specifying it per field when building the mapping

When building the mapping, we can assign an analyzer to specific fields. The Java code is as follows:


import java.io.IOException;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

private XContentBuilder createIKMapping(String indexType) {
    XContentBuilder mapping = null;
    try {
        mapping = XContentFactory.jsonBuilder().startObject()
                // index type name (similar to a table in a database)
                .startObject(indexType).startObject("properties")
                // each field is indexed with "ik" and searched with "ik_smart"
                .startObject("product_name").field("type", "string")
                    .field("analyzer", "ik").field("search_analyzer", "ik_smart").endObject()
                .startObject("title_sub").field("type", "string")
                    .field("analyzer", "ik").field("search_analyzer", "ik_smart").endObject()
                .startObject("title_primary").field("type", "string")
                    .field("analyzer", "ik").field("search_analyzer", "ik_smart").endObject()
                .startObject("publisher").field("type", "string")
                    .field("analyzer", "ik").field("search_analyzer", "ik_smart").endObject()
                .startObject("author_name").field("type", "string")
                    .field("analyzer", "ik").field("search_analyzer", "ik_smart").endObject()
                // name
                //.startObject("name").field("type", "string").endObject()
                // location
                //.startObject("location").field("type", "geo_point").endObject()
                //.startObject("_all").field("analyzer", "ik").field("search_analyzer", "ik").endObject()
                .endObject().endObject().endObject();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return mapping;
}
That is, these fields are indexed with the ik analyzer (ik_max_word) and searched with ik_smart. The code above was tested successfully.
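
For reference, each field in such a mapping should serialize to a small JSON object that names one analyzer for indexing and one for searching. The sketch below builds that JSON with plain string formatting so the shape is easy to see. It does not use the Elasticsearch client, the class and method names (FieldMappingJson, field, mapping) and the type name "book" are invented for this illustration, and it assumes the names contain no characters that would need JSON escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows the JSON shape of a per-field analyzer mapping.
class FieldMappingJson {

    // JSON for one string field with separate index-time and search-time analyzers.
    static String field(String name, String indexAnalyzer, String searchAnalyzer) {
        return String.format(
                "\"%s\":{\"type\":\"string\",\"analyzer\":\"%s\",\"search_analyzer\":\"%s\"}",
                name, indexAnalyzer, searchAnalyzer);
    }

    // JSON for a whole type mapping whose fields all share the same two analyzers.
    static String mapping(String indexType, String indexAnalyzer,
                          String searchAnalyzer, String... fields) {
        List<String> parts = new ArrayList<>();
        for (String f : fields) {
            parts.add(field(f, indexAnalyzer, searchAnalyzer));
        }
        return "{\"" + indexType + "\":{\"properties\":{"
                + String.join(",", parts) + "}}}";
    }

    public static void main(String[] args) {
        // "book" is a made-up type name; the field names come from the article.
        System.out.println(mapping("book", "ik_max_word", "ik_smart",
                "product_name", "author_name"));
    }
}
```

Comparing this output with the mapping returned by the index is a quick way to confirm that both analyzers were applied to each field.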

(2) Specifying it for all fields

This method did not pass my test; I only know that it exists. If anyone gets it working, please let me know, thank you.

If you follow the IK example, the DSL statement is as follows:

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d '
{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'
That is, the analyzer is set on the _all field. Following this idea, I wrote the Java code below, but it turned out not to work. I hope someone more experienced can tell me why.

private XContentBuilder createIKMapping(String indexType) {
    XContentBuilder mapping = null;
    try {
        mapping = XContentFactory.jsonBuilder().startObject()
                // index type name (similar to a table in a database)
                .startObject(indexType)
                .startObject("properties").endObject()
                // set the analyzer on the _all field
                .startObject("_all").field("analyzer", "ik").field("search_analyzer", "ik").endObject()
                .endObject().endObject();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return mapping;
}
After testing, the mapping shows that the analyzer on the _all field is indeed set correctly, but searches clearly return wrong results, and I am not sure where the problem lies. I only note here that this method exists; if anyone gets it working, please let me know, thank you. (The code I wrote here is wrong. It is included only to record the idea, which may itself be mistaken, so please go easy on me.)

(3) Specifying it in the settings

According to what I have read, the analyzer can also be set directly in the settings, as shown in the picture:

I have not tested this method; I can only say that it should be feasible.
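
As a rough idea of what such a setting could look like, the sketch below builds index settings JSON that declares a default analyzer. This is an assumption, not a reconstruction of the author's picture: it relies on the index-level setting index.analysis.analyzer.default.type and on the plugin registering the analyzer type "ik" used in the Java code earlier in this article, and like the method itself it has not been verified here. The class name DefaultAnalyzerSettings is made up for the example.

```java
// Illustrative only: builds index settings JSON that declares a default
// analyzer. The analyzer type is a parameter because the right value
// depends on the installed plugin version (assumed "ik" here).
class DefaultAnalyzerSettings {

    static String json(String analyzerType) {
        return "{\"index\":{\"analysis\":{\"analyzer\":{\"default\":{\"type\":\""
                + analyzerType + "\"}}}}}";
    }

    public static void main(String[] args) {
        System.out.println(json("ik"));
    }
}
```

If this approach works on your version, the JSON would be supplied as the index settings when the index is created, so that fields without an explicit analyzer fall back to the one named here.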










