FSCrawler is a file-import plugin for Elasticsearch (ES). With only simple configuration it can import files from the local file system into ES for retrieval, and it supports a rich set of file formats (txt, pdf, html, word, etc.). Below is a detailed walk-through of how FSCrawler works and how to configure it.
First, basic usage of FSCrawler:
1. Download: wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.2/fscrawler-2.2.zip
2. Unpack: unzip fscrawler-2.2.zip. The resulting directory contains bin with two scripts and lib with all the jar packages.
3. Start: bin/fscrawler job_name. The job_name is chosen by you; the first time the job starts it creates a _settings.json file used to configure the crawled files and the ES connection, as follows:
- Edit this file: vim ~/.fscrawler/job_1/_settings.json and modify it as follows:
name is the job name, which is also the ES index name; url is the folder whose files should be imported; update_rate is how often FSCrawler rescans the folder; host is the IP address and port of the ES node; type is the ES document type. After saving the changes and running the job, FSCrawler imports the data.
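For reference, a minimal sketch of what ~/.fscrawler/job_1/_settings.json might look like for the fields just described (the folder path, host, and port here are placeholder assumptions, not values from this setup):

```json
{
  "name" : "job_1",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ],
    "type" : "doc"
  }
}
```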
- Import data: bin/fscrawler job_name (this starts a thread that rescans at the configured interval, so if we modify a file, ES will pick up the new data).
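As a hypothetical demonstration of the rescan behavior, dropping a new file into the crawled folder (assumed here to be /tmp/es, matching the url setting) is enough; FSCrawler indexes it on the next update_rate tick without restarting the job:

```shell
# Create the crawled folder and a sample file; FSCrawler will pick it up
# automatically on its next scheduled scan of the url directory.
mkdir -p /tmp/es
echo "I like tomato" > /tmp/es/sy.txt
cat /tmp/es/sy.txt
```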
Second, configuring the IK analyzer and synonym filtering for FSCrawler:
- After a job is initialized, the system generates three template files: doc.json, folder.json, and _settings.json. (The folders named 1, 2, and 5 correspond to ES major versions; since we run ES 5.x, we modify the files under the 5 folder.) These three files are used to create the index and its mappings.
- To configure the IK analyzer, first set up the analysis section in _default/5/_settings.json: delete the original contents of the file and add the following:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [ "by_tfr", "by_sfr" ],
          "char_filter": [ "by_cfr" ]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "by_tfr", "by_sfr" ],
          "char_filter": [ "by_cfr" ]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [ " " ]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": [ "｜ => |" ]
        }
      }
    }
  }
}
The synonym filter is created the same way as the custom analyzers described in earlier posts on this blog; filters you do not need can be deleted, keeping only the necessary parts. This defines two custom analyzers: by_smart and by_max_word.
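The by_sfr synonym filter reads its word list from analysis/synonyms.txt, resolved relative to the ES config directory. A sketch of creating such a file (the path and word pair are illustrative; each comma-separated line declares a group of synonyms):

```shell
# Create the synonyms file referenced by synonyms_path.
# In a real deployment this belongs at $ES_HOME/config/analysis/synonyms.txt.
mkdir -p analysis
cat > analysis/synonyms.txt <<'EOF'
tomato, tomatoes
EOF
cat analysis/synonyms.txt
```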
- Modify _default/5/doc.json: remove the analyzer setting ("analyzer": "xxx") from every field, because only one field needs analysis: content (the contents of the file). Then add the analyzer to the content node, as follows:
"content" : {
  "type" : "text",
  "analyzer" : "by_max_word"  # add this line
  ...
},
- With the configuration complete, start the job again in the same way: bin/fscrawler job_name
- Visit port 9100 (the elasticsearch-head UI): you can see that the index has been created.
- Synonym query: I configured "tomato" and "tomatoes" as synonyms, and placed a file containing both words in the /tmp/es folder. Query through port 9100 with the following statement:
{
  "query": {
    "match": { "content": "tomato" }
  },
  "highlight": {
    "pre_tags": [ "<tag1>", "<tag2>" ],
    "post_tags": [ "</tag1>", "</tag2>" ],
    "fields": { "content": {} }
  }
}
The results are as follows:
{
  "hits": [
    {
      "_index": "jb_8",
      "_type": "doc",
      "_id": "3a15a979b4684d8a5d86136257888d73",
      "_score": 0.49273878,
      "_source": {
        "content": "I like to eat tomato and egg noodles. And I like tomatoes, scrambled eggs, rice.",
        "meta": {
          "raw": {
            "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
            "Content-Encoding": "UTF-8",
            "Content-Type": "text/plain; charset=UTF-8"
          }
        },
        "file": {
          "extension": "txt",
          "content_type": "text/plain; charset=UTF-8",
          "last_modified": "2017-05-24T10:22:31",
          "indexing_date": "2017-05-25T14:08:10.881",
          "filesize": 55,
          "filename": "sy.txt",
          "url": "file:///tmp/es/sy.txt"
        },
        "path": {
          "encoded": "824b64ab42d4b63cda6e747e2b80e5",
          "root": "824b64ab42d4b63cda6e747e2b80e5",
          "virtual": "/",
          "real": "/tmp/es/sy.txt"
        }
      },
      "highlight": {
        "content": [
          "I like to eat <tag1>tomato</tag1> egg noodles. Also like <tag1>tomato</tag1> scrambled egg rice"
        ]
      }
    }
  ]
}
This completes the configuration of the IK analyzer with synonym filtering.
- I tested the txt and html formats; other formats work as well, but Chinese file names come out garbled.
Note:
Choose FSCrawler version 2.2; version 2.1 fails to connect to ES 5.3.1.
[Big Data] FSCrawler: importing files (txt, html, pdf, word, ...) into Elasticsearch 5.3.1 and configuring synonym filtering