[Big Data] FSCrawler: import files (txt, html, pdf, word, ...) into Elasticsearch 5.3.1 and configure synonym filtering


FSCrawler is a file-import tool for Elasticsearch (ES). With only simple configuration, it imports files from the local file system into ES for retrieval, and it supports a rich set of file formats (txt, pdf, html, word, and so on). Below is a detailed description of how FSCrawler works and how to configure it.

First, basic use of FSCrawler:

1. Download: wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.2/fscrawler-2.2.zip

2. Decompress: unzip fscrawler-2.2.zip. The resulting directory contains bin (two startup scripts) and lib (all the jar packages).

3. Start: bin/fscrawler job_name. You choose the job name yourself; the first time the job starts, it creates a _settings.json file used to configure the file system and ES connection information. As follows:

    • Edit this file: vim ~/.fscrawler/job_1/_settings.json and modify as follows:
    • name is the job name, which is also the name of the ES index; url is the folder whose files need to be imported; update_rate indicates how often to rescan; host is the IP address and port of the ES node to connect to; type is the ES document type. After saving the changes and running again, FSCrawler will import the data.

    • Import data (this starts a thread that rescans at the configured interval, so when we modify a file, ES also picks up the new data): bin/fscrawler job_name
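For reference, a minimal _settings.json for FSCrawler 2.2 might look like the sketch below; the folder path, host, and refresh interval here are illustrative assumptions, not values from the original setup:

```json
{
  "name" : "job_1",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ],
    "type" : "doc"
  }
}
```

The name field must match the index you later query, and url must point at an existing directory readable by the FSCrawler process.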
Second, configure the IK analyzer and synonym filtering in FSCrawler:
  • After initializing a job, the system generates three profile files: doc.json, folder.json, and _settings.json. (The folders named 1, 2, 5 correspond to ES major versions; since we are on 5.x, we modify the configuration files under the 5 folder.) These three files are used to create the index and mapping.
  • To configure the IK analyzer, first configure analysis in _default/5/_settings.json: delete the original analysis configuration and add the following:
  • {
        "settings": {
            "analysis": {
                "analyzer": {
                    "by_smart": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": ["by_tfr", "by_sfr"],
                        "char_filter": ["by_cfr"]
                    },
                    "by_max_word": {
                        "type": "custom",
                        "tokenizer": "ik_max_word",
                        "filter": ["by_tfr", "by_sfr"],
                        "char_filter": ["by_cfr"]
                    }
                },
                "filter": {
                    "by_tfr": {
                        "type": "stop",
                        "stopwords": [" "]
                    },
                    "by_sfr": {
                        "type": "synonym",
                        "synonyms_path": "analysis/synonyms.txt"
                    }
                },
                "char_filter": {
                    "by_cfr": {
                        "type": "mapping",
                        "mappings": ["| => |"]
                    }
                }
            }
        }
    }

    The synonym filter here is created the same way as in the custom-analyzer setup covered in earlier posts; filters can be deleted or kept as needed. Here we define two custom analyzers: by_smart and by_max_word.
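The synonyms file referenced by synonyms_path lives under the Elasticsearch config directory and uses the Solr synonym format: comma-separated words on one line are treated as synonyms. A hypothetical example (the entries below are illustrative, not the original file):

```text
# analysis/synonyms.txt — one synonym group per line
西红柿, 番茄
tomato, tomatoes
```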

  • Modify _default/5/doc.json: remove the analyzer setting ("analyzer": "xxx") from all fields; only one field actually needs word segmentation, namely content (the contents of the file), so add the analyzer to the content node. As follows:
  • "content" : {
        "type" : "text",
        "analyzer" : "by_max_word"    # add this line
        ...
    },
  • The configuration is complete; start the job again in the same way: bin/fscrawler job_name
  • Visit port 9100: you can see that the index has been created.
  • Synonym query: I configured "tomato" and "tomatoes" as synonyms, and placed a file containing both words in the /tmp/es folder. On port 9100, query with the following statement:
  • {
        "query": {
            "match": {
                "content": "tomato"
            }
        },
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {
                "content": {}
            }
        }
    }

    The results are as follows:

  • {
        "hits": [
            {
                "_index": "jb_8",
                "_type": "doc",
                "_id": "3a15a979b4684d8a5d86136257888d73",
                "_score": 0.49273878,
                "_source": {
                    "content": "I like to eat tomato and egg noodles. And I like tomatoes, scrambled eggs, rice.",
                    "meta": {
                        "raw": {
                            "x-parsed-by": "org.apache.tika.parser.DefaultParser",
                            "content-encoding": "UTF-8",
                            "content-type": "text/plain; charset=utf-8"
                        }
                    },
                    "file": {
                        "extension": "txt",
                        "content_type": "text/plain; charset=utf-8",
                        "last_modified": "2017-05-24T10:22:31",
                        "indexing_date": "2017-05-25T14:08:10.881",
                        "filesize": 55,
                        "filename": "sy.txt",
                        "url": "file:///tmp/es/sy.txt"
                    },
                    "path": {
                        "encoded": "824b64ab42d4b63cda6e747e2b80e5",
                        "root": "824b64ab42d4b63cda6e747e2b80e5",
                        "virtual": "/",
                        "real": "/tmp/es/sy.txt"
                    }
                },
                "highlight": {
                    "content": [
                        "I like to eat <tag1>tomato</tag1> egg noodles. Also like <tag1>tomato</tag1> scrambled egg rice"
                    ]
                }
            }
        ]
    }
  • The IK analyzer with synonym filtering is now fully configured.

  • Tested with txt and html formats; other formats work as well, but Chinese file names come out garbled.

Attention:

Choose FSCrawler version 2.2; version 2.1 fails to connect to Elasticsearch 5.3.1.

