FSCrawler is a file-import plugin for Elasticsearch (ES). With only simple configuration it can import files from the local file system into ES for retrieval, and it supports a rich set of file formats (txt, pdf, html, word, etc.). Below is a detailed walk-through of how FSCrawler works and how to configure it.
First, basic usage of FSCrawler:
1. Download: wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.2/fscrawler-2.2.zip
2. Unpack: unzip fscrawler-2.2.zip. The resulting directory contains bin with two scripts and lib with all the jar packages.
3. Start: bin/fscrawler job_name. The job_name is chosen by you; the first time the job starts it creates a _settings.json file used to configure the crawled files and the ES connection, as follows:
- Edit this file: vim ~/.fscrawler/job_1/_settings.json and modify it as follows:
name is the job name, which is also the ES index name; url is the folder whose files should be imported; update_rate is how often FSCrawler rescans the folder; host is the IP address and port of the ES node; type is the ES document type. After saving the changes and running the job, FSCrawler imports the data.
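For reference, a minimal sketch of what ~/.fscrawler/job_1/_settings.json might look like for the fields just described (the folder path, host, and port here are placeholder assumptions, not values from this setup):

```json
{
  "name" : "job_1",
  "fs" : {
    "url" : "/tmp/es",
    "update_rate" : "15m"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ],
    "type" : "doc"
  }
}
```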
- Import data: bin/fscrawler job_name (this starts a thread that rescans at the configured interval, so if we modify a file, ES will pick up the new data).
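As a hypothetical demonstration of the rescan behavior, dropping a new file into the crawled folder (assumed here to be /tmp/es, matching the url setting) is enough; FSCrawler indexes it on the next update_rate tick without restarting the job:

```shell
# Create the crawled folder and a sample file; FSCrawler will pick it up
# automatically on its next scheduled scan of the url directory.
mkdir -p /tmp/es
echo "I like tomato" > /tmp/es/sy.txt
cat /tmp/es/sy.txt
```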
Second, configuring the IK analyzer and synonym filtering for FSCrawler:
- After a job is initialized, the system generates three template files: doc.json, folder.json, and _settings.json. (The folders named 1, 2, and 5 correspond to ES major versions; since we run ES 5.x, we modify the files under the 5 folder.) These three files are used to create the index and its mappings.
- To configure the IK analyzer, first set up the analysis section in _default/5/_settings.json: delete the original contents of the file and add the following:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [ "by_tfr", "by_sfr" ],
          "char_filter": [ "by_cfr" ]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "by_tfr", "by_sfr" ],
          "char_filter": [ "by_cfr" ]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [ " " ]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": [ "｜ => |" ]
        }
      }
    }
  }
}
The synonym filter is created the same way as the custom analyzers described in earlier posts on this blog; filters you do not need can be deleted, keeping only the necessary parts. This defines two custom analyzers: by_smart and by_max_word.
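The by_sfr synonym filter reads its word list from analysis/synonyms.txt, resolved relative to the ES config directory. A sketch of creating such a file (the path and word pair are illustrative; each comma-separated line declares a group of synonyms):

```shell
# Create the synonyms file referenced by synonyms_path.
# In a real deployment this belongs at $ES_HOME/config/analysis/synonyms.txt.
mkdir -p analysis
cat > analysis/synonyms.txt <<'EOF'
tomato, tomatoes
EOF
cat analysis/synonyms.txt
```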
- Modify _default/5/doc.json: remove the analyzer setting ("analyzer": "xxx") from every field, because only one field needs analysis: content (the contents of the file). Then add the analyzer to the content node, as follows:
"content" : {
  "type" : "text",
  "analyzer" : "by_max_word"  # add this line
  ...
},
- With the configuration complete, start the job again in the same way: bin/fscrawler job_name
- Visit port 9100 (the elasticsearch-head UI): you can see that the index has been created.
- Synonym query: I configured "tomato" and "tomatoes" as synonyms, and placed a file containing both words in the /tmp/es folder. Query through port 9100 with the following statement:
{
  "query": {
    "match": { "content": "tomato" }
  },
  "highlight": {
    "pre_tags": [ "<tag1>", "<tag2>" ],
    "post_tags": [ "</tag1>", "</tag2>" ],
    "fields": { "content": {} }
  }
}
The results are as follows:
{
  "hits": [
    {
      "_index": "jb_8",
      "_type": "doc",
      "_id": "3a15a979b4684d8a5d86136257888d73",
      "_score": 0.49273878,
      "_source": {
        "content": "I like to eat tomato and egg noodles. And I like tomatoes, scrambled eggs, rice.",
        "meta": {
          "raw": {
            "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
            "Content-Encoding": "UTF-8",
            "Content-Type": "text/plain; charset=UTF-8"
          }
        },
        "file": {
          "extension": "txt",
          "content_type": "text/plain; charset=UTF-8",
          "last_modified": "2017-05-24T10:22:31",
          "indexing_date": "2017-05-25T14:08:10.881",
          "filesize": 55,
          "filename": "sy.txt",
          "url": "file:///tmp/es/sy.txt"
        },
        "path": {
          "encoded": "824b64ab42d4b63cda6e747e2b80e5",
          "root": "824b64ab42d4b63cda6e747e2b80e5",
          "virtual": "/",
          "real": "/tmp/es/sy.txt"
        }
      },
      "highlight": {
        "content": [
          "I like to eat <tag1>tomato</tag1> egg noodles. Also like <tag1>tomato</tag1> scrambled egg rice"
        ]
      }
    }
  ]
}
This completes the configuration of the IK analyzer with synonym filtering.
- I tested the txt and html formats; other formats work as well, but Chinese file names come out garbled.
Note:
Choose FSCrawler version 2.2; version 2.1 fails to connect to ES 5.3.1.
[Big Data] FSCrawler: importing files (txt, html, pdf, word, ...) into Elasticsearch 5.3.1 and configuring synonym filtering