Luke makes it easy to inspect a Lucene index, and therefore also the indexes of Solr and ES, which are both built on Lucene.
Before opening an index, check which Lucene version wrote it: an index created by a newer version of Lucene may not open in an older version of Luke.
As I recall, Luke could also repair an index by deleting the corrupted segments. Back up the index before using this feature, since the documents in the deleted segments are lost.
More notes on using Luke will be added later.
Tika is a text extraction tool that pulls content out of files such as Word, PDF, and Excel documents, so it can serve as a data source for ES. For images, it is enough to extract metadata such as the title and size; there is no need to record RGB color information.
Tika identifies a file's document type and encoding from its "magic number"; for example, every Java class file begins with the bytes CA FE BA BE. Standard document formats can be recognized from their leading bytes in the same way.
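The magic-number idea can be illustrated with a minimal Python sketch. This is not Tika's implementation: the signature table, the `sniff_type` name, and the type labels are assumptions chosen for illustration, and a real detector uses a much larger table plus other hints such as the file name.

```python
# Illustrative signature table (an assumption for this sketch, not Tika's own data).
MAGIC_SIGNATURES = [
    (b"\xCA\xFE\xBA\xBE", "application/java-vm"),        # Java class file
    (b"%PDF-", "application/pdf"),                       # PDF document
    (b"PK\x03\x04", "application/zip"),                  # ZIP container (.docx/.xlsx are ZIP-based)
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "ole2-compound-document"),  # legacy .doc/.xls container
    (b"\x89PNG\r\n\x1a\n", "image/png"),                 # PNG image
]


def sniff_type(header: bytes) -> str:
    """Return a type label for the leading bytes of a file."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    # No known signature matched: fall back to a generic binary type.
    return "application/octet-stream"


def sniff_file(path: str) -> str:
    """Read only the first 16 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_type(f.read(16))
```

Only the first few bytes are read, so the check stays cheap even for large files. Note that a ZIP signature alone cannot distinguish a .docx from an ordinary archive; telling them apart requires looking inside the container.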
Tika may produce garbled output when it processes Chinese text. As I recall, the documentation mentions that charset detection is probabilistic, so GB2312 content can be misidentified. This is worth a closer look when there is time.
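To show why a wrong charset guess yields garbled text, here is a small Python sketch that uses only the standard library. It does not call Tika; the sample string and the `decode_with` helper are assumptions made for illustration.

```python
def decode_with(raw: bytes, charset: str) -> str:
    """Decode bytes with the given charset, replacing undecodable bytes with U+FFFD."""
    return raw.decode(charset, errors="replace")


# Sample text (hypothetical), stored as GB2312 bytes: two bytes per character.
text = "中文索引"
raw = text.encode("gb2312")

# A correct charset guess round-trips the original text.
right = decode_with(raw, "gb2312")

# A wrong guess still "succeeds" but produces mojibake:
# Latin-1 maps every byte to some character, so no error is raised.
wrong_latin1 = decode_with(raw, "latin-1")

# Decoding as UTF-8 hits invalid byte sequences, which become U+FFFD.
wrong_utf8 = decode_with(raw, "utf-8")

print(right)
print(repr(wrong_latin1))
print(repr(wrong_utf8))
```

Because the wrong decodings raise no exception, a pipeline that feeds extracted text into ES would index the garbled strings silently. When the source encoding is known in advance, it seems safer to state it explicitly than to rely on detection.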
Lucene index viewer Luke and text extraction tool Tika