Using Tika and Luke to parse and index multiple file types (Word, PDF, TXT, etc.)

Source: Internet
Author: User

Tika is an Apache project, dating from 2008, that is used mainly to open documents of various types and extract their text. It can parse many file types (Word, PDF, TXT, HTML, and so on), and it can even parse a URL to extract the text of a web page. In this respect Tika is a bit like jsoup. In general, it is a mistake to build an index directly from a binary file such as a Word or PDF document: if you inspect the resulting index with Luke, you will see a large number of garbled terms. Instead, use Tika to extract the text first, and then build the index from that text.
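To make the "garbled terms" problem concrete, here is a minimal sketch that uses only the JDK. It is not Tika and not Lucene: the class name RawVersusExtracted and its methods are hypothetical, a Deflate-compressed byte array stands in for a binary document (PDF content streams and Word .docx files are commonly Deflate-compressed), and a simple regular-expression split stands in for an analyzer. It compares the terms you get from the raw bytes with the terms you get after the text has been extracted.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; a stand-in for "binary file vs. Tika output".
class RawVersusExtracted {

    /** Deflate-compresses the text, standing in for a binary document container. */
    static byte[] pack(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Recovers the text from the container; this is the role Tika plays for real files. */
    static String unpack(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid Deflate stream", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Very rough tokenizer: lower-case, then split on anything that is not a letter or digit. */
    static List<String> terms(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] packed = pack("Tika extracts text so Lucene can index it");
        // Indexing the file bytes directly: the terms are fragments of compressed data.
        System.out.println("raw:       " + terms(new String(packed, StandardCharsets.ISO_8859_1)));
        // Indexing after extraction: the terms are the words of the document.
        System.out.println("extracted: " + terms(unpack(packed)));
    }
}
```

Running main prints the two term lists side by side. The first is what an index built straight from the file would contain, and the second is what you get once an extraction step has run before indexing.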

package hhc;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class IndexUtil {

    /** Indexes every file in the example directory; if hasNew is true, the old index is cleared first. */
    public void index(boolean hasNew) throws IOException {
        File f = new File("E:\\lucene\\learn\\example_tika");
        Directory directory = FSDirectory.open(new File("E:\\lucene\\learn\\index_tika"));
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_35, new MMSegAnalyzer()));
        if (hasNew) {
            writer.deleteAll();
        }
        for (File file : f.listFiles()) {
            Document d = new Document();
            // Index the text extracted by Tika, not the raw file bytes
            d.add(new Field("Content", tikaParseFileToString(file), Field.Store.YES, Field.Index.ANALYZED));
            d.add(new Field("FileName", file.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            d.add(new Field("Path", file.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            d.add(new Field("Size", String.valueOf(file.length()), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(d);
        }
        writer.close();
    }

    /**
     * Creates a Tika object directly; simple, but not very efficient.
     *
     * @param file the file to parse
     * @return the extracted text
     * @throws IOException
     * @throws TikaException
     */
    public static String tikaAutoString(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        // tika.parse(stream, metadata); // variant that also sets metadata
        return tika.parseToString(file);
    }

    /**
     * This method is more efficient.
     *
     * @param file the file to parse
     * @return the extracted text, or "" if parsing fails
     */
    public static String tikaParseFileToString(File file) {
        // Automatically picks the most suitable parser
        Parser parser = new AutoDetectParser();
        InputStream stream = null;
        try {
            stream = new FileInputStream(file);
            // All of the parsed content is collected in this handler
            ContentHandler handler = new BodyContentHandler();
            // Register the parser in the parse context
            ParseContext context = new ParseContext();
            context.set(Parser.class, parser);
            // Metadata for the document; the file name is set before parsing
            // so that it can serve as a hint for type detection
            Metadata data = new Metadata();
            data.set(Metadata.RESOURCE_NAME_KEY, file.getName());
            parser.parse(stream, handler, data, context);
            // Metadata properties can also be set by hand (this overrides any parsed author)
            data.set(Metadata.AUTHOR, "Hu Hui");
            System.out.println(data.toString());
            for (String name : data.names()) {
                System.out.println(name);
            }
            return handler.toString();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
        } finally {
            if (stream != null) {
                try {
                    stream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        tikaParseFileToString(new File("E:\\lucene\\learn\\example_tika\\ailk_offer_ Hu Hui _2014-06-26.pdf"));
    }
}


Luke is a powerful tool for inspecting and querying Lucene indexes. Make sure the Luke version matches the Lucene version that created the index; otherwise Luke may be unable to open it. After you select the directory that holds the index, you can browse and modify the index contents, and on the Search tab you can run queries through the QueryParser to find the matching documents.
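What Luke displays is a view over Lucene's inverted index: a list of terms, each pointing to the documents that contain it. The toy sketch below, written in plain Java, mimics that idea so it is easier to picture what the term list and a single-term search show. The class ToyIndex and its methods are hypothetical and greatly simplified; they are not Lucene's or Luke's API, and there are no fields, analyzers, or scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of an inverted index; not Lucene's data structure.
class ToyIndex {
    // term -> ids of the documents that contain it, with terms kept in sorted order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it. */
    int add(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per term
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term (a single-term query). */
    List<Integer> search(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList() : Collections.unmodifiableList(docs);
    }

    /** Returns each term with the number of documents it occurs in, similar to a term listing. */
    Map<String, Integer> docFrequencies() {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            result.put(entry.getKey(), entry.getValue().size());
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("Tika parses PDF files");
        index.add("Luke inspects Lucene index files");
        System.out.println(index.docFrequencies());
        System.out.println(index.search("files"));
    }
}
```

If an index was built from raw binary files, the term listing is where the garbled terms show up, which is why browsing the terms in Luke is a quick way to check that text extraction worked.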

The following shows Luke version 4.10.2:



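The code above shows two ways to use Tika: creating a Tika object directly and asking it for the text, which is simple but less efficient, and using an AutoDetectParser with a content handler, which is more efficient.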

