Tika is an Apache project that dates from 2008 and is mainly used to open many kinds of documents and extract their text. It can parse files of many types (Word, PDF, TXT, HTML, and so on), and it can even fetch a web page from a URL and parse it. In every case the result is the plain text content, which makes Tika a bit like jsoup in this respect. In general, it is a mistake to build an index directly on the raw contents of a file such as a Word or PDF document: if you inspect such an index with the Luke tool, you will find a large number of messy, meaningless terms. Instead, use Tika to extract the text first, and then build the index on that text.
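To make the "messy terms" point concrete, here is a minimal, JDK-only sketch. It does not use Tika or Lucene, and all class and method names are hypothetical. It relies on three stand-ins: `Deflater` plays the role of a compressed document format (a .docx file, for example, is a ZIP archive of XML parts), `Inflater` plays the role of the extraction step that Tika performs in this article, and a whitespace split plays the role of an analyzer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: compares tokens from raw stored bytes with tokens
// from extracted text. Deflater/Inflater are stand-ins, not what Tika does internally.
class RawVersusExtracted {

    // Compress text, roughly as a container format compresses its content.
    public static byte[] compress(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stand-in for the extraction step: recover the text from the stored bytes.
    public static String extract(byte[] stored) {
        Inflater inflater = new Inflater();
        inflater.setInput(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input, stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("stored bytes are not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Naive whitespace tokenizer, standing in for a real analyzer.
    public static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String original = "Lucene indexes the text that Tika extracts from documents";
        byte[] stored = compress(original);

        // Treat the stored bytes as if they were text, as a naive indexer would.
        String raw = new String(stored, StandardCharsets.ISO_8859_1);

        System.out.println("raw tokens contain 'Tika': " + tokens(raw).contains("Tika"));
        System.out.println("extracted tokens contain 'Tika': "
                + tokens(extract(stored)).contains("Tika"));
    }
}
```

The words of the original sentence only become searchable tokens after the extraction step. Tokenizing the stored bytes directly yields the kind of noise that shows up as messy terms in Luke.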
package hhc;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class IndexUtil {

    public void index(boolean hasNew) throws IOException {
        File f = new File("E:\\lucene\\learn\\example_tika");
        Directory directory = FSDirectory.open(new File("E:\\lucene\\learn\\index_tika"));
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_35, new MMSegAnalyzer()));
        // Start from an empty index when a rebuild is requested
        if (hasNew) {
            writer.deleteAll();
        }
        for (File file : f.listFiles()) {
            Document d = new Document();
            // Index the text extracted by Tika, not the raw file bytes
            d.add(new Field("content", tikaParseFileToString(file), Field.Store.YES, Field.Index.ANALYZED));
            d.add(new Field("fileName", file.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            d.add(new Field("path", file.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            d.add(new Field("size", String.valueOf(file.length()), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(d);
        }
        writer.close();
    }

    /**
     * Creates a Tika object directly; simple, but not very efficient.
     *
     * @param file the file to parse
     * @return the extracted text
     * @throws IOException
     * @throws TikaException
     */
    public static String tikaAutoString(File file) throws IOException, TikaException {
        Tika tika = new Tika();
        // tika.parse(stream, metadata); // alternative call that also takes metadata
        return tika.parseToString(file);
    }

    /**
     * This method is more efficient.
     *
     * @param file the file to parse
     * @return the extracted text, or an empty string if parsing fails
     */
    public static String tikaParseFileToString(File file) {
        // Automatically pick the most suitable parser
        Parser parser = new AutoDetectParser();
        InputStream stream = null;
        try {
            stream = new FileInputStream(file);
            // All parsed content is collected in this handler
            ContentHandler handler = new BodyContentHandler();
            // Register the parser in the parse context
            ParseContext context = new ParseContext();
            context.set(Parser.class, parser);
            // Receives the document metadata
            Metadata data = new Metadata();
            parser.parse(stream, handler, data, context);
            // Metadata properties can also be set by hand
            data.set(Metadata.AUTHOR, "Hu Hui");
            data.set(Metadata.RESOURCE_NAME_KEY, file.getName());
            System.out.println(data.toString());
            for (String name : data.names()) {
                System.out.println(name);
            }
            return handler.toString();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
        } finally {
            if (stream != null) {
                try {
                    stream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        tikaParseFileToString(new File("E:\\lucene\\learn\\example_tika\\ailk_offer_Hu Hui_2014-06-26.pdf"));
    }
}
Luke is a very powerful tool for inspecting and querying an index. When you use it, make sure the Luke version matches the Lucene version that created the index; otherwise Luke may not be able to open the index. After selecting the directory that contains the index, you can browse and manipulate the index data, and on the Search tab you can run queries through QueryParser. You can also manage the index from Luke.
The following is Luke version 4.10.2:
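The IndexUtil code above shows two ways to use Tika: creating a Tika object and calling its parse-to-string method directly (tikaAutoString), and using an auto-detecting parser together with a content handler (tikaParseFileToString).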
Using Tika and Luke to parse and index multiple file types (Word, PDF, TXT, etc.)