Lucene reads Word, Excel, PDF

Source: Internet
Author: User

When I first started working with Lucene, the examples only created indexes for TXT documents, not for Word, Excel, or PDF documents. Reading the content of these documents requires additional jar packages. Fortunately, Apache, as an open-source organization, provides open-source jar packages for parsing them.

 

I will not cover indexing and querying again in this article. Only the methods for reading these three document types are pasted below.

 

1. First, let's look at Word documents:

POI is used here. The relevant jar packages can be downloaded from the Apache website (http://poi.apache.org/) and then added to the project (the same applies to the jar packages below, so I will not repeat it). poi.jar alone is not enough; you also need to import poi-scratchpad.jar.

 
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public static String readWord(String path) {
    StringBuffer content = new StringBuffer(""); // the document content
    try {
        HWPFDocument doc = new HWPFDocument(new FileInputStream(path));
        Range range = doc.getRange();
        int paragraphCount = range.numParagraphs(); // number of paragraphs
        for (int i = 0; i < paragraphCount; i++) { // traverse the paragraphs to read data
            Paragraph pp = range.getParagraph(i);
            content.append(pp.text());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content.toString().trim();
}

 

2. Next, Excel documents:

The jxl package (http://www.andykhan.com/jexcelapi/) is used here. jxl currently does not support Excel 2007 or later, but POI does. It shows how powerful open source is: Solr released version 3.1 in March this year and version 3.2 in May, which gives an idea of the pace of updates.

The following example uses the jxl package to read Excel 2003 files. If you are interested, you can look into reading Excel 2007 files with POI; it seems to require quite a few additional jar packages.

 

import java.io.FileInputStream;
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;

public static String readExcel(String path) throws Exception {
    FileInputStream fis = new FileInputStream(path);
    StringBuilder sb = new StringBuilder();
    Workbook rwb = Workbook.getWorkbook(fis);
    Sheet[] sheets = rwb.getSheets();
    for (int i = 0; i < sheets.length; i++) { // every sheet
        Sheet rs = rwb.getSheet(i);
        for (int j = 0; j < rs.getRows(); j++) { // every row
            Cell[] cells = rs.getRow(j);
            for (int k = 0; k < cells.length; k++) { // every cell
                sb.append(cells[k].getContents());
            }
        }
    }
    fis.close();
    return sb.toString();
}
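For .xlsx files, which jxl cannot open, POI is the route mentioned above. As a rough alternative that needs no extra jars, the sketch below relies on the fact that an .xlsx file is a ZIP archive of XML parts and pulls the cell text out of `xl/sharedStrings.xml` using only JDK classes. This is a swapped-in technique, not POI and not what the original example does. The class and method names are hypothetical, and numbers, dates, and inline strings stored in the sheet parts are deliberately ignored.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the shared strings of an .xlsx workbook.
class XlsxSharedStrings {

    static String readSharedStrings(String path) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (ZipFile zip = new ZipFile(path)) {
            // Text cells normally point into this shared string table.
            ZipEntry entry = zip.getEntry("xl/sharedStrings.xml");
            if (entry == null) {
                return ""; // no shared string table in this workbook
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                // Every <t> element holds a piece of text, whatever its namespace prefix.
                NodeList texts = doc.getElementsByTagNameNS("*", "t");
                for (int i = 0; i < texts.getLength(); i++) {
                    sb.append(texts.item(i).getTextContent()).append(' ');
                }
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of an .xlsx file as the first argument.
        System.out.println(readSharedStrings(args[0]));
    }
}
```

For indexing purposes this is often enough to get the words of a spreadsheet into Lucene, but it loses the cell and sheet structure, so POI remains the better choice when that matters.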

3. Finally, PDF documents:

PDFBox is used here. The relevant jar packages can be downloaded from the Apache website: http://pdfbox.apache.org/download.html

Note that importing only pdfbox.jar also results in an error; you also need to import commons-logging.jar and fontbox.jar.

 
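import java.io.FileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package location in PDFBox 1.x

public static String readPdf(String path) throws Exception {
    StringBuffer content = new StringBuffer(""); // file content
    FileInputStream fis = new FileInputStream(path);
    PDDocument doc = null;
    try {
        PDFParser p = new PDFParser(fis);
        p.parse();
        doc = p.getPDDocument();
        PDFTextStripper ts = new PDFTextStripper();
        content.append(ts.getText(doc));
    } finally {
        if (doc != null) {
            doc.close(); // release the parsed document as well as the stream
        }
        fis.close();
    }
    return content.toString().trim();
}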

If an exception like the following is thrown when extracting a PDF document: java.lang.Throwable: Warning: You did not close the PDF document, see the following information:

1. http://lqw.iteye.com/blog/721568

2. http://blog.csdn.net/rxr1st/article/details/2204460
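As far as I can tell, the warning above means that the parsed document object was never closed after the text was extracted. The sketch below shows the usual try/finally pattern for that. `FakeDocument` is a hypothetical stand-in written for this example, not a PDFBox class, so the code runs with the JDK alone. In real code the parsed PDF document would take its place.

```java
import java.util.ArrayList;
import java.util.List;

class CloseSketch {

    // Hypothetical stand-in for a parsed document that holds resources.
    static class FakeDocument {
        private final List<String> log;

        FakeDocument(List<String> log) {
            this.log = log;
            log.add("open");
        }

        String text(boolean fail) {
            if (fail) {
                throw new IllegalStateException("extraction failed");
            }
            log.add("read");
            return " hello ";
        }

        void close() {
            log.add("close");
        }
    }

    // Extracts the text and always closes the document afterwards.
    static String extract(List<String> log, boolean fail) {
        FakeDocument doc = new FakeDocument(log);
        try {
            return doc.text(fail).trim();
        } finally {
            doc.close(); // runs on success and on failure
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        System.out.println(extract(log, false));
        System.out.println(log);
    }
}
```

The point of the finally block is that the close call is reached even when extraction throws, which is exactly the case that is easy to miss when the close sits at the end of the method.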

 

On the Solr official website, you can see:

Rich document parsing and indexing (PDF, Word, HTML, etc.) using Apache Tika

Tika appears to bundle parsing libraries such as POI and PDFBox. Next, I will look at how to parse PDF files in Solr; I expect it depends on the configuration file.
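The appeal of something like Tika is a single entry point for many formats. As a very rough, hand-rolled illustration of that idea, the sketch below routes a file to a reader based on its extension. It is not Tika's API and not how Tika detects file types. `ExtractorRegistry`, `register`, and `extract` are hypothetical names, and the lambdas in `main` are placeholders for the three reader methods shown earlier.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry that maps a file extension to a text reader.
class ExtractorRegistry {

    private final Map<String, Function<String, String>> readers = new HashMap<>();

    void register(String extension, Function<String, String> reader) {
        readers.put(extension.toLowerCase(Locale.ROOT), reader);
    }

    // Returns the lower-cased extension of the file name, or "" if there is none.
    static String extensionOf(String path) {
        int dot = path.lastIndexOf('.');
        int sep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return dot > sep ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    String extract(String path) {
        Function<String, String> reader = readers.get(extensionOf(path));
        if (reader == null) {
            throw new IllegalArgumentException("No reader registered for: " + path);
        }
        return reader.apply(path);
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Placeholders: real code would call the Word, Excel, and PDF readers here.
        registry.register("doc", p -> "word text from " + p);
        registry.register("xls", p -> "excel text from " + p);
        registry.register("pdf", p -> "pdf text from " + p);
        System.out.println(registry.extract("report.PDF"));
    }
}
```

Readers that declare `throws Exception` cannot be passed as a `Function` directly, so they would need a small wrapper that catches or rethrows the checked exception. The string returned by `extract` is what would then be handed to the Lucene indexing code.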


 
