Parse PDF files using PDFBox

Source: Internet
Author: User

Today we are going to add PDF processing to the Nutch source code. The first step is to extract the text from a PDF document. After some consideration, we decided to stay with PDFBox. The parse-tika plug-in in the Nutch source code already bundles PDFBox, but it is version 1.1.0, which fails on many PDF documents. The latest version on the official website is now 1.6.0, so I wanted to replace it. Because I do not like reading English documentation, this cost me more time than it should have.

At first I only downloaded pdfbox-1.6.0.jar and replaced the old jar with it, and the program reported an error. Only then did I read the official documentation carefully. The Dependencies section on the PDFBox website (http://pdfbox.apache.org/) clearly lists the components PDFBox requires and how they relate to each other. PDFBox has three main components: pdfbox-1.6.0.jar, as above, plus fontbox-1.6.0.jar and jempbox-1.6.0.jar. It also needs the commons-logging component for logging. Nutch already ships the logging component as commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. If you use PDFBox in your own application, you need all five jars listed above (the logging component accounts for two of them).

For convenience, the official website also provides an all-in-one jar, pdfbox-app-1.6.0.jar. If you use this jar, you do not need any of the others.
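A missing jar usually only shows up as an error at run time, so it can help to check the classpath up front. The following is a minimal sketch, not part of Nutch or PDFBox: the class PdfBoxClasspathCheck and its missing method are hypothetical names, and the four class names are assumed to be representative of the pdfbox, fontbox, jempbox and commons-logging jars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PdfBoxClasspathCheck {

    // One representative class per required component (assumed names).
    static final List<String> REQUIRED = Arrays.asList(
            "org.apache.pdfbox.pdmodel.PDDocument",    // pdfbox
            "org.apache.fontbox.ttf.TTFParser",        // fontbox
            "org.apache.jempbox.xmp.XMPMetadata",      // jempbox
            "org.apache.commons.logging.LogFactory");  // commons-logging

    /** Returns the class names that cannot be loaded from the current classpath. */
    static List<String> missing(List<String> classNames) {
        List<String> notFound = new ArrayList<String>();
        ClassLoader loader = PdfBoxClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing(REQUIRED);
        if (notFound.isEmpty()) {
            System.out.println("All PDFBox components found.");
        } else {
            System.out.println("Missing from classpath: " + notFound);
        }
    }
}
```

Run it with the same classpath as the application. With the all-in-one pdfbox-app jar, the PDFBox, FontBox and JempBox classes should all come from that single file.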

Use PDFBox to process PDF documents

OK. Once everything is in place, we can extract the text. The code for this is fairly simple, and there are many examples on the Internet. For example:

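    // Requires org.apache.pdfbox.pdmodel.PDDocument and org.apache.pdfbox.util.PDFTextStripper (PDFBox 1.x)
    PDDocument doc = PDDocument.load("D:/331.20");
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(doc);
    // The title is read from the document information (metadata), not from the stripper
    String title = doc.getDocumentInformation().getTitle();
    doc.close();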

This reads a PDF file from the local machine. If the file comes from the network, first obtain an InputStream for it (assume it is called stream). The code is then:

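    // Requires org.apache.pdfbox.pdfparser.PDFParser in addition to PDDocument and PDFTextStripper
    PDFParser parser = new PDFParser(stream);
    parser.parse();
    PDDocument doc = parser.getPDDocument();
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(doc);
    // The title is read from the document information (metadata), not from the stripper
    String title = doc.getDocumentInformation().getTitle();
    doc.close();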
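A stream fetched from the network is not guaranteed to be a PDF at all (it may be an HTML error page, for example). One cheap precaution, sketched below under assumptions, is to look for the "%PDF-" signature before parsing. The class PdfHeaderCheck and the method looksLikePdf are hypothetical names, and the check is deliberately strict: it rejects files that have stray bytes before the header, which some PDF readers tolerate.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfHeaderCheck {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the stream starts with "%PDF-". The stream must support
     * mark/reset; it is rewound afterwards so it can still be parsed from the start.
     */
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == MAGIC.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a network stream; wrap real streams in BufferedInputStream too
        InputStream stream = new BufferedInputStream(
                new ByteArrayInputStream("%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(looksLikePdf(stream));
    }
}
```

Wrapping the network stream in a BufferedInputStream gives it mark/reset support, so the same stream can be passed on to the parser after the check.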

But note the following:

(1) PDFBox cannot extract text from PDF files in some formats, although it handles most of them.

(2) PDFBox also tries to provide more information, such as the title and summary. But do not rely on these too much. Only well-formed PDF documents (such as papers) yield usable values; for the rest, the values are either null or incorrect.
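Because the title can be null or unusable, an indexer usually needs some fallback. The sketch below shows one possible policy, which is my own assumption rather than anything PDFBox or Nutch prescribes: use the extracted title when it is non-blank, otherwise take the first non-blank line of the extracted text. TitleFallback and chooseTitle are hypothetical names.

```java
final class TitleFallback {

    /**
     * Returns the extracted title if it is usable; otherwise the first
     * non-blank line of the extracted text, or "" if there is none.
     */
    static String chooseTitle(String extractedTitle, String text) {
        if (extractedTitle != null && !extractedTitle.trim().isEmpty()) {
            return extractedTitle.trim();
        }
        if (text != null) {
            for (String line : text.split("\\r?\\n")) {
                String candidate = line.trim();
                if (!candidate.isEmpty()) {
                    return candidate;
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // No title available, so the first non-blank text line is used
        System.out.println(chooseTitle(null, "\n  Annual Report 2011 \nBody text"));
        // prints "Annual Report 2011"
    }
}
```

This does not catch titles that are present but wrong; detecting those would need document-specific heuristics.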

PDFBox also has many other features, such as decoding. If you need them, study the API.

