About the PDF framework used by the Xsupermes project

Source: Internet
Author: User

In a previous project I used PDFBox. When reading Chinese documents it extracted most of the text correctly, but numbers, page numbering, and similar elements inevitably came out garbled. I searched online for a solution and found this claim:

"PDFBox looks very handy and its API is powerful. It can even be combined seamlessly with Lucene. But it has a fatal weakness: it does not support Chinese. To extract Chinese text, you can use another very good tool, xpdf."

So I decided to compare the two approaches to processing Chinese PDF documents in terms of running time, extraction quality, and other aspects.

I. About Xpdf and PDFBox

1. xpdf
xpdf is a standalone program: from Java you invoke it on the command line and capture its output. It is therefore simple to use but quite limited: it is not cross-platform, it cannot handle specific structures such as tables, and it cannot process images or other attachments. Calling an external program this way inevitably limits flexibility.
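
To make the "command line call from Java" concrete, here is a minimal sketch of how such a call could look. It is not the code used in the project. It assumes that xpdf's pdftotext executable is installed and reachable on the PATH; the class and method names (XpdfTextExtractor, buildCommand, run) are made up for illustration. Chinese documents may also require xpdf's language support packages to be configured, which is not shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs xpdf's pdftotext as an external process.
public class XpdfTextExtractor {

    // Builds the pdftotext command line.
    // "-enc UTF-8" sets the output encoding; the trailing "-" writes the text to stdout.
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", pdfPath, "-");
    }

    // Starts the command, collects everything it writes to stdout,
    // and fails if the process exits with a non-zero code.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the tool's error messages go straight to our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("command exited with code " + exitCode);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java XpdfTextExtractor <file.pdf>");
            return;
        }
        // Assumption: "pdftotext" resolves via the PATH on this machine.
        String text = run(buildCommand("pdftotext", args[0]));
        System.out.println(text);
    }
}
```

The dependence on an installed native executable is exactly the portability limit described above: the Java side is trivial, but the program only works where a matching xpdf binary is available.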

2. PDFBox
PDFBox is an open source, pure Java class library for developers who need to read and create PDF documents. It was originally released under a BSD license and is now an Apache project under the Apache License 2.0.


II. Test results and summary

1. TXT file size
A 74 KB PDF document was converted to TXT, and the output was significantly smaller in both cases: xpdf produced a 10 KB file and PDFBox produced a 12 KB file.


2. Time Performance
Judging from the test runs, xpdf is significantly faster than PDFBox: it takes roughly one tenth of the time.
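
The measurement method is not described here, so the following is only a sketch of one way such a comparison could be timed in Java. The class and method names (ExtractionTimer, averageMillis, timeRatio) are hypothetical, and the two tasks in main are placeholders that merely sleep; in a real test they would be replaced by the actual xpdf and PDFBox extraction calls.

```java
// Hypothetical timing harness for comparing two extraction approaches.
public class ExtractionTimer {

    // Runs the task `runs` times and returns the average wall-clock time in milliseconds.
    static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }

    // Ratio of the first time to the second; a value near 0.1 means
    // the first approach needs about one tenth of the time.
    static double timeRatio(double firstMillis, double secondMillis) {
        return firstMillis / secondMillis;
    }

    // Placeholder workload: stands in for a real extraction call.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Assumption: these lambdas would wrap the real xpdf and PDFBox calls.
        Runnable xpdfTask = () -> pause(5);
        Runnable pdfboxTask = () -> pause(50);

        double xpdfMs = averageMillis(xpdfTask, 5);
        double pdfboxMs = averageMillis(pdfboxTask, 5);

        System.out.printf("xpdf:   %.1f ms%n", xpdfMs);
        System.out.printf("PDFBox: %.1f ms%n", pdfboxMs);
        System.out.printf("ratio:  %.2f%n", timeRatio(xpdfMs, pdfboxMs));
    }
}
```

Averaging over several runs matters for a fair comparison: the first PDFBox call in a fresh JVM includes class loading and warm-up, while the xpdf call includes the cost of starting an external process every time.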

3. Analysis and summary

From the results above, xpdf outperforms PDFBox in both running time and output size. On the most important criterion, extraction quality, PDFBox automatically inserts extra formatting such as line breaks and spaces into some of the extracted text, which makes its output worse. Some PDF documents come out garbled with either tool, and the garbled portions are largely the same in both cases, so this appears to be a limitation the two tools share.

In short, xpdf is the better choice when a high degree of portability is not required.

Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.
