In a previous project I used PDFBox. When reading Chinese documents it extracts most of the text correctly, but digits, page numbers, and similar elements inevitably come out garbled. So I searched online for a solution and found this claim:
"PDFBox looks very handy and its API is powerful. It can even be combined seamlessly with Lucene. But it has a fatal weakness: it does not support Chinese. To extract Chinese text, you can use another very good tool, xpdf."
So I decided to compare the two tools on Chinese PDF documents in terms of running time, extraction quality, and other aspects.
I. About Xpdf and PDFBox
1. Xpdf
Xpdf is a standalone program: from Java, you invoke it on the command line and read its output. It is therefore simple to use but quite limited: it is not cross-platform, it cannot handle specific layouts (tables, etc.), and it cannot process images or other attachments. Calling it this way inevitably limits flexibility.
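The command line call described above can be sketched roughly as follows. This is a minimal sketch, not the code used in the project: the class name XpdfExtractor, its method names, and the pdftotextPath parameter are made up for illustration, and it assumes Xpdf's pdftotext tool is installed (Chinese text usually also needs Xpdf's Chinese language support files configured). It requires Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: wraps a call to Xpdf's pdftotext command line tool.
final class XpdfExtractor {

    // Builds the command line: "-enc UTF-8" selects the output encoding,
    // "-q" suppresses messages, followed by the input PDF and output text file.
    public static List<String> buildCommand(String pdftotextPath, Path pdf, Path txt) {
        return List.of(pdftotextPath, "-enc", "UTF-8", "-q", pdf.toString(), txt.toString());
    }

    // Runs pdftotext as an external process and returns the extracted text.
    public static String extract(String pdftotextPath, Path pdf)
            throws IOException, InterruptedException {
        Path txt = Files.createTempFile("xpdf-", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(pdftotextPath, pdf, txt))
                    .redirectErrorStream(true)
                    .start();
            // Drain the process output so the child cannot block on a full pipe.
            process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("pdftotext exited with code " + exitCode);
            }
            return Files.readString(txt, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }
}
```

A call such as XpdfExtractor.extract("pdftotext", Path.of("sample.pdf")) would then return the document text as a string. The dependence on an external executable is exactly what ties this approach to the platform it runs on.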
2. PDFBox
PDFBox (an open source project, originally released under the BSD license and now maintained by Apache under the Apache License 2.0) is a pure Java class library that lets developers read and create PDF documents.
II. Test Results and Summary
1. TXT File Size
A 74 KB PDF document shrinks considerably when converted to a TXT document: xpdf produced a 10 KB txt file, and PDFBox produced a 12 KB txt file.
2. Time Performance
Judging from the running results, xpdf is significantly faster than PDFBox, taking roughly 1/10 of the time.
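A timing comparison of this kind could be set up along the following lines. This is only a sketch under stated assumptions: the class ExtractionTimer and its method are hypothetical and are not the measurement code behind the figures above. It runs a task once as an untimed warm-up and then averages the wall-clock time over several runs.

```java
// Hypothetical timing helper for comparing two extraction approaches.
final class ExtractionTimer {

    // Runs the task once untimed (warm-up for class loading and JIT),
    // then `runs` more times, and returns the mean time per run in milliseconds.
    public static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        task.run(); // warm-up, not included in the measurement
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / runs;
    }
}
```

Each extraction method would be wrapped in a Runnable (catching its checked exceptions) and passed to averageMillis with the same input file, so that the two averages can be compared directly.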
3. Analysis and summary
From the above results, xpdf outperforms PDFBox in both running time and output size. On the most critical point, extraction quality, PDFBox automatically adds some formatting to the extracted text, such as carriage returns and spaces, which makes the result worse. For PDF documents in certain formats that come out garbled, the two tools produce largely the same garbled output, so this is probably a shortcoming shared by both.
This suggests that xpdf is the better choice when high portability is not required.
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
About the PDF framework used by the Xsupermes project