Java: convert a PDF to a string and fix the formatting

Source: Internet
Author: User
Tags: gettext

When I needed to convert a PDF into a string, I first tried Python's pdfminer and pdfminer3k, but I could not make sense of the extracted data, so I switched to Java.

The following is a PDF-to-string function written in Java with PDFBox. The main function is not posted; this is a static utility function that can be called directly. It requires the PDFBox library to be added to the project.

Search for PDFBox (for example on Baidu), download the jar from the official website, and put it in the project's lib directory.

The most important breakthrough is the post-processing step: the raw extracted string, whose line breaks are chaotic, is cleaned up into a string that is reasonably readable.

The effect is as follows:

[Screenshot: raw extracted text, before the line-break fix]

[Screenshot: extracted text, after the line-break fix]

The code is as follows:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String getText(String file) throws Exception {
    // Whether to sort the extracted text by position
    boolean sort = false;
    // Local path of the PDF
    String pdfFile = file;
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // PDF document held in memory
    PDDocument document = null;
    try {
        // Load the PDF from disk (PDDocument.load is the PDFBox 2.x API)
        document = PDDocument.load(new File(pdfFile));

        // Extract the text using PDFTextStripper
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        String text = stripper.getText(document);

        // Fix the line breaks in three steps: replace every line break that has
        // whitespace directly before or after it with a placeholder, delete all
        // remaining line breaks, then turn the placeholder back into a line break.
        // The reasoning: the extracted string has far too many \r\n sequences in the
        // middle of lines. If a line break has non-whitespace text on both sides,
        // it was most likely forced by the PDF's line wrapping and can be removed.
        // The placeholder "jacck" is best replaced by something more complex, so that
        // it exists only during the conversion and never matches text in the document.
        text = text.replaceAll("\\r\\n\\s", "jacck");
        text = text.replaceAll("\\s\\r\\n", "jacck");
        // Remove the line breaks that were forcibly added by wrapping
        text = text.replaceAll("\\n|\\r", "");
        text = text.replaceAll("jacck", "\r\n");

        return text;
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (document != null) {
            document.close();
        }
    }
    return "";
}
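
The line-break repair can be tried on its own, without PDFBox, which makes it easier to see what each replacement does. The following is a minimal sketch of the same three steps applied to a plain string. The class name LineBreakFix, the method name fixLineBreaks, and the placeholder "@@PARA_BREAK@@" are hypothetical names chosen for illustration, not part of the original function.

```java
final class LineBreakFix {
    // Hypothetical placeholder, assumed not to occur in the extracted text.
    private static final String MARK = "@@PARA_BREAK@@";

    // Keeps line breaks that touch whitespace and removes all others.
    static String fixLineBreaks(String text) {
        // A line break followed by whitespace is treated as a real break.
        text = text.replaceAll("\\r\\n\\s", MARK);
        // A line break preceded by whitespace is treated as a real break too.
        text = text.replaceAll("\\s\\r\\n", MARK);
        // Every remaining \r or \n is assumed to come from PDF line wrapping.
        text = text.replaceAll("\\n|\\r", "");
        // Restore the real breaks.
        return text.replace(MARK, "\r\n");
    }

    public static void main(String[] args) {
        String raw = "first line was wra\r\npped by the PDF. \r\nSecond paragraph.";
        System.out.println(fixLineBreaks(raw));
    }
}
```

Note that wrapped lines are joined with nothing between them, and the whitespace character next to a kept line break is consumed. That suits text without spaces between words; for English text one would probably want to join wrapped lines with a space instead.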
