To convert a PDF into a string, I first tried Python's pdfminer and pdfminer3k, but I could not make sense of the data they produced, so I switched to Java.
Below is a PDF-to-string function written in Java with PDFBox. The main function is not posted; this is a static helper method that can be called directly. It depends on the PDFBox package:
search Baidu for PDFBox, download the jar from the official website, and put it in the lib directory.
The most important breakthrough is turning the chaotic string that extraction produces into one that is reasonably readable.
The effect is as follows:
Output format before the cleanup:
Output format after the cleanup:
The code is as follows:
import java.io.*;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String getText(String file) throws Exception {
    // Whether to sort the text by its position on the page
    boolean sort = false;
    // Local path of the PDF
    String pdfFile = file;
    // Encoding method
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // PDF document held in memory
    PDDocument document = null;
    try {
        // Load the PDF from the local path (PDDocument.load is the PDFBox 2.x API)
        document = PDDocument.load(new File(pdfFile));
        // Extract text using PDFTextStripper
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        String text = stripper.getText(document);
        // Replace every line break that has a whitespace character directly before or after it
        // with a placeholder, delete all remaining line breaks, then turn the placeholder back
        // into a line break.
        // The principle: the extracted string has far too many \r\n sequences in the middle of
        // lines. If a line break has non-blank text on both sides, it is most likely a wrap
        // forced by the PDF layout rather than a real end of paragraph.
        // The placeholder "jacck" is best replaced by something more complex: it only exists
        // during the intermediate conversion and must not match anything in the document.
        text = text.replaceAll("\\r\\n\\s", "jacck");
        text = text.replaceAll("\\s\\r\\n", "jacck");
        // Remove the line breaks that were forcibly added
        text = text.replaceAll("\\n|\\r", "");
        text = text.replaceAll("jacck", "\r\n");
        return text;
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (document != null) {
            document.close();
        }
    }
    return "";
}
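
The cleanup step can also be tried on its own, without PDFBox, on a hand-written string. The following is only a minimal sketch of the same four replacements: the class name LineBreakCleanup, the method name clean, and the placeholder @@PARA@@ are hypothetical names chosen for illustration, and the sample input is made up. Like the function above, it assumes line breaks are \r\n and joins wrapped lines with nothing in between.

```java
public class LineBreakCleanup {
    // Hypothetical placeholder (not from the original post); it should never occur in the document text.
    private static final String MARKER = "@@PARA@@";

    public static String clean(String text) {
        // 1. A line break followed by a whitespace character is kept as a real break.
        text = text.replaceAll("\\r\\n\\s", MARKER);
        // 2. A line break preceded by a whitespace character is kept as well.
        text = text.replaceAll("\\s\\r\\n", MARKER);
        // 3. Any remaining \r or \n sits between non-blank text: treat it as a forced wrap and drop it.
        text = text.replaceAll("\\r|\\n", "");
        // 4. Turn the placeholder back into a line break (literal replacement, no regex).
        return text.replace(MARKER, "\r\n");
    }

    public static void main(String[] args) {
        // Made-up sample: two forced wraps inside words, one real break after "wraps. "
        String raw = "PDF text extrac\r\ntion often wraps. \r\nSecond para\r\ngraph.";
        System.out.println(clean(raw));
    }
}
```

Note that the whitespace character next to a kept line break is consumed together with it, exactly as in the function above. If the text comes from a system where the line separator is a bare \n, the first two patterns would need to be adjusted accordingly.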
Java PDF to string and fix format