Parse PDF content using PDFBox

Source: Internet
Author: User

Relax first:

Interviewer: Which language are you familiar with?
Applicant: Java.
Interviewer: Do you know what a class is?
Applicant: I am a really hard-working person; I have no idea what that is.
Interviewer: Do you know what a package is?
Applicant: I don't need to carry a bag or pack anything.
Interviewer: Do you know what an interface is?
Applicant: I am a serious employee. I never make excuses to be lazy.
Interviewer: Do you know what inheritance is?
Applicant: I am an orphan and have nothing to inherit.
Interviewer: Do you know what an object is?
Applicant: Yes, but I work hard and am very motivated. I am not planning to look for a partner yet.
Interviewer: Do you know about polymorphism?
Applicant: Yes, but I am very conservative. I think it is immoral to make a woman you love have an abortion for your own pleasure!

Parse PDF content using PDFBox: In the following code, the getText method takes a String parameter that specifies the location of the PDF file to extract. This location can be a URL or a local file. The method then uses the PDFTextStripper class provided by PDFBox, sets some properties for the extraction (such as the start page and whether to sort the text by position), and finally extracts the text and writes it to a file.

public void getText(String file) throws Exception {
    // Whether to sort the text by position
    boolean sort = false;
    // PDF file name
    String pdfFile = file;
    // Output text file name
    String textFile = null;
    // Encoding
    String encoding = "UTF-8";
    // First page to extract
    int startPage = 1;
    // Last page to extract
    int endPage = Integer.MAX_VALUE;
    // Writer used to generate the text file
    Writer output = null;
    // PDF document held in memory
    PDDocument document = null;
    try {
        try {
            // First try to load the file as a URL; if that fails,
            // load it from the local file system instead
            URL url = new URL(pdfFile);
            document = PDDocument.load(url);
            // Get the PDF file name
            String fileName = url.getFile();

            // Name the generated .txt file after the original PDF
            if (fileName.length() > 4) {
                File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                textFile = outputFile.getName();
            }
        } catch (MalformedURLException e) {
            // The argument is not a valid URL, so load it from the file system
            document = PDDocument.load(pdfFile);
            if (pdfFile.length() > 4) {
                textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
            }
        }
        // Writer that writes to the file named by textFile
        output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
        // PDFTextStripper extracts the text
        PDFTextStripper stripper = new PDFTextStripper();
        // Set whether to sort by position
        stripper.setSortByPosition(sort);
        // Set the start page
        stripper.setStartPage(startPage);
        // Set the end page
        stripper.setEndPage(endPage);
        // Call PDFTextStripper's writeText to extract and output the text
        stripper.writeText(document, output);
    } finally {
        if (output != null) {
            // Close the output stream
            output.close();
        }
        if (document != null) {
            // Close the PDF document
            document.close();
        }
    }
}
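
The naming logic in the method above (treat the argument as a URL, fall back to a local path on MalformedURLException, then replace the last four characters with ".txt") can be tried on its own, without PDFBox. The following is only a minimal sketch using the JDK; the class name OutputNames and the method deriveTextFileName are hypothetical names introduced for illustration and do not appear in PDFBox or in the original code.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (not part of PDFBox): mirrors how the method above
// chooses the output .txt name for a URL or for a local path.
final class OutputNames {

    // Returns the derived text file name, or null when the name is too
    // short to carry a four-character extension such as ".pdf".
    static String deriveTextFileName(String location) {
        try {
            // Throws MalformedURLException when there is no known protocol
            URL url = new URL(location);
            String fileName = url.getFile();
            if (fileName.length() > 4) {
                String renamed = fileName.substring(0, fileName.length() - 4) + ".txt";
                // Keep only the last path segment, as the method above does
                return new File(renamed).getName();
            }
            return null;
        } catch (MalformedURLException e) {
            // Not a URL: treat the argument as a local path
            if (location.length() > 4) {
                return location.substring(0, location.length() - 4) + ".txt";
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder host
        System.out.println(deriveTextFileName("http://example.com/docs/report.pdf")); // prints report.txt
        System.out.println(deriveTextFileName("docs/report.pdf"));                    // prints docs/report.txt
    }
}
```

A null result corresponds to the case in which the method above would hand a null name to FileOutputStream, so very short names are worth checking before opening the output. Note also that recent JDKs mark the URL(String) constructor as deprecated; it is used here only to match the original code.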

Add a main method

public static void main(String[] args) {
    PdfboxTest test = new PdfboxTest();
    try {
        // Extract the content of index.pdf on the C drive
        test.getText("C:/index.pdf");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Here are the imports as well, to save you the trouble

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.pdfbox.util.PDFTextStripper;

Review the usage of the File class

public File(String pathname)
Creates a new File instance by converting the given pathname string into an abstract pathname. If the given string is the empty string, then the result is the empty abstract pathname.

Parameters:
pathname - A pathname string

public File(URI uri)
Creates a new File instance by converting the given file: URI into an abstract pathname.
The exact form of a file: URI is system-dependent, hence the transformation performed by this constructor is also system-dependent.

For a given abstract pathname f it is guaranteed that

new File(f.toURI()).equals(f.getAbsoluteFile())

so long as the original abstract pathname, the URI, and the new abstract pathname are all created in (possibly different invocations of) the same Java virtual machine. This relationship typically does not hold, however, when a file: URI that is created in a virtual machine on one operating system is converted into an abstract pathname in a virtual machine on a different operating system.

Parameters:
uri - An absolute, hierarchical URI with a scheme equal to "file", a non-empty path component, and undefined authority, query, and fragment components
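
The two constructors quoted above can be exercised with a few lines of JDK-only code. This is a small sketch, not taken from the original article; the class name FileConstructors, the method roundTrips, and the file name index.pdf are illustrative assumptions, and the file does not need to exist.

```java
import java.io.File;
import java.net.URI;

// Illustrative class (hypothetical name) for the two File constructors.
final class FileConstructors {

    // Converts f to a file: URI and back, then compares the result with
    // the absolute form of f, as in the guarantee quoted above.
    static boolean roundTrips(File f) {
        URI uri = f.toURI();                       // absolute file: URI
        return new File(uri).equals(f.getAbsoluteFile());
    }

    public static void main(String[] args) {
        File relative = new File("index.pdf");     // File(String pathname)
        // The exact form of the URI is system-dependent
        System.out.println(relative.toURI());
        System.out.println(roundTrips(relative));
        // The empty string gives the empty abstract pathname
        System.out.println(new File("").getPath().isEmpty());
    }
}
```

Both constructors only build an abstract pathname; neither touches the file system, which is why the sketch works even when index.pdf is absent.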

Many people say that xpdf is better than PDFBox, but I personally find PDFBox quite practical!

OK!

A URL (Uniform Resource Locator) is the address of a WWW page. It consists of the following parts, from left to right:

· Internet resource type (scheme): indicates the kind of service the WWW client program is accessing. For example, "http://" indicates a WWW server, "ftp://" indicates an FTP server, "gopher://" indicates a Gopher server, and "news:" indicates a newsgroup.

· Server address (host): the domain name of the server where the WWW page is located.

· Port: needed only sometimes. To access some resources, the port number of the corresponding server must be given.

· Path: specifies the location of a resource on the server (the format is the same as in DOS, usually a directory/subdirectory/file name structure). Like the port, the path is not always required.

The URL format is scheme://host:port/path; for example, http://www.sohu.com/domain/hxwz is a URL.
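
The parts listed above can be read back from an address with the JDK's java.net.URI class. The following is a minimal sketch; the class name UrlParts, the method describe, and the example.com addresses are placeholders chosen for illustration.

```java
import java.net.URI;

// Illustrative class (hypothetical name): splits an address into the
// scheme, host, port, and path described above.
final class UrlParts {

    static String describe(String address) {
        URI uri = URI.create(address);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()   // -1 when no port is given
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints scheme=http host=example.com port=8080 path=/docs/index.pdf
        System.out.println(describe("http://example.com:8080/docs/index.pdf"));
        // prints scheme=http host=example.com port=-1 path=/docs/index.pdf
        System.out.println(describe("http://example.com/docs/index.pdf"));
    }
}
```

The second call shows the "not always required" case for the port: when it is omitted, getPort() returns -1 and the client falls back to the default port of the scheme.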
