Java reads PDF and MS Office documents

Last Update: 2017-01-24 Source: Internet

Author: User

Sometimes the text in a PDF cannot be copied, possibly because the file is encrypted, but it can be read with the open source PDFBox library.

There is also a project, iText, for creating PDF files.

PDFBox has two subprojects: FontBox, a Java class library that handles PDF fonts, and JempBox, a Java class library that handles XMP metadata.

A simple example:

It requires the pdfbox-app-1.6.0.jar package on the classpath.

package pdf;

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class StripPDFContent {

    // Extracts the text of pages 1 to 10; returns "" if extraction fails.
    public static String getText(File file) throws Exception {
        boolean sort = false;
        int startPage = 1;
        int endPage = 10;
        PDDocument document = null;
        try {
            // Any load failure is handled by the catch block below,
            // so the stripper never sees a null document.
            document = PDDocument.load(file);
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(sort);
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(document);
        } catch (Exception e) {
            e.printStackTrace();
            return "";
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/orisun/123.pdf");
        try {
            String cont = getText(file);
            System.out.println(cont);
        } catch (Exception e) {
            System.out.println("Strip failed.");
            e.printStackTrace();
        }
    }
}

Apache's POI project can be used to process MS Office documents, and there is also a .NET version hosted on CodePlex. POI provides Java APIs for creating and maintaining files based on the OOXML and OLE2 file formats. Most MS Office files are in the OLE2 format. POI supports Outlook through the HSMF subproject, Visio through the HDGF subproject, and Publisher through the HPBF subproject.
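To make the OOXML/OLE2 distinction concrete, here is a minimal sketch that guesses the container format from the first bytes of a file. The class `OfficeFormatSniffer` and its method are hypothetical helpers written for illustration, not part of POI, and the sketch uses only the JDK. It relies on two facts: OLE2 compound documents begin with the signature D0 CF 11 E0 A1 B1 1A E1, and OOXML files are ZIP archives, which begin with "PK" followed by 03 04. A ZIP signature alone does not prove that a file is OOXML.

```java
// Hypothetical helper for illustration; not part of Apache POI.
class OfficeFormatSniffer {

    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature at the start of a ZIP archive (.docx, .xlsx, .pptx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Returns "OLE2", "ZIP" (possibly OOXML), or "UNKNOWN" for the given leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(zipHead));
    }
}
```

In practice one would read the first eight bytes of the file into the array before calling `sniff`, then choose the matching POI API for that format.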

A simple example of extracting text from a Word document using POI:

It requires both the poi-3.7.jar and poi-scratchpad-3.7.jar packages on the classpath.

package msoffice;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

public class Word {

    // Extract all content directly
    public static String readDoc1(InputStream is) throws IOException {
        WordExtractor extractor = new WordExtractor(is);
        return extractor.getText();
    }

    // Extract by Section, then Paragraph, then CharacterRun
    public static void readDoc2(InputStream is) throws IOException {
        HWPFDocument doc = new HWPFDocument(is);
        Range r = doc.getRange();
        for (int x = 0; x < r.numSections(); x++) {
            Section s = r.getSection(x);
            for (int y = 0; y < s.numParagraphs(); y++) {
                Paragraph p = s.getParagraph(y);
                for (int z = 0; z < p.numCharacterRuns(); z++) {
                    CharacterRun run = p.getCharacterRun(z);
                    String text = run.text();
                    System.out.print(text);
                }
            }
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/orisun/1.doc");
        try {
            FileInputStream fin = new FileInputStream(file);
            String cont = readDoc1(fin);
            System.out.println(cont);
            fin.close();
            // The first stream has been consumed, so open a fresh one.
            fin = new FileInputStream(file);
            readDoc2(fin);
            fin.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

An example of extracting text from a PPT file with POI:

package msoffice;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.usermodel.SlideShow;

public class PPT {

    // Extract the full contents of the presentation directly
    public static String readDoc1(InputStream is) throws IOException {
        PowerPointExtractor extractor = new PowerPointExtractor(is);
        return extractor.getText();
    }

    // Read the presentation slide by slide
    public static void readDoc2(InputStream is) throws IOException {
        SlideShow ss = new SlideShow(new HSLFSlideShow(is));
        Slide[] slides = ss.getSlides();
        for (int i = 0; i < slides.length; i++) {
            // Read the title of a slide
            String title = slides[i].getTitle();
            System.out.println("title: " + title);
            // Read the contents of a slide (including the title)
            TextRun[] runs = slides[i].getTextRuns();
            for (int j = 0; j < runs.length; j++) {
                System.out.println(runs[j].getText());
            }
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/orisun/2.ppt");
        try {
            FileInputStream fin = new FileInputStream(file);
            String cont = readDoc1(fin);
            System.out.println(cont);
            fin.close();
            // The first stream has been consumed, so open a fresh one.
            fin = new FileInputStream(file);
            readDoc2(fin);
            fin.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

An Excel file is a workbook, and a workbook consists of one or more sheets.

A simple example of extracting Excel content with POI:

package msoffice;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.usermodel.Row;

public class Excel {

    // Read all the contents of the Excel file directly
    public static String readDoc1(InputStream is) throws IOException {
        HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(is));
        ExcelExtractor extractor = new ExcelExtractor(wb);
        extractor.setFormulasNotResults(false);
        extractor.setIncludeSheetNames(true);
        return extractor.getText();
    }

    // Read at finer granularity: sheet, row, and even cell
    public static double getAvg(InputStream is) throws IOException {
        HSSFWorkbook wb = new HSSFWorkbook(new POIFSFileSystem(is));
        // Get the first sheet
        HSSFSheet sheet = wb.getSheetAt(0);
        double numerator = 0.0;
        double denominator = 0.0;
        // Traverse the sheet by row
        Iterator<Row> rIter = sheet.rowIterator();
        while (rIter.hasNext()) {
            HSSFRow row = (HSSFRow) rIter.next();
            // Both cells are read from column index 4, as in the original listing.
            // A real weighted average needs cell1 (value) and cell2 (weight)
            // to come from different columns.
            HSSFCell cell1 = row.getCell(4);
            HSSFCell cell2 = row.getCell(4);
            if (cell1.getCellType() != HSSFCell.CELL_TYPE_NUMERIC) {
                System.err.println("Numeric type error!");
                System.exit(-2);
            }
            if (cell2.getCellType() != HSSFCell.CELL_TYPE_NUMERIC) {
                System.err.println("Numeric type error!");
                System.exit(-2);
            }
            denominator += Double.parseDouble(cell2.toString().trim());
            numerator += Double.parseDouble(cell2.toString().trim()) * Float.parseFloat(cell1.toString().trim());
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        File file = new File("/home/orisun/3.xls");
        try {
            FileInputStream fin = new FileInputStream(file);
            String cont = readDoc1(fin);
            System.out.println(cont);
            fin.close();
            // The first stream has been consumed, so open a fresh one.
            fin = new FileInputStream(file);
            System.out.println("Weighted average score: " + getAvg(fin));
            fin.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
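The averaging method in the Excel example accumulates a weighted sum and a sum of weights, then divides one by the other. The following minimal sketch isolates that arithmetic from POI so it can be checked on its own. The class `WeightedAverage` and its parameter names are hypothetical and not taken from the original listing; it assumes the values and weights have already been read from two separate columns.

```java
// Hypothetical helper for illustration; independent of Apache POI.
class WeightedAverage {

    // Computes sum(values[i] * weights[i]) / sum(weights[i]).
    static double of(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must have the same length");
        }
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < values.length; i++) {
            numerator += values[i] * weights[i];
            denominator += weights[i];
        }
        if (denominator == 0.0) {
            throw new IllegalArgumentException("sum of weights must not be zero");
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] scores = {90.0, 80.0};
        double[] credits = {3.0, 1.0};
        // (90 * 3 + 80 * 1) / (3 + 1) = 350 / 4
        System.out.println(of(scores, credits)); // prints 87.5
    }
}
```

Rejecting a zero sum of weights avoids silently returning NaN or infinity when the sheet is empty.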
