Parsing PDF files in Java (PDFBox, iText): exporting the embedded images from a PDF and removing a watermark

Source: Internet
Author: User

Some time ago I needed to parse PDFs and spent a lot of time learning PDFBox and iText. Both are open source libraries for working with PDFs, available for both Java and C#. As a newcomer to these two libraries, I found that there are still too few resources about them on Baidu. My task was a PDF processing job; I searched Baidu for a long time without finding an answer, and finally found it on the iText official website and on Stack Overflow.

Comparing the two in the end, iText has considerably more features than PDFBox, and in my own comparison iText was also a little faster at processing PDF files. The iText website provides examples and answers to common problems (the questions come from Stack Overflow), so I finally chose iText. The official iText 5 examples are the main reference. One of the most important things to note is that different versions of iText need different code to solve the same problem; in my view iText 7 changed quite a lot relative to iText 5. The examples below use iText 5.5.11, which can be downloaded as a jar for Java or as a DLL package for C#. Using iText from Java and from C# is essentially the same.

Below are some examples of PDF operations. They all deal with processing existing PDFs; for how to create a PDF file, see the iText website. iText exposes the contents of a PDF file through dictionary objects, which mirror the structure of the PDF format itself as described in the PDF Reference.
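To make the dictionary idea a little more concrete, here is a minimal sketch that models a page dictionary with plain Java maps instead of iText classes. The key names (/Resources, /XObject, /Subtype, /Image) come from the PDF format; the class PageDictionarySketch, its methods, and the sample entries /Im1 and /Fm1 are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: plain maps play the role of PDF dictionaries.
final class PageDictionarySketch {

    // Returns the names of all /XObject entries whose /Subtype is /Image.
    static List<String> imageNames(Map<String, Object> page) {
        List<String> names = new ArrayList<>();
        Object res = page.get("/Resources");
        if (!(res instanceof Map)) {
            return names; // page without resources
        }
        Object xobj = ((Map<?, ?>) res).get("/XObject");
        if (!(xobj instanceof Map)) {
            return names; // resources without XObjects
        }
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) xobj).entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map
                    && "/Image".equals(((Map<?, ?>) value).get("/Subtype"))) {
                names.add(String.valueOf(entry.getKey()));
            }
        }
        return names;
    }

    // Builds a made-up page with one image XObject and one form XObject.
    static Map<String, Object> samplePage() {
        Map<String, Object> image = new LinkedHashMap<>();
        image.put("/Subtype", "/Image");
        Map<String, Object> form = new LinkedHashMap<>();
        form.put("/Subtype", "/Form");

        Map<String, Object> xobjects = new LinkedHashMap<>();
        xobjects.put("/Im1", image);
        xobjects.put("/Fm1", form);

        Map<String, Object> resources = new LinkedHashMap<>();
        resources.put("/XObject", xobjects);

        Map<String, Object> page = new LinkedHashMap<>();
        page.put("/Resources", resources);
        return page;
    }

    public static void main(String[] args) {
        System.out.println(imageNames(samplePage())); // prints [/Im1]
    }
}
```

With iText the same walk is done on PdfDictionary objects and PdfName keys, and indirect references have to be resolved along the way; the sketch leaves that part out.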

1. Exporting the embedded images from a PDF

import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.exceptions.UnsupportedPdfException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

public class PdfImageExtractor {

    public static void extractImage(String filename) {
        PdfReader reader = null;
        try {
            // Read the PDF file
            reader = new PdfReader(filename);
            // Get the number of pages in the PDF file
            int sumPage = reader.getNumberOfPages();
            // Visit each page in the PDF file
            for (int i = 1; i <= sumPage; i++) {
                // Get the dictionary object of the page
                PdfDictionary dictionary = reader.getPageN(i);
                // Get the resources dictionary of the page
                PdfDictionary res = (PdfDictionary) PdfReader.getPdfObject(dictionary.get(PdfName.RESOURCES));
                if (res == null) {
                    continue;
                }
                // Get the XObject dictionary, which holds the image objects
                PdfDictionary xobj = (PdfDictionary) PdfReader.getPdfObject(res.get(PdfName.XOBJECT));
                if (xobj == null) {
                    continue;
                }
                // Counter so that several images on one page do not overwrite each other
                int imageIndex = 0;
                for (PdfName name : xobj.getKeys()) {
                    PdfObject obj = xobj.get(name);
                    if (!obj.isIndirect()) {
                        continue;
                    }
                    PdfDictionary tg = (PdfDictionary) PdfReader.getPdfObject(obj);
                    PdfName type = (PdfName) PdfReader.getPdfObject(tg.get(PdfName.SUBTYPE));
                    if (!PdfName.IMAGE.equals(type)) {
                        continue;
                    }
                    PdfObject object = PdfReader.getPdfObject(obj);
                    if (!object.isStream()) {
                        continue;
                    }
                    PRStream prstream = (PRStream) object;
                    byte[] b;
                    try {
                        // Decoded stream bytes
                        b = PdfReader.getStreamBytes(prstream);
                    } catch (UnsupportedPdfException e) {
                        // Filter not supported by iText: fall back to the raw stream bytes
                        b = PdfReader.getStreamBytesRaw(prstream);
                    }
                    try (FileOutputStream output = new FileOutputStream(
                            String.format("d:/pdf/output%d_%d.jpg", i, imageIndex++))) {
                        output.write(b);
                        output.flush();
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}

I found this example on Baidu, but it was incomplete. It exports the images contained in a PDF, and I have filled in the missing parts. The program still has a problem: for some PDF files, the exported images cannot be opened.
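A likely reason for the unreadable files is that the bytes are always saved with a .jpg extension, although an image stream in a PDF only holds a complete JPEG file when it is stored JPEG-encoded; other images come out as raw sample data or in another format. One way to at least detect this is to look at the first bytes before choosing a file name. The sketch below uses only the JDK; the class ImageTypeSniffer and the "bin" fallback are assumptions for illustration, not something iText provides.

```java
// Hypothetical helper: guesses a file extension from the leading "magic" bytes.
final class ImageTypeSniffer {

    // Returns "jpg", "png" or "gif" for known signatures, otherwise "bin".
    static String guessExtension(byte[] data) {
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpg"; // JPEG start-of-image marker
        }
        if (startsWith(data, 0x89, 'P', 'N', 'G')) {
            return "png";
        }
        if (startsWith(data, 'G', 'I', 'F', '8')) {
            return "gif";
        }
        return "bin"; // unknown: probably raw samples, not a viewable file
    }

    // Compares the first bytes of data with the given unsigned values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(guessExtension(jpegHeader)); // prints jpg
    }
}
```

In the export method, the result could replace the hard-coded extension, and streams reported as "bin" could be skipped or converted with an image library instead of being written as .jpg.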

2. Removing watermark text from a PDF by its font

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.parser.ContentByteUtils;
import com.itextpdf.text.pdf.parser.ContentOperator;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

/**
 * <a href="http://">
 * Removing Watermark from PDF iTextSharp
 * </a>
 * <p>
 * This class presents a simple content stream editing framework. As is, it creates an equivalent
 * copy of the original page content stream. To actually edit, simply override the method
 * {@link #write(PdfContentStreamProcessor, PdfLiteral, List)} to not (as in this class) write
 * the given operations as they are but change them in some fancy way.
 * </p>
 *
 * @author mkl
 */
public class PdfContentStreamEditor extends PdfContentStreamProcessor {

    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader("Input.pdf");
            OutputStream result = new FileOutputStream(new File("Out.pdf"));
            PdfStamper pdfStamper = new PdfStamper(reader, result);
            PdfContentStreamEditor identityEditor = new PdfContentStreamEditor();
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                identityEditor.editPage(pdfStamper, i);
            }
            pdfStamper.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    /**
     * This method edits the immediate contents of a page, i.e. its content stream.
     * It explicitly does not descend into form xobjects, patterns, or annotations.
     */
    public void editPage(PdfStamper pdfStamper, int pageNum) throws IOException {
        PdfReader pdfReader = pdfStamper.getReader();
        PdfDictionary page = pdfReader.getPageN(pageNum);
        byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
        page.remove(PdfName.CONTENTS);
        editContent(pageContentInput, page.getAsDict(PdfName.RESOURCES), pdfStamper.getUnderContent(pageNum));
    }

    /**
     * This method processes the content bytes and outputs to the given canvas.
     * It explicitly does not descend into form xobjects, patterns, or annotations.
     */
    public void editContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas) {
        this.canvas = canvas;
        processContent(contentBytes, resources);
        this.canvas = null;
    }

    /**
     * <p>
     * This method writes the content stream operations to the target canvas. The default
     * implementation writes them as they come, so it essentially generates identical
     * copies of the original instructions the {@link ContentOperatorWrapper} instances
     * forward to it.
     * </p>
     * <p>
     * Override to achieve some fancy editing effect.
     * </p>
     */
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        int index = 0;
        for (PdfObject object : operands) {
            object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append(operands.size() > ++index ? (byte) ' ' : (byte) '\n');
        }
    }

    //
    // constructor giving the parent a dummy listener to talk to
    //
    public PdfContentStreamEditor() {
        super(new DummyRenderListener());
    }

    //
    // Overrides of PdfContentStreamProcessor methods
    //
    @Override
    public ContentOperator registerContentOperator(String operatorString, ContentOperator operator) {
        ContentOperatorWrapper wrapper = new ContentOperatorWrapper();
        wrapper.setOriginalOperator(operator);
        ContentOperator formerOperator = super.registerContentOperator(operatorString, wrapper);
        return formerOperator instanceof ContentOperatorWrapper
                ? ((ContentOperatorWrapper) formerOperator).getOriginalOperator()
                : formerOperator;
    }

    @Override
    public void processContent(byte[] contentBytes, PdfDictionary resources) {
        this.resources = resources;
        super.processContent(contentBytes, resources);
        this.resources = null;
    }

    //
    // Members holding the output canvas and the resources
    //
    protected PdfContentByte canvas = null;
    protected PdfDictionary resources = null;

    //
    // A content operator class to wrap all content operators to forward the invocation to the editor
    //
    class ContentOperatorWrapper implements ContentOperator {
        public ContentOperator getOriginalOperator() {
            return originalOperator;
        }

        public void setOriginalOperator(ContentOperator originalOperator) {
            this.originalOperator = originalOperator;
        }

        @Override
        public void invoke(PdfContentStreamProcessor processor, PdfLiteral operator, ArrayList<PdfObject> operands) throws Exception {
            if (originalOperator != null && !"Do".equals(operator.toString())) {
                originalOperator.invoke(processor, operator, operands);
            }
            write(processor, operator, operands);
        }

        private ContentOperator originalOperator = null;
    }

    //
    // A dummy render listener to give to the underlying content stream processor to feed events to
    //
    static class DummyRenderListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) { }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }
    }
}

The class above is a utility class taken from the official examples. The following code uses it to remove the watermark text:

// Needs the same imports as PdfContentStreamEditor, plus java.util.Arrays
public static void main(String[] args) {
    try {
        PdfReader pdfReader = new PdfReader("D:/1.pdf");
        FileOutputStream os = new FileOutputStream("D:/reader.pdf");
        PdfStamper stamper = new PdfStamper(pdfReader, os);
        PdfContentStreamEditor editor = new PdfContentStreamEditor() {
            @Override
            protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
                String operatorString = operator.toString();
                // Tj takes a string operand and paints the corresponding glyphs, using the
                // current font and the other text-related graphics state parameters
                // Tr sets the text rendering mode
                // A text object begins with BT and ends with ET
                final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
                System.out.println(operatorString);
                if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
                    // Drop text drawn in the watermark font ("BoldMT" is the end of its font name)
                    if (gs().getFont().getPostscriptFontName().endsWith("BoldMT")) {
                        return;
                    }
                }
                super.write(processor, operator, operands);
            }
        };
        for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
            editor.editPage(stamper, i);
        }
        stamper.close();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (DocumentException e) {
        e.printStackTrace();
    }
}
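The decision made inside the overridden write method can be looked at on its own: an operation is dropped only if it is a text-showing operator and the current font name ends with the watermark font's suffix. The following sketch isolates that rule with plain strings so it can be tried without iText; the class WatermarkFilterSketch and the method shouldDrop are hypothetical names, and the font names in main are just sample values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper isolating the "drop this operation?" rule.
final class WatermarkFilterSketch {

    // The four PDF operators that show text.
    static final List<String> TEXT_SHOWING_OPERATORS =
            Arrays.asList("Tj", "'", "\"", "TJ");

    // True if the operator shows text and the font name ends with the given suffix.
    static boolean shouldDrop(String operator, String postscriptFontName, String watermarkFontSuffix) {
        return TEXT_SHOWING_OPERATORS.contains(operator)
                && postscriptFontName != null
                && postscriptFontName.endsWith(watermarkFontSuffix);
    }

    public static void main(String[] args) {
        System.out.println(shouldDrop("Tj", "Arial-BoldMT", "BoldMT")); // prints true
        System.out.println(shouldDrop("Tf", "Arial-BoldMT", "BoldMT")); // prints false
        System.out.println(shouldDrop("TJ", "ArialMT", "BoldMT"));      // prints false
    }
}
```

The rule only singles out the watermark when no other text on the page uses a font with the same suffix; otherwise that text is removed as well, so the suffix has to be chosen per document.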

RenderListener is very useful when processing PDF files with iText. It is an interface in iText: you write a class that implements it and override its methods, putting in them whatever logic you need to handle the PDF file. The interface provides callbacks for text and for images, and you have to implement them. I found this on Stack Overflow, which really does seem to have an answer for everything.
