C#: Convert PDF to Text

Update

February 27, 2014: This article originally described only how to parse PDF files with PDFBox. It has since been extended to include samples that use IFilter and iTextSharp.

This article and the accompanying Visual Studio project have been updated to PDFBox 1.8.4, the latest version at the time of the update. You can download the complete project, including all dependencies, from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (resolving the dependencies by hand can be tricky).

How to parse PDF files

There are several main ways to extract text from a PDF file in .NET:

  • The Microsoft IFilter interface with the Adobe IFilter implementation;

  • iTextSharp;

  • PDFBox.

Unfortunately, none of these PDF parsing solutions is perfect. Each one is discussed below.
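Since no single approach handles every file, one option is to try several extractors in turn and keep the first usable result. The following is only a minimal sketch of that idea. It assumes each extraction routine is wrapped as a Func<string, string>; the names PdfTextFallback and ExtractWithFallback are hypothetical and do not come from any of the libraries discussed here.

```csharp
using System;
using System.Collections.Generic;

public static class PdfTextFallback
{
    // Tries each extractor in order and returns the first non-blank result.
    // An extractor that throws is treated as a failure and skipped.
    // Returns an empty string if every extractor fails or yields no text.
    public static string ExtractWithFallback(
        string path, IEnumerable<Func<string, string>> extractors)
    {
        foreach (Func<string, string> extract in extractors)
        {
            try
            {
                string text = extract(path);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    return text;
                }
            }
            catch (Exception)
            {
                // Deliberately broad for a sketch; real code would log the
                // error and catch narrower exception types.
            }
        }

        return string.Empty;
    }
}
```

The order of the list decides which library is preferred, so a faster or more reliable extractor would normally go first.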

Adobe PDF IFilter

To use the IFilter interface to parse PDF files, you need:

  • Windows 2000 or later

  • Adobe Acrobat or Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])

  • An IFilter COM wrapper class [dotlucene.net]

Sample code:

    using IFilter;

    // ...

    public static string ExtractTextFromPdf(string path)
    {
        return DefaultParser.Extract(path);
    }

Disadvantages:

iTextSharp

iTextSharp (http://sourceforge.net/projects/itextsharp/) is a .NET port of iText (http://itextpdf.com/), a Java library for PDF manipulation. It focuses mainly on editing PDF files rather than reading them, but it does support extracting text from PDF files (although that support is somewhat limited).

Sample code:

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // ...

    public static string ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            // Page numbers start at 1.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }

            return text.ToString();
        }
    }

Credit: Member 10364982

Disadvantages:

PDFBox

PDFBox is another Java PDF library. It can also be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox built with IKVM.NET (you only need to visit the PDFBox download page).

To use PDFBox in .NET, you need to reference:

  • IKVM.OpenJDK.Core.dll

  • IKVM.OpenJDK.SwingAWT.dll

  • pdfbox-1.8.4.dll

Also copy the following files to the bin folder:

  • commons-logging.dll

  • fontbox-1.8.4.dll

  • IKVM.OpenJDK.Util.dll

  • IKVM.Runtime.dll

Parsing a PDF file with PDFBox is very simple:

    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    // ...

    private static string ExtractTextFromPdf(string path)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }

The required files add up to about 18 MB:

  • IKVM.OpenJDK.Core.dll (4 MB)

  • IKVM.OpenJDK.SwingAWT.dll (6 MB)

  • pdfbox-1.8.4.dll (4 MB)

  • commons-logging.dll (82 kB)

  • fontbox-1.8.4.dll (180 kB)

  • IKVM.OpenJDK.Util.dll (2 MB)

  • IKVM.Runtime.dll (1 MB)

Speed: parsing the U.S. Copyright Act PDF file (5.1 MB) takes 13 seconds.

Thanks to bobrien100 for the improvement suggestions.
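To reproduce a timing like this on your own files, the extraction call can be wrapped in a stopwatch. The following is a minimal sketch using System.Diagnostics.Stopwatch; ExtractionTimer and Measure are hypothetical names introduced for illustration, and the extractor is passed in as a delegate so that any of the routines in this article could be measured the same way.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs the extractor once on the given path and returns the elapsed
    // wall-clock time. The extracted text is handed back through 'text'.
    public static TimeSpan Measure(
        Func<string, string> extract, string path, out string text)
    {
        Stopwatch watch = Stopwatch.StartNew();
        text = extract(path);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

A single run also includes one-time start-up costs such as assembly loading, so timing a second call on the same file may give a noticeably different figure.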

Disadvantages:

Related information

  • This article (and any subsequent updates) is also available on SquarePDF.NET.

Address: http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
