Converting PDF to Text in C#


Update

February 27, 2014: This article originally described only how to parse PDF files with PDFBox. It has since been extended with samples that use IFilter and iTextSharp.

This article and the corresponding Visual Studio project have been updated to the current PDFBox version (1.8.4). You can download a complete project that contains all dependencies (resolving them by hand is a bit tricky) from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/

How to parse a PDF file

The main ways to extract text from PDF files in .NET are:

Microsoft's IFilter interface and Adobe's IFilter implementation;

iTextSharp;

PDFBox.

Unfortunately, none of these PDF parsing approaches is perfect. Each one is discussed below.

Adobe PDF IFilter

To parse PDF files through the IFilter interface, you need:

Windows 2000 or later

Adobe Acrobat or Reader 7.0.5+ (or a separate Adobe PDF IFilter [adobe.com])

An IFilter COM wrapper class [Dotlucene.net]

Sample code:

    using IFilter;
    // ...
    public static string ExtractTextFromPdf(string path)
    {
      // The COM wrapper picks the IFilter registered for the file type.
      return DefaultParser.Extract(path);
    }

Disadvantages:

It relies on COM interop to drive the IFilter interface, which is unreliable (the combination of IFilter COM and Adobe PDF IFilter is particularly troublesome).

Adobe IFilter must be installed separately on the target system, which can be painful if you need to distribute an indexing solution to other people.

iTextSharp

iTextSharp (http://sourceforge.net/projects/itextsharp/) is a .NET port of iText (http://itextpdf.com/), a Java PDF manipulation library. It focuses on editing PDFs rather than reading them, but it does support extracting text from PDFs (although it is a bit of overkill for that purpose).

Sample code:

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    // ...
    public static string ExtractTextFromPdf(string path)
    {
      using (PdfReader reader = new PdfReader(path))
      {
        StringBuilder text = new StringBuilder();

        // Page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
      }
    }

Credit: Member No. 10364982

Disadvantages:

Licensing: a commercial license is needed if the AGPL license does not suit you.

PDFBox

PDFBox is another Java PDF library. It can also be used together with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox built with IKVM.NET (see the PDFBox download page).

Using PDFBox in .NET requires references to:

IKVM.OpenJDK.Core.dll

IKVM.OpenJDK.SwingAWT.dll

pdfbox-1.8.4.dll

You also need to copy the following files to the bin folder:

commons-logging.dll

fontbox-1.8.4.dll

IKVM.OpenJDK.Util.dll

IKVM.Runtime.dll

Parsing PDFs with PDFBox is simple:

    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;
    // ...
    private static string ExtractTextFromPdf(string path)
    {
      PDDocument doc = null;
      try {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
      }
      finally {
        // Always release the document, even if parsing fails.
        if (doc != null) {
          doc.close();
        }
      }
    }

The compiled binaries add up to almost 18 MB:

IKVM.OpenJDK.Core.dll (4 MB)

IKVM.OpenJDK.SwingAWT.dll (6 MB)

pdfbox-1.8.4.dll (4 MB)

commons-logging.dll (KB)

fontbox-1.8.4.dll (KB)

IKVM.OpenJDK.Util.dll (2 MB)

IKVM.Runtime.dll (1 MB)

Speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took 13 seconds.

Thanks to bobrien100 for the improvement suggestions.

Disadvantages:

IKVM.NET dependency (MB)

Speed (especially IKVM.NET start-up time)
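
To see how much of the parsing time is start-up cost rather than actual parsing, one option is to time the same extraction twice in one process. The following is only a minimal sketch: the ExtractionTimer class and its method names are hypothetical helpers introduced here for illustration and are not part of PDFBox, IKVM.NET, or the sample project. It assumes the extraction routine takes a file path and returns the extracted text.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of any library discussed in this article).
public static class ExtractionTimer
{
    // Runs the extractor once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Func<string, string> extract, string path)
    {
        Stopwatch watch = Stopwatch.StartNew();
        extract(path);
        watch.Stop();
        return watch.Elapsed;
    }

    // Times two consecutive calls. The first call includes any one-time
    // start-up work (such as runtime initialization); the second does not.
    public static void Report(Func<string, string> extract, string path)
    {
        TimeSpan first = Time(extract, path);
        TimeSpan second = Time(extract, path);

        Console.WriteLine("First call (includes start-up): {0:F2} s", first.TotalSeconds);
        Console.WriteLine("Second call: {0:F2} s", second.TotalSeconds);
    }
}
```

Any method with a matching signature can be passed in, for example ExtractionTimer.Report(path => Extract(path), "sample.pdf"), where Extract stands for one of the extraction methods shown earlier and sample.pdf is a placeholder file name. A large gap between the two timings would suggest that start-up, not parsing, dominates.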
