This post summarizes several ways to extract text from PDF files in the .NET environment.
1. IKVM version of PDFBox: As far as I know, only the IKVM build of PDFBox extracts text from PDF files reasonably well. For more information about PDFBox, visit http://www.pdbox.org; see also the article on CodeProject: http://www.codeproject.com/csharp/4102text.asp;
2. Acrobat SDK: use the Acrobat SDK (it is not cheap);
3. XPDF: if conditions permit, consider XPDF's pdftotext. XPDF is a PDF parsing library written in C that ships several command-line tools together with their source code (if you are familiar with both C and .NET, you may be able to compile it for use in the .NET environment). It is released under the GNU GPL, however, so commercial applications need a paid license. More information: http://www.foolabs.com/xpdf
4. Ghostscript: another option to consider is Ghostscript. The official website is www.cs.wisc.edu/~ghost/; for the text extraction method, search Google for ps2txt;
5. Other related resources:
http://www.mj10777.de/NETFramework/Desktop/SharpZipLib/PdfToTxt/index.htm
Extract Text from PDF File: http://www.codeproject.com/Purgatory/DotNetPDF.asp?df=100&forumid=104443
Code to extract plain text from a PDF file: http://www.codeproject.com/cpp/ExtractPDFText.asp?df=100&forumid=47947
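For options 3 and 4, the extraction itself is just a command-line call, so a rough sketch may help. The Python below only builds and runs the command lines; it assumes the pdftotext and gs executables are installed and on the PATH (on Windows the Ghostscript console binary is usually named gswin32c or gswin64c instead of gs), and that your Ghostscript build includes the txtwrite device. The function names are my own and purely illustrative.

```python
import subprocess
from typing import List


def pdftotext_command(pdf_path: str, txt_path: str, layout: bool = False) -> List[str]:
    """Build an XPDF pdftotext command line (usage: pdftotext [options] PDF-file [text-file])."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical layout of the page
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def ghostscript_command(pdf_path: str, txt_path: str, gs_exe: str = "gs") -> List[str]:
    """Build a Ghostscript command line that writes the page text via the txtwrite device."""
    return [
        gs_exe,
        "-q",            # suppress startup messages
        "-dNOPAUSE",     # do not pause between pages
        "-dBATCH",       # exit when the input is processed
        "-sDEVICE=txtwrite",
        "-sOutputFile=" + txt_path,
        pdf_path,
    ]


def run(cmd: List[str]) -> None:
    """Run the external tool; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(pdftotext_command("input.pdf", "output.txt", layout=True)))
    print(" ".join(ghostscript_command("input.pdf", "output.txt")))
```

From .NET the same argument lists can be passed to System.Diagnostics.Process; nothing in the commands is specific to Python.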
By the way, many people have asked how to extract text with iTextSharp. At present iTextSharp does not support text extraction, and it cannot extract images either. Through my own experiments I have only managed to extract images in the simplest format (JPEG); I am still studying how to handle the other formats.
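The reason JPEG is the easy case is that a PDF normally stores a JPEG image unchanged in a stream whose dictionary names the DCTDecode filter, so the stream bytes can be saved directly as a .jpg file. The following is only a rough sketch of that idea, not how iTextSharp works: it scans the raw file bytes instead of parsing the PDF properly, and it assumes the image objects are not encrypted, are not packed inside object streams, and use DCTDecode as their only filter. The function name is hypothetical.

```python
import re
from typing import List

# Start of a stream body: the keyword "stream" (but not "endstream") followed by a newline.
_STREAM_START = re.compile(rb"(?<!end)stream\r?\n")


def extract_jpeg_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the raw bytes of every stream that declares the /DCTDecode filter."""
    images = []
    for match in _STREAM_START.finditer(pdf_bytes):
        # The stream dictionary lies between the preceding "obj" keyword and "stream".
        obj_pos = pdf_bytes.rfind(b"obj", 0, match.start())
        if obj_pos == -1:
            continue
        header = pdf_bytes[obj_pos:match.start()]
        if b"/DCTDecode" not in header:
            continue
        end = pdf_bytes.find(b"endstream", match.end())
        if end == -1:
            continue
        # Drop the end-of-line marker that precedes "endstream".
        data = pdf_bytes[match.end():end].rstrip(b"\r\n")
        # A JPEG file begins with the SOI marker FF D8.
        if data.startswith(b"\xff\xd8"):
            images.append(data)
    return images


if __name__ == "__main__":
    # Hypothetical input file name, for illustration only.
    with open("input.pdf", "rb") as f:
        for i, jpeg in enumerate(extract_jpeg_streams(f.read())):
            with open("image%d.jpg" % i, "wb") as out:
                out.write(jpeg)
```

Images stored with other filters (for example FlateDecode) hold raw pixel data rather than a complete image file, which is why they need more work than simply copying the stream.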