1. IKVM build of PDFBox: as far as I know, the IKVM build of PDFBox is the only option that extracts text from PDF files reasonably well. For more information about PDFBox, visit http://www.pdbox.org.
For an example of how it is used, see http://www.codeproject.com/csharp/4102text.asp;
2. Acrobat SDK: use the Acrobat SDK (it is not cheap);
3. Xpdf: if circumstances permit, you can use the pdftotext tool that ships with Xpdf.
Xpdf is a PDF parsing library written in C/C++. It provides several tools and is open source (if you are familiar with both C and .NET, you can compile it for use in a .NET environment). However, it is released under the GNU license, so commercial applications have to pay for a license;
For more information, see http://www.foolabs.com/xpdf
4. Ghostscript: another option to consider is Ghostscript. The official website is www.cs.wisc.edu/~ghost/. For the text extraction method, search Google for ps2txt;
5. Other related resources:
http://www.mj10777.de/NETFramework/Desktop/SharpZipLib/PdfToTxt/index.htm
Extract text from PDF file: http://www.codeproject.com/Purgatory/DotNetPDF.asp?df=100&forumid=104443
Code to extract plain text from a PDF file: http://www.codeproject.com/cpp/ExtractPDFText.asp?df=100&forumid=47947
By the way, many friends have asked about text extraction in iTextSharp. At present, iTextSharp does not support text extraction, and it does not support image extraction either. Through my own experiments I have only managed to extract images in the simplest format (JPEG); I am still studying how to handle the others.
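
For the Xpdf option, the simplest route is usually to call the pdftotext command-line tool rather than compile the library yourself. The following is only a minimal sketch in Python, assuming pdftotext is installed and on the PATH. The function names build_pdftotext_command and extract_text are hypothetical helpers made up for this illustration; the -enc and -layout options and the use of "-" to send the text to standard output come from the pdftotext manual. The same command line can be launched from .NET in much the same way.

```python
import shutil
import subprocess
import sys


def build_pdftotext_command(pdf_path, keep_layout=False, encoding="UTF-8"):
    """Build the argument list for Xpdf's pdftotext (hypothetical helper)."""
    command = ["pdftotext", "-enc", encoding]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    # Assumption: the pdftotext executable is reachable through PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract.py some-file.pdf
    print(extract_text(sys.argv[1]))
```

Keeping the command construction separate from the process call makes it easy to check the arguments without having a PDF file at hand.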