This article is reposted from: http://blog.csdn.net/wangqiuyun/article/details/8548779
Two main libraries are used to read PDF text under .NET: PDFBox and iTextSharp.
First, PDFBox. This class library is said to be very powerful; only a brief introduction is given here:
1. Download PDFBox
Download address: http://sourceforge.net/projects/pdfbox/
2. Reference the dynamic link libraries
Extract the downloaded PDFBox package and locate the bin directory. Add the following 4 DLL files to the project as references:
IKVM.GNU.Classpath.dll
PDFBox-0.7.3.dll
FontBox-0.1.0-dev.dll
IKVM.Runtime.dll
Then import the following 2 namespaces in the source file:
using org.pdfbox.pdmodel;
using org.pdfbox.util;
3. Use the API; see the code:
    using System.IO;
    using System.Text;
    using org.pdfbox.pdmodel;
    using org.pdfbox.util;

    // Extracts the text of a PDF file and writes it to a text file.
    public void Pdf2txt(FileInfo file, FileInfo txtfile)
    {
        PDDocument doc = PDDocument.load(file.FullName);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        string text = pdfStripper.getText(doc);
        // Release the document once the text has been extracted.
        doc.close();
        // Overwrite the target file (append = false), encoded as gb2312.
        StreamWriter swPdfChange = new StreamWriter(txtfile.FullName, false, Encoding.GetEncoding("gb2312"));
        swPdfChange.Write(text);
        swPdfChange.Close();
    }
iTextSharp is more often used to generate PDFs, but its ability to read PDFs is not bad either. It is used as follows:
1. Download iTextSharp
Download address: http://sourceforge.net/projects/itextsharp/
2. Reference the dynamic link library
Unzip itextsharp-dll-core.zip from the downloaded package to get itextsharp.dll, and add a reference to itextsharp.dll in the project. Then import the following 3 namespaces in the source file:
using iTextSharp;
using iTextSharp.text;
using iTextSharp.text.pdf;
3. Use the API; see the code:
    using System;
    using System.IO;
    using iTextSharp.text.pdf;

    private string OnCreated(string filepath)
    {
        try
        {
            string pdfFileName = filepath;
            PdfReader pdfReader = new PdfReader(pdfFileName);
            int numberOfPages = pdfReader.NumberOfPages;
            string text = string.Empty;
            // Page numbers start at 1.
            for (int i = 1; i <= numberOfPages; ++i)
            {
                // GetPageContent returns the raw content stream of the page.
                byte[] bufferOfPageContent = pdfReader.GetPageContent(i);
                text += System.Text.Encoding.UTF8.GetString(bufferOfPageContent);
            }
            pdfReader.Close();
            return text;
        }
        catch (Exception ex)
        {
            // Log the failing file and the exception to MyLog.log in the application base directory.
            StreamWriter wlog = File.AppendText(System.AppDomain.CurrentDomain.SetupInformation.ApplicationBase + "\\MyLog.log");
            wlog.WriteLine("Error file: " + filepath + " Cause: " + ex.ToString());
            wlog.Flush();
            wlog.Close();
            return null;
        }
    }
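One caveat about the listing above: GetPageContent returns the page's raw content stream, so decoding those bytes directly usually gives PDF drawing operators mixed in with the text. If the referenced itextsharp.dll is a 5.x build (an assumption, older builds do not ship this class), the PdfTextExtractor class in the iTextSharp.text.pdf.parser namespace can interpret the stream instead. The following is only a minimal sketch under that assumption; the class name PdfTextSketch and the method name ExtractText are hypothetical and are not from the original article.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper: assumes an iTextSharp 5.x assembly is referenced.
    public static string ExtractText(string filepath)
    {
        PdfReader reader = new PdfReader(filepath);
        try
        {
            StringBuilder text = new StringBuilder();
            // Page numbers in iTextSharp start at 1.
            for (int page = 1; page <= reader.NumberOfPages; ++page)
            {
                // Interprets the page's content stream and returns the text it draws.
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally
        {
            // Close the reader even if extraction throws.
            reader.Close();
        }
    }
}
```

It would be called as string text = PdfTextSketch.ExtractText(filepath); and the error logging from the original listing can be wrapped around that call unchanged.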