It has been a long time since I last wrote up my own experience to share with you. Partly I have been busy, and partly I have been lazy and did not summarize things in time. Practice is the source of experience and summarizing is the basis for improvement, so reflection is worthwhile in any case. Today I mainly studied two PDF class libraries, iTextSharp and PDFBox. My starting point was to implement retrieval of PDF documents: we need to extract the text content from the PDF documents and then implement the search through regular expression matching.
The file retrieval method described in the article "File search systems similar to Windows Search" is very good, but it does not support Chinese retrieval in PDF, because the iTextSharp library it calls does not support Chinese very well. The GetPageContent() method of the PdfReader class cannot return Chinese characters correctly. My tests showed that this is not a simple encoding problem. A way to extract text from PDF was therefore urgently needed.
I first looked into iTextSharp. The DLL can be downloaded from http://sourceforge.net/projects/itextsharp/, where there are also many simple examples of generating PDF documents (download the iTextSharp examples). While studying them I found that Chinese content is not output by default. Searching online showed that this is because the font library is missing. There are two solutions:
1. Specify the system font library and create the font used in the PDF file. See http://unruledboy.cnblogs.com/Skins/ChinaHeart/Controls/archive/2005/08/30/225984.html
// Requires: using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
Document document = new Document(PageSize.A4, 50, 50, 50, 50);
try
{
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream("chap11.pdf", FileMode.Create));
    // The following line would encrypt the PDF file.
    // writer.SetEncryption(PdfWriter.STRENGTH40BITS, "654321", "654321", PdfWriter.AllowCopy);
    document.Open();
    // Specify the font file and create a font
    BaseFont baseFont = BaseFont.CreateFont(
        @"C:\Windows\Fonts\simhei.ttf",
        BaseFont.IDENTITY_H,
        BaseFont.NOT_EMBEDDED);
    iTextSharp.text.Font font = new iTextSharp.text.Font(baseFont, 9);
    // Specify the font of the output content
    document.Add(new Paragraph("this document is top secret!", font));
    document.Close();
}
catch (Exception de)
{
    Console.WriteLine(de.StackTrace);
}
2. Download the extension libraries iTextAsianCmaps.dll and iTextAsian.dll from http://sourceforge.net/projects/itextsharp/; they add support for Asian fonts.
The download page is as follows:
/// <summary>
/// Creates a Chinese font.
/// </summary>
/// <returns>A font that can render Simplified Chinese text.</returns>
public static iTextSharp.text.Font CreateChineseFont()
{
    BaseFont.AddToResourceSearch("iTextAsian.dll");
    BaseFont.AddToResourceSearch("iTextAsianCmaps.dll");
    // "STSong-Light", "UniGB-UCS2-H"
    BaseFont baseFT = BaseFont.CreateFont("STSong-Light", "UniGB-UCS2-H", BaseFont.EMBEDDED);
    iTextSharp.text.Font font = new iTextSharp.text.Font(baseFT);
    return font;
}
"UniGB-UCS2-H" is the encoding for Simplified Chinese, "STSong-Light" is the font name, and BaseFont.EMBEDDED asks for the font to be embedded in the document.
Next, I tried to specify the font library when reading with iTextSharp's reader classes, but unfortunately there is no corresponding method. See http://www.cnblogs.com/diction/articles/1120984.html (the extracted text does not support Chinese). Even if there were such a method, it would not be flexible, because you cannot predict which fonts a PDF document uses, and a single document may use several. Later, searching the web, I found that iTextSharp's strength in PDF handling is creating PDF documents.
Demand is the motivation for learning and work
My original goal was to find a way to extract the content of a PDF document as text. I then turned to the article "How to parse PDF files", which describes how to extract text from a PDF file and solved my problem. I will repost that article separately, hoping that users who cannot access foreign websites can also read it. PDFBox can be downloaded from http://sourceforge.net/projects/pdfbox/files/. After downloading and unzipping it, the contents are very rich,
and all required DLL files are included in the bin folder.
"PDFBox is a Java PDF library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file."
PDFBox is an open-source Java project. It can be used from .NET through the open-source IKVM.NET project (http://www.ikvm.net/), which supports calling Java class libraries from .NET.
IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:
- A Java Virtual Machine implemented in .NET
- A .NET implementation of the Java class libraries
- Tools that enable Java and .NET interoperability
Learning IKVM.NET is helpful for using Java class libraries in .NET. In fact, IKVM.Runtime.dll encapsulates the runtime environment of the Java class library.
The DLLs to reference are: FontBox-0.1.0-dev.dll, IKVM.GNU.Classpath.dll, IKVM.Runtime.dll, PDFBox-0.7.3.dll
For PDFBox example code, see: http://www.cnblogs.com/wuhenke/archive/2010/04/16/1713949.html
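// Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    PDFTextStripper stripper = new PDFTextStripper();
    string text = stripper.getText(doc);
    doc.close(); // release the underlying file
    return text;
}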
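To tie this back to the original goal, the text returned by the extraction method above can then be searched with a regular expression. The following is only a minimal sketch, assuming the text has already been extracted into a string; the class and method names (PdfTextSearch, FindMatches) are made up for illustration and are not part of PDFBox or iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of PDFBox or iTextSharp.
public static class PdfTextSearch
{
    // Returns the start index of every case-insensitive match of
    // `pattern` inside `text` (the text extracted from a PDF).
    public static List<int> FindMatches(string text, string pattern)
    {
        List<int> positions = new List<int>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

Because .NET strings are Unicode, the same call should work for Chinese keywords as long as the extraction step returned the Chinese characters correctly.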
PDFBox is very powerful and well worth learning.
References:
http://www.codeproject.com/kb/cpp/ExtractPDFText.aspx?df=100&forumid=47947
http://www.codeproject.com/KB/string/pdf2text.aspx
http://www.cnblogs.com/hardrock/
http://www.ikvm.net/