Use the Minidx extract-text COM component to read text from Word, XLS, PDF, and other files

By Minidxer | December 31, 2007

Many people are amazed that Google, Baidu, and other search engines can "find" the Word .doc, Excel .xls, PDF, and other files on a server, and many have emailed me asking how the Minidx File Manager reads text content from files in so many formats. Implementing this on Linux is more complicated, but on Windows it is easy to do with the Microsoft IFilter interface used by the Indexing Service. Minidx supports more than 200 file formats through IFilter interfaces. The basic idea is to write a COM component that finds, for a given file format, the path of the DLL on the system that implements the filter interface, and then calls it to extract the text.
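
The lookup step can be sketched roughly as follows. As far as I know, Windows registers the IFilter for a file extension through a chain of entries under HKEY_CLASSES_ROOT: the extension's PersistentHandler value, then that handler's PersistentAddinsRegistered entry for the IFilter interface ID, then the filter's InprocServer32 value, which holds the DLL path. This is a minimal, portable C++ sketch and not the actual Minidx code: the registry is simulated with a std::map, the names readDefault, findFilterDll, and sampleRegistry are hypothetical, and the GUIDs and DLL path in the sample data are made-up placeholders. Some formats are registered indirectly through the document class rather than the extension; that route is left out here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simulated registry: key path -> default value. On a real system these
// values would be read from HKEY_CLASSES_ROOT with the Win32 registry API.
using Registry = std::map<std::string, std::string>;

// Interface ID of IFilter, used as a subkey name in the lookup chain.
const std::string kIFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

// Returns the default value stored under `key`, or an empty string if absent.
std::string readDefault(const Registry& reg, const std::string& key) {
    Registry::const_iterator it = reg.find(key);
    return it == reg.end() ? std::string() : it->second;
}

// Hypothetical helper: resolve an extension such as ".doc" to the path of the
// DLL implementing its IFilter. Returns an empty string if any link is missing.
std::string findFilterDll(const Registry& reg, const std::string& ext) {
    // 1. Extension -> persistent handler GUID.
    const std::string handler = readDefault(reg, ext + "\\PersistentHandler");
    if (handler.empty()) return std::string();
    // 2. Persistent handler -> CLSID of the registered IFilter implementation.
    const std::string filterClsid = readDefault(
        reg, "CLSID\\" + handler + "\\PersistentAddinsRegistered\\" + kIFilterIid);
    if (filterClsid.empty()) return std::string();
    // 3. Filter CLSID -> in-process server DLL path.
    return readDefault(reg, "CLSID\\" + filterClsid + "\\InprocServer32");
}

// Sample data with placeholder GUIDs and a placeholder DLL path.
Registry sampleRegistry() {
    Registry reg;
    reg[".doc\\PersistentHandler"] = "{HANDLER-GUID}";
    reg["CLSID\\{HANDLER-GUID}\\PersistentAddinsRegistered\\" + kIFilterIid] = "{FILTER-GUID}";
    reg["CLSID\\{FILTER-GUID}\\InprocServer32"] = "C:\\placeholder\\docfilter.dll";
    return reg;
}

// Usage example: ".doc" resolves through all three links, ".xyz" does not.
void demoFindFilterDll() {
    const Registry reg = sampleRegistry();
    assert(findFilterDll(reg, ".doc") == "C:\\placeholder\\docfilter.dll");
    assert(findFilterDll(reg, ".xyz").empty());
}
```

Calling demoFindFilterDll() from main exercises the sketch. Once the DLL path is known, a real component would load the filter from that DLL and ask it for the document's text.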

 


 

 

The basic concepts of IFilter were covered in my earlier introduction to Microsoft IFilter, so I will not repeat them here. The CodeProject article "Using IFilter in C#" provides a C# implementation with downloadable source code. For performance reasons, this part of the Minidx File Manager is implemented entirely in C++ and packaged as a COM component. The rest of this post describes how to call this COM component from your own program to read text from DOC, XLS, PDF, MSG, and other files.

● Contents of the demo archive (the archive can be downloaded from here)

Demo_vb.zip
  |-- demo_vb.sln                       solution file
  |-- demo_vb.suo                       user configuration file
  |-- demo_vb
        |-- bin
        |     |-- Debug\demo_vb.exe     debug build
        |     |-- Release\demo_vb.exe   release build
        |-- My Project                  (ignore)
        |-- obj                         temporary files generated during compilation
        |-- demo_vb.vbproj              project file
        |-- ExtractText.dll             text-extraction COM component
        |-- Form1.Designer.vb           demo GUI file
        |-- Form1.resx                  resource file
        |-- Form1.vb                    demo source code
        |-- run.bat                     COM component registration command

● Running the demo

① Double-click run.bat to register the COM component.

② Run demo_vb.exe from the demo_vb\bin\Release or demo_vb\bin\Debug directory.

③ Click "File" and choose the file to extract text from.

④ Select a file and view the extracted text. (The screenshots in the original post showed results for Chinese, Japanese, and English Word documents.)

Note: You must have read and write permission on the file you extract text from. An error may occur when extracting text from a file that is currently being edited.

 

● Demo code walkthrough

First, add a reference to the DLL component in your project. The code below is simple and commented, so it should not be hard to follow, and I will not walk through it here. If you have any questions, we can discuss them together :)

    '-----------------------------
    ' Desc: extract text from the selected file
    '-----------------------------
    Private Sub SelectFileDialog_FileOk(ByVal sender As System.Object, ByVal e As System.ComponentModel.CancelEventArgs) Handles SelectFileDialog.FileOk
        On Error GoTo ErrH

        ' Get the full path of the selected file
        Dim sFile As String = SelectFileDialog.FileName

        ' Show the file name
        txtFilePath.Text = sFile

        ' Create the ExtractText component and extract the text
        Dim te As ExtractTextLib.TextExtractor = New ExtractTextLib.TextExtractor
        Dim sText As String
        sText = te.ExtractText(sFile, MAX_EXTRACT_TEXT_SIZE)
        Err.Clear()

        'Me.FileLength.Text = Len(sText).ToString()
        Me.ExtractText.Text = sText

        Exit Sub
    ErrH:
        MsgBox("Error extracting text from '" & sFile & "' Err = " & Err.Number & " - " & Err.Description & " in " & Err.Source)
    End Sub
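
What happens inside ExtractText is not shown in this post, so the following is only a rough illustration of how an IFilter-based extractor typically works: the filter presents the document as a sequence of chunks, the caller reads each chunk's text piece by piece, and extraction stops at the end of the document or once a size limit is reached. The sketch models that loop in portable C++ with a stand-in FakeFilter class instead of the real COM IFilter interface. FakeFilter, extractText, the newline between chunks, and treating a limit of 0 as "no limit" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a real IFilter: hands out a document as text chunks, each read
// in small pieces, loosely mirroring IFilter::GetChunk and IFilter::GetText.
class FakeFilter {
public:
    explicit FakeFilter(std::vector<std::string> chunks)
        : chunks_(std::move(chunks)), index_(0), offset_(0), started_(false) {}

    // Advance to the next chunk; false plays the role of FILTER_E_END_OF_CHUNKS.
    bool getChunk() {
        if (started_) ++index_;
        started_ = true;
        offset_ = 0;
        return index_ < chunks_.size();
    }

    // Copy up to `bufSize` characters of the current chunk into `out`;
    // false plays the role of FILTER_E_NO_MORE_TEXT.
    bool getText(std::size_t bufSize, std::string& out) {
        const std::string& chunk = chunks_[index_];
        if (offset_ >= chunk.size()) return false;
        out = chunk.substr(offset_, bufSize);
        offset_ += out.size();
        return true;
    }

private:
    std::vector<std::string> chunks_;
    std::size_t index_;
    std::size_t offset_;
    bool started_;
};

// Hypothetical extraction loop: concatenate all text, one chunk per line, and
// stop once `maxSize` characters are collected (0 means no limit in this sketch).
std::string extractText(FakeFilter& filter, std::size_t maxSize) {
    std::string result, piece;
    while (filter.getChunk()) {
        while (filter.getText(4, piece)) {  // tiny buffer to exercise the inner loop
            result += piece;
            if (maxSize != 0 && result.size() >= maxSize) {
                result.resize(maxSize);     // cut off at the limit and stop early
                return result;
            }
        }
        result += '\n';                     // separate chunks with a newline
    }
    return result;
}

// Usage example: full extraction, then extraction capped at 8 characters.
void demoExtractText() {
    FakeFilter full(std::vector<std::string>{"Hello", "World!"});
    assert(extractText(full, 0) == "Hello\nWorld!\n");
    FakeFilter capped(std::vector<std::string>{"Hello", "World!"});
    assert(extractText(capped, 8) == "Hello\nWo");
}
```

Capping the output inside the loop, rather than truncating afterwards, means a very large document is never read in full, which is one plausible reason for passing a size limit such as MAX_EXTRACT_TEXT_SIZE to the component.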

The project's DLL and demo source code can be downloaded from here. A C/C++ calling demo is still in preparation. Interested readers are welcome to write and share demos for PHP, Delphi, C#, and other languages; the principle is the same. If you have any questions, leave a comment here or post to the Minidx help forum. This module may be used for any commercial or non-commercial purpose. If you like, you can email me to let me know that the module is used in your project, so that when you succeed I can boast to my friends about it. Of course, this is not required :)

Topics: Minidx related | 29 comments | 830 views
Tags: C++, COM component, DOC, extract text, IFilter, Indexing Service, Minidx, PDF, VB.NET, XLS, search engine

 

 


  1. Qinai - 12/31/2007 at 11:41 pm

    I was touched and surprised when I saw your email. No, thank you!

    Happy New Year! I will come here often to learn from you. Thank you!

  2. Minidxer - 12/31/2007 at 11:50 pm

    @Qinai
    Thank you ~

  3. Sorryle - 01/1/2008 at AM

    No technology talk today, just New Year's greetings. Happy New Year, Minidx!

  4. Minidxer - 01/1/2008 at 8:47 AM

    @Sorryle
    Thank you ~ Happy New Year, Sorryle

  5. Yiyix - 01/1/2008 at pm

    I don't understand any of this. New Year's greetings ~

  6. Minidxer - 01/1/2008 at :22 pm

    @Yiyix
    Haha, New Year's greetings ~

  7. Figure - at pm

    Very good. I will read it properly later.

  8. Minidxer - 01/1/2008 at 4:58 pm

    @Graph
    Thank you ~ You are welcome to.

  9. Tip - 01/2/2008 at 9:39 AM

    So that's how it works, hoho.
    Learning from this.

  10. PP - 01/9/2008 at 9:42 pm

    What is the relationship between extract.dll and Interop.ExtractTextLib.dll?
    The COM component comes with no interface description, so there is nothing to guide development.

  11. Minidxer - 01/10/2008 at 12:53 AM

    @PP
    See here: http://blog.minidx.com/2008/01/10/373.html
    Interop.ExtractTextLib.dll is generated during compilation and can be deleted.

  12. Skybright - 04/14/2008 at :51 pm

    Why do I get errors when compiling the VC2005 source code?
    --------------------Configuration: demo_vc - Win32 Debug--------------------
    Compiling...
    demo_vc.cpp
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(16) : error C2059: syntax error : '&&'
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(16) : error C2143: syntax error : missing ';' before '}'
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(16) : error C2143: syntax error : missing ';' before '}'
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(17) : error C2143: syntax error : missing ';' before '{'
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(17) : error C2447: missing function header (old-style formal list?)
    E:\My Documents\demo_vc2005_wrap\demo_vc2005_wrap\demo_vc\demo_vc.cpp(17) : error C2143: syntax error : missing ';' before '}'
    Error executing cl.exe.

  13. Skybright - 04/14/2008 at :53 pm

    I have only just started learning about components and am not very familiar with them. Thank you for your advice!

  14. Minidxer - 04/14/2008 at 4:00 pm

    @Skybright

    Are you using VS2005?
    It compiles without problems here.

    In fact, you can create a project yourself, copy
    ExtractText.dll
    ExtractText.h
    ExtractText_i.c
    and the few relevant lines of code from demo_vc.cpp into it, and register the component as described above.

  15. Skybright - 04/14/2008 at 4:33 pm

    I will try it. Thank you!

  16. Heroyo - 08/19/2008 at pm

    Hello, I have used your ExtractText.dll component in the Lucene full-text indexing feature of our system, where it extracts text from attachments in TXT and Office formats. It is very kind of you to publish such a useful component on the Internet, and it has been a great help.
    In the program, MAX_EXTRACT_TEXT_SIZE is set to 64 MB as the upper limit on the text content retrieved, and 64 MB is not small. What is the actual upper limit? Or why is it 64 MB?
    Many thanks!

  17. Heroyo - 08/20/2008 at 12:11 pm

    Another question: how was ExtractText.dll produced?
    1. Is it a Microsoft component? 2. Or a component you created yourself?

  18. Minidxer - 08/20/2008 at 12:16 pm

    @Heroyo
    For the first question, see:
    http://forum.minidx.com/thread-83-1-1.html

    For the second question, see the description at the beginning of the post:
    ExtractText.dll is a COM component written on top of Microsoft IFilter.

  19. Icen - 11/19/2008 at pm

    Hello! How can I read and display Word documents in Flash AS3.0?

  20. Sky - 06/5/2009 at 4:33 pm

    Is there a way to extract the contents of an Excel sheet row by row, in order?

    The current method extracts everything at once, and it is unclear in what order the content comes out.

  21. Lwp - 07/15/2009 at pm

    Hello:
    I sent an email to your minidxer@gmail.com mailbox. Please check it. Please help me.

  22. Pang - 08/6/2009 at 11:40 AM

    A question about reading text from a file:
    Currently, the text extraction interface is used as follows:
    ITextExtractor *te = NULL;
    HRESULT cohr = CoCreateInstance(CLSID_Textractor, NULL, CLSCTX_INPROC_SERVER, IID_ITextExtractor, (void **)&te);
    if (SUCCEEDED(cohr) && te)
    {
        te->ExtractText(filename, 0, &bstr);
        te->Release();
    }
    When the ExtractText interface is used to extract text, the file type is determined from the file name. I would like ExtractText to take an additional parameter, filetype (the file type), so that I can read text in a specified format, for example:
    parse, extract the text, and then rename the file back to a.text. This is troublesome. If the interface took a file-type parameter, I would not need to rename the file.
    I don't know whether you can provide such an interface. I currently need one for a project and would be very grateful if you could provide it!

  23. Minidxer - 08/8/2009 at 8:20 am

    @Pang
    This interface could be added, but the project is basically stalled for lack of personal time and other reasons. Sorry.

  24. Tyq - 10/12/2009 at 5:06 pm

    Could I have a VC demo? Thank you.

  25. Tyq - 10/12/2009 at 5:07 pm

    My email: tangyq169@sohu.com

  26. Minidxer - 10/12/2009 at 8:54 pm

    It can be downloaded from http://blog.minidx.com/2008/01/10/373.html.
