How to use C# to extract text from files such as Word and Excel (no Office installation required)


How to install the Indexing Service on a stripped-down Windows XP system:

Download an original system image that matches the installed system, run the setup program directly from the image, and select "Install update".

The actual effect of "Install update" is to keep the existing programs and automatically install any missing system components or services (so you can still use tools to trim unneeded services and startup items, and after installation the system patches do not have to be installed again).

The reason for installing the Indexing Service is that I have recently been using IFilter in .NET to read the text of Word files.

For details, see the open-source project "Using IFilter in C#".

This method reads the text of a Word document without requiring Office to be installed. It extracts Chinese text correctly and reads quickly, even when the document contains tables and images.

Other file types such as Excel, PPT, and TXT are also supported. (Test environments: XP Professional SP3, Server 2003 Enterprise SP1, Server 2008 Enterprise SP2)

In actual tests, I also found that extraction works even when the Indexing Service is not started. However, the Indexing Service must at least be installed on the system (on systems later than XP, the corresponding component is the Windows Search service; see the MSDN documentation).

Someone in the garden (this blog community) has previously written an article on using IFilter to extract Word text, but his program does not support Chinese punctuation. Many people have since discussed and tested the approach; see "How to extract text from a document file (such as Word or PDF) under .NET".

I used this text extraction method to implement word counting without Word's COM component (for the COM-based approach, see Microsoft.Office.Interop.Word), that is, to count the words in a document without installing Word.

Regular expression:

Regex.Matches(temstr, @"(?i)[A-Z_'0-9-]+").Count + Regex.Matches(temstr, @"[\u0391-\uffe5]").Count + Regex.Matches(temstr, @"(?i)[^A-Z_'0-9\u0391-\uffe5-]+").Count; // number of words + number of Chinese characters (including Chinese punctuation) + others

For the regular expressions, refer to the related CSDN posts.
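The counting rule above can be tried out on its own, without IFilter, by feeding it a plain string. The following is only a minimal sketch: the class name WordCounter, the method Count, and the sample text are made up for illustration, and the input string stands in for text that has already been extracted from a document.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Each run of Latin letters, digits, underscores, apostrophes or hyphens is one word.
    static readonly Regex Words = new Regex(@"(?i)[A-Z_'0-9-]+");

    // Each single character in U+0391..U+FFE5 is counted on its own
    // (this range covers Chinese characters and full-width punctuation, among others).
    static readonly Regex Wide = new Regex(@"[\u0391-\uffe5]");

    // Each run of anything else (spaces, ASCII punctuation, line breaks) is counted once.
    static readonly Regex Others = new Regex(@"(?i)[^A-Z_'0-9\u0391-\uffe5-]+");

    // Sum of the three counts, as in the one-line expression in the article.
    public static int Count(string text)
    {
        return Words.Matches(text).Count
             + Wide.Matches(text).Count
             + Others.Matches(text).Count;
    }

    public static void Main()
    {
        // Hypothetical sample: "Hello world, " followed by three Chinese characters
        // (ni, hao, and the full-width full stop), written as escapes so that the
        // source file encoding does not matter.
        string sample = "Hello world, \u4F60\u597D\u3002";

        Console.WriteLine(Words.Matches(sample).Count);   // 2
        Console.WriteLine(Wide.Matches(sample).Count);    // 3
        Console.WriteLine(Others.Matches(sample).Count);  // 2
        Console.WriteLine(Count(sample));                 // 7
    }
}
```

Note that the "others" term also counts each run of whitespace, so the total is not a pure word count. If only words and Chinese characters are wanted, the third term can be dropped.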

 
