Data extraction tools
Supports text extraction from almost all common file formats, such as Office documents, PDF, email, and compressed files, as well as from email attachments, files inside archives, and embedded files.
DMCTextFilter is a general-purpose library developed by Beijing Hongyingfeng Software Co., Ltd. It strips the special control information from data in a wide range of document formats, including embedded OLE objects, and quickly extracts the plain text. This lets you centrally manage, edit, search, and browse document resources of many kinds.
The product follows a multi-language, multi-platform, multi-threaded design. It supports several languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean), several operating systems (Windows, Solaris, Linux, IBM AIX, Macintosh, HP-UNIX), and several character encodings (GBK, GB18030, Big5, ISO-8859-1, KS X 1001, Shift_JIS, WINDOWS31J, EUC-JP, ISO-10646-UCS-2, ISO-10646-UCS-4, UTF-16, UTF-8, etc.).
The API provides functions for file format recognition, text extraction, file attribute extraction, page extraction, and text extraction from password-protected PDF files. Users can integrate the library into their own applications for secondary development and call these APIs to extract plain text from data in many document formats. According to the vendor, the product is widely used in China and abroad and is well regarded by users for its performance and quality.
Features:
1. Automatic Identification of file formats
The product identifies the name and version of the application that generated a file by parsing the file's internal information. It does not depend on the file extension, so it can correctly identify the file format and the corresponding version. Recognized formats include Microsoft Office, RTF, PDF, Visio, Outlook EML and MSG, Lotus1-2-3, HTML, AutoCAD DXF and DWG, IGES, PageMaker, ClarisWorks, AppleWorks, XML, WordPerfect, Mac Write, Works, Corel Presentations, QuarkXpress, DocuWorks, WPS, the compressed formats LZH/ZIP/RAR, and itaro, OASYS, and other file formats.
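The idea of identifying a format from file contents rather than from the extension can be illustrated with a minimal Python sketch. This is not the DMCTextFilter API; the names `SIGNATURES`, `sniff_format`, and `pdf_version` are hypothetical, and the table covers only a handful of well-known leading byte signatures.

```python
# Hypothetical signature table: a few well-known magic numbers only.
# A real format recognizer would inspect far more than the leading bytes.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"Rar!\x1a\x07", "RAR"),
    (b"{\\rtf", "RTF"),
]


def sniff_format(data: bytes) -> str:
    """Return a format label based on the leading bytes, ignoring any file name."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def pdf_version(data: bytes):
    """Read the version from a PDF header such as b'%PDF-1.7'; None if not a PDF."""
    if not data.startswith(b"%PDF-"):
        return None
    # The three bytes after '%PDF-' hold the version, e.g. b'1.7'.
    return data[5:8].decode("ascii", errors="replace")
```

Because only the bytes are inspected, a PDF saved with a `.txt` extension would still be reported as PDF. ZIP-based formats (such as newer Office files) and OLE2-based formats (such as legacy Office files and MSG) need a further look inside the container to tell them apart.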
2. Text Extraction
Even if the application that created a file is not installed on the system, you can extract text data from the specified file or from OLE objects embedded in it.
3. File property Extraction
Extracts file attributes from the specified file.
4. Page Extraction
Extracts text data from the specified page.
5. Extract encrypted PDF file text
Extracts text data from a PDF file that requires a password to open.
6. Stream Extraction
Extracts text data from a specified file, or from an OLE object embedded in the file, into a stream.
7. Supported languages
This product supports the following languages: English, Simplified Chinese, Traditional Chinese, Japanese, and Korean.
8. Supported types of character sets
When extracting text, you can specify one of the following character sets for the output text file (other character sets can also be supported, but require custom development): GBK, GB18030, Big5, ISO-8859-1, KS X 1001, Shift_JIS, WINDOWS31J, EUC-JP, ISO-10646-UCS-2, ISO-10646-UCS-4, UTF-16, UTF-8, etc.
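As a rough illustration of choosing an output character set for extracted text, the following Python sketch encodes a string with a codec selected by name. It does not use DMCTextFilter; the `CODECS` mapping and the function `encode_extracted_text` are assumptions made for this example, and only a subset of the character sets listed above is mapped.

```python
# Hypothetical mapping from character-set names in the list above to
# Python codec names. Only a subset is covered here.
CODECS = {
    "GBK": "gbk",
    "GB18030": "gb18030",
    "Big5": "big5",
    "ISO-8859-1": "iso-8859-1",
    "Shift_JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "UTF-16": "utf-16",
    "UTF-8": "utf-8",
}


def encode_extracted_text(text: str, charset: str) -> bytes:
    """Encode already-extracted text in the requested character set.

    Characters that the target set cannot represent are replaced with '?'
    instead of raising an error.
    """
    codec = CODECS[charset]  # raises KeyError for an unmapped character set
    return text.encode(codec, errors="replace")
```

For example, `encode_extracted_text("中文", "GBK")` yields bytes that decode back to the same string with the `gbk` codec, while encoding the same text as ISO-8859-1 replaces each unrepresentable character with `?`.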
The DMCTextFilter V4.2 general-purpose text extraction library is currently used in digital libraries, search engines, full-text search, databases, and other fields. According to the vendor, it is used by many well-known enterprises around the world and is highly rated by users for its performance and quality. The company states that it will continue to provide reliable, high-quality products and technical services to meet users' needs.
In practice, the company's general-purpose text extraction software has been applied in many fields, including information resource development and utilization, intelligent search engines, intelligence analysis and services, information security, enterprise knowledge portals, digital libraries, and e-commerce.
Its main application benefits are currently as follows:
1) It provides intelligent processing tools for massive unstructured resources, improving the efficiency with which information resources are processed. It also gives users of government information resources a means of smart retrieval, mining, and analysis, increasing the added value of those resources.
2) The software has been applied in the search engines of national departments and in vertical search engine services in multiple industries, laying a foundation for making vertical search services more intelligent, industry-oriented, and knowledge-based.
3) The software provides technology for organizations engaged in content security management, which can reduce regulatory costs and improve regulatory efficiency.
4) The software can serve as a basic component of information resource utilization and knowledge management applications, providing text conversion technology for the processing, analysis, and servicing of enterprise information resources.
Conversion tools supporting Oracle data table synchronization or data extraction
Kettle is an open-source ETL tool written in Java. It runs on Windows, Linux, and Unix. It is portable ("green") software that requires no installation, and its data extraction is efficient and stable.
For details, refer to Baidu Baike (Baidu Encyclopedia) at baike.baidu.com/view/2486337.htm.
Data Retrieval (Data Mining) Tool
A simple Kettle transformation like this can certainly handle the problem. The failure is most likely caused by a problem in your design file or by steps that are not configured correctly. It is worth working through, since open-source ETL software such as Kettle is widely adopted.