Toxy Beginner's Guide

Produced by Neuzilla

Official website: http://toxy.codeplex.com
QQ Group: 297128022

What is Toxy for? It is a file extraction framework for the .NET platform, designed mainly to solve the problem of extracting content from various formats such as PDF, DOC, DOCX, XLS, and XLSX. Although it supports many formats, it is very convenient to use, because Toxy makes the complex extraction process transparent: users do not need to know how the content is pulled out. That is the main point of Toxy.

Another major goal of Toxy is to replace IFilter as a cross-platform .NET data extraction solution that also supports Mono on Linux. Currently most test cases run on Mono; a few do not yet, and these are being improved gradually.

At another level, Toxy can transform file data into unified, structured data. Currently, Toxy supports the following structures:

string - plain text

ToxyDom - DOM structure

ToxySpreadsheet - table structure similar to Excel

ToxyDocument - descriptive text structure similar to Word

ToxyEmail - message structure, including recipients, senders, message contents, attachments, etc.

ToxyBusinessCard - business card structure

ToxyMetadata - metadata structure, which mainly contains the property information of the file, such as author, title, photo size, resolution, etc.

File formats Toxy currently supports and what can be extracted

This table is current as of Toxy 1.4.

File format                          Structured objects that can be extracted (result types)
txt                                  string
xml                                  ToxyDom
csv                                  string, ToxySpreadsheet
rtf                                  string
pdf                                  string, ToxyDocument
htm, html                            string, ToxyDom
vcf                                  string, ToxyBusinessCard
zip                                  string
mp3, ape, wav, flac, aif             ToxyMetadata
jpeg, jpg, gif, tiff, png            ToxyMetadata
eml                                  string, ToxyEmail
cnm                                  ToxyEmail
xls, xlsx                            string, ToxySpreadsheet, ToxyMetadata
ppt, pptx                            ToxyMetadata
doc, docx                            string, ToxyDocument, ToxyMetadata
vsd, pub, shw, sldprt, pubx, vsdx    ToxyMetadata

How to use Toxy

Using Toxy really is very simple, and that is no exaggeration. See the following example:

    ParserContext context = new ParserContext("test.xlsx");
    ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
    ToxySpreadsheet ss = parser.Parse();
    // process the extracted data

Here ss, the ToxySpreadsheet instance, holds the data extracted from the Excel file and can be used directly. ParserContext describes the extraction context: it tells Toxy the path of the file to extract and any related parameters. ParserFactory is a factory class responsible for instantiating all parsers; it automatically finds the appropriate parser based on the extension of the file passed in.
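
The idea of choosing a parser from the file extension can be illustrated with a small stand-alone sketch. It does not use Toxy and is not Toxy's actual implementation: the registry, the CreateParser function, and the strings it returns are all hypothetical, and only the dispatch-by-extension pattern is the point.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical registry: file extension -> function that stands in for a parser.
// These entries are illustrative only and are not Toxy's real internals.
var registry = new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
{
    [".xlsx"] = path => "spreadsheet parser for " + path,
    [".csv"]  = path => "spreadsheet parser for " + path,
    [".pdf"]  = path => "text parser for " + path,
};

// Look up the stand-in parser by the extension of the incoming file path.
string CreateParser(string path)
{
    string ext = Path.GetExtension(path);
    if (!registry.TryGetValue(ext, out var create))
        throw new NotSupportedException("No parser registered for '" + ext + "'");
    return create(path);
}

Console.WriteLine(CreateParser("test.xlsx"));  // prints "spreadsheet parser for test.xlsx"
Console.WriteLine(CreateParser("REPORT.PDF")); // extension match ignores case in this sketch
```

In this sketch the caller never names a concrete parser; supporting a new format only means registering one more extension.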

The following shows an extremely simple way to extract the text of a PDF document:

    string path = TestDataSample.GetPdfPath("Sample1.pdf");
    var parser = new PDFTextParser(new ParserContext(path));
    string result = parser.Parse();

This returns a string, that is, the contents of the PDF document are extracted directly into a string. This kind of code is most often used together with a search engine such as Lucene.NET.

Here is another example, this time with ToxyMetadata:

    string path = Path.GetFullPath(TestDataSample.GetOLE2Path("TestEditTime.doc"));
    ParserContext context = new ParserContext(path);
    IMetadataParser parser = ParserFactory.CreateMetadata(context);
    ToxyMetadata x = parser.Parse();

This extracts the metadata of the .doc file, such as the application that created it (not necessarily Word), author, title, company, and so on.

In theory, anything listed in the details of the file's properties can be extracted.

Extraction parameters of Toxy parsers

Toxy parsers not only provide basic extraction but also let you choose what is extracted, which is done through the properties of ParserContext.

Here is an example of Excel extraction parameters:

    ParserContext context = new ParserContext(TestDataSample.GetExcelPath(filename));
    ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
    ToxySpreadsheet ss = parser.Parse();
    // extract the sheet header
    parser.Context.Properties.Add("ExtractSheetHeader", "1");
    // extract the sheet footer
    parser.Context.Properties.Add("ExtractSheetFooter", "1");
    ToxySpreadsheet ss2 = parser.Parse();

Here ExtractSheetHeader and ExtractSheetFooter are parameters defined by the extractor. They must be spelled exactly, otherwise they have no effect. The value 1 means on; if you prefer, you can also use on or true, since the parser recognizes all three ways of expressing true. To express false, use 0, off, or false.
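
To make the on/off convention concrete, here is a minimal stand-alone sketch of how such string switches could be interpreted. IsSwitchedOn is a hypothetical helper, not a Toxy API. Treating a missing key as off, ignoring the case of the value, and throwing on an unrecognized value are assumptions of this sketch and may differ from what Toxy actually does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Toxy): interpret a property value as a switch.
static bool IsSwitchedOn(IDictionary<string, string> properties, string name)
{
    // Assumption of this sketch: a missing key means "off".
    if (!properties.TryGetValue(name, out var value) || value == null)
        return false;

    switch (value.Trim().ToLowerInvariant())
    {
        case "1":
        case "on":
        case "true":
            return true;
        case "0":
        case "off":
        case "false":
            return false;
        default:
            // Assumption of this sketch: unknown values are rejected.
            throw new FormatException("Unrecognized switch value: " + value);
    }
}

var properties = new Dictionary<string, string>
{
    ["ExtractSheetHeader"] = "1",
    ["ExtractSheetFooter"] = "off",
};

Console.WriteLine(IsSwitchedOn(properties, "ExtractSheetHeader")); // prints "True"
Console.WriteLine(IsSwitchedOn(properties, "ExtractSheetFooter")); // prints "False"
Console.WriteLine(IsSwitchedOn(properties, "extractsheetheader")); // prints "False": the key is misspelled
```

The last line shows why exact spelling matters: a default dictionary compares keys case-sensitively, so a misspelled parameter name is simply never found.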

In addition, the spreadsheet parser supports filling blank cells (FillBlankCells), showing calculated formula results (ShowCalculatedResult), including comments (IncludesComments), and other options. Feel free to experiment with them.

Of course, each extractor accepts different parameters. This article does not list them all; future articles will describe the parameters of each extractor and the content they extract in detail.

Toxy Advanced Extension Features

Besides basic extraction, Toxy also provides some advanced object conversion services, such as ToxySpreadsheet to DataSet, which lets you convert Excel data directly into a DataSet for easy access and processing. The code is very simple, as shown below:

    ParserContext c = new ParserContext(@"C:\employee.xls");
    var parser = ParserFactory.CreateSpreadsheet(c);
    var spreadsheet = parser.Parse();
    DataSet ds = spreadsheet.ToDataSet();

That is only 4 lines of code. Pretty cool, isn't it? In addition, ToxyTable, the sub-structure of ToxySpreadsheet, supports a ToDataTable operation. The usage is similar: just call it directly.
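
Once the data is in a DataSet, it can be processed with the standard System.Data types alone. The sketch below builds a small DataSet by hand as a stand-in for the result of ToDataSet(), so it runs without Toxy. The sheet name, columns, and rows are made-up sample data, and the one-table-per-sheet layout is an assumption rather than something this article confirms.

```csharp
using System;
using System.Data;

// Stand-in for the result of spreadsheet.ToDataSet().
// Assumption: one DataTable per sheet; the contents are made-up sample data.
var ds = new DataSet();
var sheet = ds.Tables.Add("Sheet1");
sheet.Columns.Add("Name", typeof(string));
sheet.Columns.Add("Age", typeof(string));
sheet.Rows.Add("Alice", "30");
sheet.Rows.Add("Bob", "25");

// Walk every table and row using only standard System.Data calls.
foreach (DataTable table in ds.Tables)
{
    Console.WriteLine(table.TableName + ": " + table.Rows.Count + " rows");
    foreach (DataRow row in table.Rows)
    {
        // ItemArray holds the cell values of the row in column order.
        Console.WriteLine(string.Join(", ", row.ItemArray));
    }
}
```

The same loop over Rows applies to a single DataTable, for example one obtained from ToDataTable.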

Toxy Functional Outlook

The goal of Toxy 1.x is to support a sufficient number of file formats and convert them into unified structures for extraction.

The goal of Toxy 2.x-3.x is to support conversion between similar file types, such as Excel to CSV, Excel to HTML, and Word to PDF. Of course, this will be a long, slow road.

From then on, the .NET camp will no longer be laughed at by the Java camp for not having even one decent extraction framework. The Java camp has Tika, and we in .NET have Toxy. Oh yeah!
