Produced by Neuzilla
Official website: http://toxy.codeplex.com
QQ Group: 297128022
What is Toxy for? It is a file extraction framework for the .NET platform, built to solve content extraction for many formats such as PDF, doc, docx, xls, and xlsx. Although it supports a lot of formats, it is extremely convenient to use, because Toxy makes the complex extraction process transparent: users do not have to know how the content is pulled out. That is the whole point of Toxy.
Another major goal of Toxy is to replace IFilter as a cross-platform .NET data extraction solution that also runs on Mono under Linux. Most test cases currently pass on Mono; a few still do not, and they are being fixed gradually.
At another level, Toxy transforms file data into unified, structured data. Currently, Toxy supports the following structures:
string - plain text
ToxyDom - DOM structure
ToxySpreadsheet - table structure similar to Excel
ToxyDocument - descriptive text structure similar to Word
ToxyEmail - message structure, including recipients, senders, message body, attachments, etc.
ToxyBusinessCard - business card structure
ToxyMetadata - metadata structure, which mainly contains the property information of the file, such as author, title, photo size, resolution, etc.
File formats Toxy currently supports and what can be extracted
This table is current as of Toxy 1.4.
File format | Structured objects that can be extracted (result types)
txt | string
xml | ToxyDom
csv | string, ToxySpreadsheet
rtf | string
pdf | string, ToxyDocument
htm, html | string, ToxyDom
vcf | string, ToxyBusinessCard
zip | string
mp3, ape, wav, flac, aif | ToxyMetadata
jpeg, jpg, gif, tiff, png | ToxyMetadata
eml | string, ToxyEmail
cnm | ToxyEmail
xls, xlsx | string, ToxySpreadsheet, ToxyMetadata
ppt, pptx | ToxyMetadata
doc, docx | string, ToxyDocument, ToxyMetadata
vsd, pub, shw, sldprt, pubx, vsdx | ToxyMetadata
How to use Toxy
Using Toxy really is very simple, and that is no exaggeration. See the following example:
    ParserContext context = new ParserContext("test.xlsx");
    ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
    ToxySpreadsheet ss = parser.Parse();
    // process the extracted data
Here the ToxySpreadsheet instance ss holds the data extracted from the Excel file and can be used directly. ParserContext describes the extraction context: it tells Toxy the path of the file to extract, plus any related parameters. ParserFactory is a factory class responsible for instantiating all parsers; it automatically picks the appropriate parser based on the extension of the file passed in.
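To make the idea of picking a parser by file extension concrete, here is a minimal sketch of that kind of factory. It is not Toxy's actual implementation: ISketchParser, SketchSpreadsheetParser, SketchCsvParser, and SketchParserFactory are hypothetical names invented for this illustration, and the registry contents are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical types for illustration only; Toxy's real classes differ.
public interface ISketchParser
{
    string Describe();
}

public sealed class SketchSpreadsheetParser : ISketchParser
{
    public string Describe() => "spreadsheet parser";
}

public sealed class SketchCsvParser : ISketchParser
{
    public string Describe() => "csv parser";
}

public static class SketchParserFactory
{
    // Maps a file extension to a function that creates the matching parser.
    // Extension lookups ignore case, so ".XLSX" and ".xlsx" are equivalent.
    private static readonly Dictionary<string, Func<ISketchParser>> Registry =
        new Dictionary<string, Func<ISketchParser>>(StringComparer.OrdinalIgnoreCase)
        {
            [".xls"] = () => new SketchSpreadsheetParser(),
            [".xlsx"] = () => new SketchSpreadsheetParser(),
            [".csv"] = () => new SketchCsvParser(),
        };

    public static ISketchParser Create(string path)
    {
        if (path == null) throw new ArgumentNullException(nameof(path));

        // Path.GetExtension returns the extension including the leading dot,
        // or an empty string when the path has none.
        string extension = Path.GetExtension(path);

        if (!Registry.TryGetValue(extension, out Func<ISketchParser> create))
            throw new NotSupportedException("No parser registered for '" + extension + "'.");

        return create();
    }
}
```

With this sketch, SketchParserFactory.Create("test.xlsx") yields a spreadsheet parser, while an unregistered extension fails early with a NotSupportedException instead of at parse time.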
The following is a very simple example that extracts the text of a PDF document:
    string path = TestDataSample.GetPdfPath("Sample1.pdf");
    var parser = new PDFTextParser(new ParserContext(path));
    string result = parser.Parse();
This returns a string: the content of the PDF document is extracted directly into a single string. This kind of code is most often used together with a search engine such as Lucene.NET.
Here is a ToxyMetadata example:
    string path = Path.GetFullPath(TestDataSample.GetOLE2Path("TestEditTime.doc"));
    ParserContext context = new ParserContext(path);
    IMetadataParser parser = ParserFactory.CreateMetadata(context);
    ToxyMetadata x = parser.Parse();
This extracts the metadata of the doc file, such as the application that created it (not necessarily Word), author, title, company, and so on.
In theory, anything listed in the details of the file's properties can be extracted.
Extraction parameters of the Toxy parsers
Toxy parsers not only provide basic extraction, they also let you choose what gets extracted. This is done through the Properties of ParserContext.
Here is an example of Excel extraction parameters:
    ParserContext context = new ParserContext(TestDataSample.GetExcelPath(filename));
    ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
    ToxySpreadsheet ss = parser.Parse();
    // Extract the sheet header
    parser.Context.Properties.Add("ExtractSheetHeader", "1");
    // Extract the sheet footer
    parser.Context.Properties.Add("ExtractSheetFooter", "1");
    ToxySpreadsheet ss2 = parser.Parse();
Here ExtractSheetHeader and ExtractSheetFooter are parameters defined by the extractor. They must be spelled exactly, otherwise they have no effect. The value 1 means on; if you prefer, you can also use on or true. The parser recognizes all three ways of writing true, and to indicate false you can use 0, off, or false.
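The on/off convention described above can be sketched as a small helper. This is only an illustration, not Toxy's code: FlagParser and IsEnabled are hypothetical names, and treating a missing key as off, ignoring letter case, and rejecting unknown values are all assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Toxy: mirrors the 1/on/true and
// 0/off/false convention described in the text.
public static class FlagParser
{
    public static bool IsEnabled(IDictionary<string, string> properties, string key)
    {
        // Assumption: a missing (or null) property means the option is off.
        if (!properties.TryGetValue(key, out string value) || value == null)
            return false;

        // Assumption: values are compared case-insensitively.
        switch (value.Trim().ToLowerInvariant())
        {
            case "1":
            case "on":
            case "true":
                return true;
            case "0":
            case "off":
            case "false":
                return false;
            default:
                // Assumption: anything else is reported rather than silently ignored.
                throw new FormatException("Unrecognized flag value '" + value + "' for '" + key + "'.");
        }
    }
}
```

Centralizing the check in one place keeps every option consistent, so "1", "on", and "true" behave identically no matter which parameter they are attached to.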
In addition, the spreadsheet parser supports filling blank cells (FillBlankCells), showing formula results (ShowCalculatedResult), including comments (IncludesComments), and other options. If you are interested, try them out.
Of course, each extractor accepts a different set of parameters. They are not listed in detail here; future articles will cover the parameters of each extractor and the content each one extracts.
Toxy Advanced Extension Features
Besides basic extraction, Toxy also provides some advanced object conversion services, such as ToxySpreadsheet to DataSet, so you can convert Excel data directly into a DataSet for easy access and processing. The code is super simple, as shown below:
    ParserContext c = new ParserContext(@"C:\employee.xls");
    var parser = ParserFactory.CreateSpreadsheet(c);
    var spreadsheet = parser.Parse();
    DataSet ds = spreadsheet.ToDataSet();
That is just 4 lines of code. Pretty cool, right? In addition, ToxyTable, the sub-structure of ToxySpreadsheet, supports a ToDataTable operation; the usage is similar, just call it directly.
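Once the data is in a DataSet, it can be processed with the standard System.Data API. The following is a minimal sketch of walking such a DataSet; DataSetSummary and Describe are hypothetical names for this illustration, and nothing here depends on Toxy itself.

```csharp
using System.Collections.Generic;
using System.Data;

// Hypothetical helper: summarizes each table in a DataSet,
// such as one produced from a spreadsheet.
public static class DataSetSummary
{
    public static List<string> Describe(DataSet dataSet)
    {
        var lines = new List<string>();
        foreach (DataTable table in dataSet.Tables)
        {
            // One line per table: name, row count, column count.
            lines.Add(table.TableName + ": " + table.Rows.Count + " rows x "
                      + table.Columns.Count + " columns");
        }
        return lines;
    }
}
```

Passing the ds variable from the example above to DataSetSummary.Describe would give one summary line per table in the DataSet.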
Toxy Functional Outlook
The goal of Toxy 1.x is to support a sufficient number of file formats and extract them into unified structures.
The goal of Toxy 2.x-3.x is to support conversion between similar file types, such as Excel to CSV, Excel to HTML, and Word to PDF. That, of course, is a long and slow road.
From now on, the .NET camp no longer has to be laughed at by the Java camp for lacking a decent extraction framework: the Java camp has Tika, and we in .NET have Toxy. Oh yeah!
Toxy Beginner's Guide