Comparison of Several Open Source HTML Parsers

Source: Internet
Author: User
15:48:46, categories: Java

Htmlparser
The package downloaded from SourceForge was alarmingly large at first sight: 5 MB for a small HTML parser. Even after unpacking it and setting aside the miscellaneous extras, the source is not small. An Ant build produces two jars, htmlparser.jar (200 KB) and htmllexer.jar (56 KB). I am interested in analyzing HTML files, so I only care about the parser. After a quick trial, it seems that htmlparser.jar can be used on its own, without the dependent libraries in the lib directory. The class structure is clear and detailed. The source directory contains several samples, which are relatively simple and easy to understand. Usage is similar to that of an XML parser: there is an event-driven interface, and it is also easy to extend it to build a DOM tree, so getting started is easy.
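To make the event-driven style concrete, here is a minimal sketch. It does not use Htmlparser's own API; as an assumption for illustration it uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`), and the class name `LinkCollector` is hypothetical. The idea is the same: the parser calls back for each tag it meets, and the callback decides what to keep.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects href values through parser callbacks.
// Uses the JDK's built-in HTML parser, not Htmlparser.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    // Called by the parser once for every start tag it encounters.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the given HTML text and returns the href of every <a> tag, in order.
    static List<String> collect(String html) {
        LinkCollector collector = new LinkCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.links;
    }

    public static void main(String[] args) {
        System.out.println(collect("<p><a href=\"a.html\">A</a> <a href='b.html'>B</a></p>"));
    }
}
```

Building a DOM-like tree on top of such callbacks would mean pushing a node on each start tag and popping it on the matching end tag.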

Jericho
A simple and small HTML parser. The download package is relatively small, and the built jar is 40 KB, much smaller than Htmlparser above. In terms of usage, Jericho does not provide a SAX-like interface and does not concern itself with detailed document structure. Jericho's core concept is the segment: a tag is a segment, and so is a stretch of content. One level below that are StartTag, EndTag, and so on. The samples that come with Jericho are also very simple to follow. However, I think people who are used to the XML way of processing will find it unfamiliar. The source code quality is average and does not read as well as Htmlparser's.
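The segment idea can be sketched in a few lines. This is not Jericho's API: `SegmentSketch`, `Segment`, and `Kind` are hypothetical names, and the regular expression is a deliberately naive stand-in for a real tag scanner. It only shows what it means to treat a document as a flat list of spans over the source text, some of which are start tags, some end tags, and some plain content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a segment-based view of HTML (not Jericho's API).
class SegmentSketch {
    enum Kind { START_TAG, END_TAG, TEXT }

    // A segment is a span [begin, end) of the source text plus its kind.
    static final class Segment {
        final Kind kind;
        final int begin;
        final int end;

        Segment(Kind kind, int begin, int end) {
            this.kind = kind;
            this.begin = begin;
            this.end = end;
        }

        String text(String source) {
            return source.substring(begin, end);
        }
    }

    // Naive tag matcher: "<" or "</", a letter, then anything up to ">".
    private static final Pattern TAG = Pattern.compile("<(/?)[A-Za-z][^>]*>");

    // Splits the source into consecutive tag and text segments.
    static List<Segment> split(String source) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = TAG.matcher(source);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                segments.add(new Segment(Kind.TEXT, last, m.start()));
            }
            Kind kind = m.group(1).isEmpty() ? Kind.START_TAG : Kind.END_TAG;
            segments.add(new Segment(kind, m.start(), m.end()));
            last = m.end();
        }
        if (last < source.length()) {
            segments.add(new Segment(Kind.TEXT, last, source.length()));
        }
        return segments;
    }

    public static void main(String[] args) {
        String source = "<b>hi</b> there";
        for (Segment s : split(source)) {
            System.out.println(s.kind + " [" + s.begin + "," + s.end + ") " + s.text(source));
        }
    }
}
```

Because every segment keeps its offsets into the original text, no tree has to be built, which is one reason this style feels unlike SAX or DOM processing.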

Nekohtml
This is built on the XNI interface of Apache Xerces-J and depends on Xerces-J. The thought of pulling in something as big as Xerces-J was off-putting enough that I gave up.

Java HTML Parser
Apart from the download link, there is no other information on the home page. It also looks rather messy, so I did not try it.

Tagsoup
The source download link on the home page was broken, so I wrote to the author. He replied quickly, saying the link had been fixed. The compiled jar is 30 KB, short and concise. Because the core code is generated from templates, a normal build is only possible in an environment with Perl. There is no documentation and no simple sample, and reading the source left me somewhat dizzy. I feel it is better suited as demonstration material for syntax analysis and state machines in a compiler course.
BTW: the home page says that the handler interface of Tagsoup is very similar to that of SAX, but it does not make clear whether it is fully compatible.
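
For reference, this is roughly what a SAX handler looks like. The sketch below does not use Tagsoup; as an assumption for illustration it runs the JDK's own XML parser over well-formed markup, and the class name `SaxHandlerSketch` is hypothetical. A parser that exposes the standard `org.xml.sax.XMLReader` interface could be given the same handler through `setContentHandler`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: records the SAX events seen while parsing.
// Uses the JDK's XML parser on well-formed input, not Tagsoup.
class SaxHandlerSketch extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    // Note: SAX may deliver one run of text in several calls; this sketch
    // simply records each non-blank chunk as it arrives.
    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }

    // Parses well-formed markup and returns the list of recorded events.
    static List<String> trace(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SaxHandlerSketch handler = new SaxHandlerSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trace("<p>Hello <b>world</b></p>"));
    }
}
```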
