HtmlCleaner: The Best Web Page Parsing Library

Source: Internet
Author: User

Introduction

Today I recommend the best web page parsing library: HtmlCleaner. At least, it is currently the best HTML parsing library for Java.

I first got to know HtmlCleaner at the beginning of the year. A job required parsing HTML pages, so I searched online for HTML parsing libraries.

The HTML Parser library has a good reputation online. I tried it and found it very slow: it took several hundred milliseconds to process a large web page. Even worse, some web pages could not be parsed at all!

After a great deal of searching, I finally found the little-known HtmlCleaner. From the very first try, it was amazing!

HtmlCleaner

HtmlCleaner is extremely small and concise. The source code is only about 300 KB, and it is remarkably fast: it takes about 10 milliseconds to process HTML pages that take HTML Parser several hundred milliseconds.

In addition, of all the random web pages from the Internet that I tested, there was not one that it failed to parse.

Open the Javadoc of HtmlCleaner and you will see a long list of interfaces and classes. Don't worry: we only need to care about the HtmlCleaner class.

The HtmlCleaner library is extremely easy to use. You only need to call a few methods of the HtmlCleaner class.

The typical process is as follows:

HtmlCleaner cleaner = new HtmlCleaner(...); // one of few constructors


cleaner.setXXX(...); // optionally, set cleaner's behaviour


cleaner.clean(); // calls cleaning process

The clean method completes parsing the HTML page.


cleaner.writeXmlXXX(...); // writes resulting XML to string, file or any output stream

// cleaner.createDOM(); // creates DOM of resulting XML

The createDOM method of a DomSerializer instance returns an org.w3c.dom.Document object. Yes, this is the standard Java representation of an XML document. You can then use any third-party library to process the XML document.

// cleaner.createJDom(); // creates JDom of resulting XML

You can also generate a JDOM object and use JDOM for processing.
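
Because the result is a standard org.w3c.dom.Document, the JDK's own XPath support is enough to query it, without any extra library. The following is a minimal sketch of that idea. The class name DomLinkExtractor and both of its methods are made up for this illustration, and the Document is built from an already well-formed string as a stand-in for the output of HtmlCleaner, so that the example runs with the JDK alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of HtmlCleaner.
final class DomLinkExtractor {

    // Stand-in for HtmlCleaner output: parses already well-formed markup into a DOM.
    static Document parseWellFormed(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    // Collects the href attribute of every <a> element, in document order.
    static List<String> linkTargets(Document document) {
        try {
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", document, XPathConstants.NODESET);
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                targets.add(hrefs.item(i).getNodeValue());
            }
            return targets;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document document = parseWellFormed(
                "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>");
        System.out.println(linkTargets(document)); // prints [a.html, b.html]
    }
}
```

In real use, the Document passed to linkTargets would come from createDOM rather than from parseWellFormed.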

 

The following is a method I wrote using HtmlCleaner.

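import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.ParserConfigurationException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public Document convertHtmlToXml(URL url) {
    // Create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();
    // Take the default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    TagNode tagNode = null;
    Document document = null;
    try {
        // Download and parse the page into a TagNode tree
        tagNode = cleaner.clean(url);
        // Convert the TagNode tree into a standard W3C DOM document
        document = new DomSerializer(props, true).createDOM(tagNode);
    } catch (IOException e) {
        // The page could not be read
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // No DOM builder could be created
        e.printStackTrace();
    }
    // Note: document is null if either step above failed
    return document;
}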

After obtaining the document, you can use XML processing libraries such as dom4j for processing.
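
A common first step before handing the result to another XML library is to write the Document out as XML text. The sketch below does this with the JDK's javax.xml.transform API only. The class DomToString and its methods are hypothetical names for this illustration, and sampleDocument builds a tiny DOM by hand in place of a Document produced by HtmlCleaner.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of HtmlCleaner or dom4j.
final class DomToString {

    // Builds a small DOM by hand, standing in for a Document returned by createDOM.
    static Document sampleDocument() {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element html = document.createElement("html");
            Element body = document.createElement("body");
            Element paragraph = document.createElement("p");
            paragraph.setTextContent("Hello");
            body.appendChild(paragraph);
            html.appendChild(body);
            document.appendChild(html);
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the DOM as XML text without the leading XML declaration.
    static String toXmlString(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Force XML output; a root element named "html" would otherwise select HTML output.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXmlString(sampleDocument()));
    }
}
```

The METHOD property is set explicitly because the default output method depends on the name of the root element, which for cleaned web pages is usually html.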

How HtmlCleaner Is Implemented

From the HtmlCleaner source code you can see that its design idea is very simple: it uses regular expressions to parse HTML pages.
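
To give a feel for regular-expression-based scanning in general, here is a minimal sketch that pulls tag names out of an HTML string. It is not taken from the HtmlCleaner source: the class RegexTagScanner, its pattern, and its method are assumptions made up for this illustration, and the pattern deliberately ignores comments, scripts, and attribute values that contain a > character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of regex-based tag scanning; not HtmlCleaner code.
final class RegexTagScanner {

    // Matches an opening or closing tag; group 1 is the optional slash, group 2 the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns lower-cased tag names in order of appearance; closing tags keep their '/' prefix.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1) + matcher.group(2).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    public static void main(String[] args) {
        // The <B> tag is never closed, as often happens in real pages.
        System.out.println(scanTags("<P>Hello <B>world</P>")); // prints [p, b, /p]
    }
}
```

A cleaner built on such a scan would still need a second step that balances the unclosed tags into a tree.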

Its data container type, TagNode, follows a design idea similar to that of my oxmled library: XML is likewise treated as a simple tree structure. However, the TagNode class is a little more complicated than my inode interface. My inode interface has very few methods because I don't care about annotations.
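
The "simple tree structure" idea can be sketched in a few lines. The SimpleNode class below is a hypothetical example written for this article; it is neither the TagNode class of HtmlCleaner nor the inode interface of oxmled. Each node holds a name, attributes, leading text, and child nodes, and can write itself out as XML. For brevity, text is always emitted before the children, so mixed content order is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal tree node; not the real TagNode or inode type.
final class SimpleNode {
    private final String name;
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<SimpleNode> children = new ArrayList<>();
    private String text = "";

    SimpleNode(String name) {
        this.name = name;
    }

    // Fluent setters so that a small tree can be built in one expression.
    SimpleNode attribute(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    SimpleNode text(String value) {
        text = value;
        return this;
    }

    SimpleNode child(SimpleNode node) {
        children.add(node);
        return this;
    }

    // Escapes the characters that are special in XML text and attribute values.
    private static String escape(String raw) {
        return raw.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Serializes this node and its subtree as XML, depth first.
    String toXml() {
        StringBuilder out = new StringBuilder("<").append(name);
        attributes.forEach((key, value) ->
                out.append(' ').append(key).append("=\"").append(escape(value)).append('"'));
        out.append('>').append(escape(text));
        for (SimpleNode child : children) {
            out.append(child.toXml());
        }
        return out.append("</").append(name).append('>').toString();
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("p")
                .attribute("class", "intro")
                .text("Hello ")
                .child(new SimpleNode("b").text("world"));
        System.out.println(root.toXml()); // prints <p class="intro">Hello <b>world</b></p>
    }
}
```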

I have updated my oxmled library to build on HtmlCleaner, so that you can use oxmled to directly manipulate the HTML page content parsed by HtmlCleaner. However, I have been very busy recently and have not finished tidying it up. I will upload the new oxmled version in a few days. The oxmled library is described at:

http://blog.csdn.net/shendl/archive/2007/08/23/1755218.aspx

Project site: http://sourceforge.net/projects/oxmled/

Summary

The HtmlCleaner library is well written, which shows that a good design idea matters more than anything else. It is normal for open-source libraries to vary widely in quality. HtmlCleaner is a powerful tool that I have found, and I am now passing it on to all of you.
