The best web page parsing library: HtmlCleaner

Last Update:2018-12-05 Source: Internet

Author: User

Introduction

Today I recommend what I consider the best web page parsing library: HtmlCleaner. At the very least, it is the best HTML parsing library for Java that I have found so far.

I first came across HtmlCleaner at the beginning of the year. A job required parsing HTML pages, so I searched online for HTML parsing libraries.

The HTML Parser library has a good reputation on the Internet. I tried it and found it very slow: it took several hundred milliseconds to process a large web page. Even worse, some web pages could not be parsed at all!

After a great deal of searching, I finally found the little-known HtmlCleaner. From the first try, it was amazing!

HtmlCleaner

HtmlCleaner is extremely small and concise. Its source code is only about 300 KB, and its speed is amazing: HTML pages that take HTML Parser several hundred milliseconds to process take HtmlCleaner only about 10 milliseconds.

In addition, I tested it on random web pages from the Internet, and not a single one of them defeated it.

Open the Javadoc of HtmlCleaner and you will see a long list of interfaces and classes. Don't worry. We only need to care about the HtmlCleaner class.

The HtmlCleaner library is extremely easy to use. You only need to call a few methods of the HtmlCleaner class.

The typical process is as follows:

HtmlCleaner cleaner = new HtmlCleaner(...); // one of a few constructors

cleaner.setXXX(...); // optionally, set the cleaner's behaviour

cleaner.clean(); // calls the cleaning process

The clean method completes parsing the HTML page.

cleaner.writeXmlXXX(...); // writes the resulting XML to a string, file or any output stream

// cleaner.createDOM(); // creates a DOM of the resulting XML

The createDOM method of a DomSerializer instance returns an org.w3c.dom.Document object. Yes, this is Java's standard representation of an XML document. You can then use any third-party library to process the XML document.

// cleaner.createJDom(); // creates a JDOM of the resulting XML

You can also generate a JDOM object and use JDOM for processing.
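
Because createDOM yields a standard org.w3c.dom.Document, the XML tooling that ships with the JDK can work on the result directly. The sketch below is one possible illustration and is not part of HtmlCleaner: it serializes a Document to an XML string with the JDK's identity Transformer. The class name DomToStringSketch and its two helper methods are made up for this example, and the Document is built from a small well-formed string that merely stands in for HtmlCleaner's output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration; it uses only the JDK.
class DomToStringSketch {

    // Parses well-formed XML text into a DOM. This stands in for the
    // Document that HtmlCleaner would produce from a real HTML page.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Serializes a DOM Document to an XML string without an XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        System.out.println(toXml(doc));
    }
}
```

In real use, the Document passed to toXml would come from HtmlCleaner rather than from the parse helper.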

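The following is a method I wrote using HtmlCleaner.

import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.ParserConfigurationException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public Document convertHtmlToXml(URL url) {
    // Create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();
    // Take the default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    TagNode tagNode = null;
    Document document = null;
    try {
        // Download and parse the page into a tree of TagNode objects
        tagNode = cleaner.clean(url);
        // Convert the TagNode tree into a standard W3C DOM document
        document = new DomSerializer(props, true).createDOM(tagNode);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    // Note: document is still null here if an exception was caught
    return document;
}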

After obtaining the Document, you can process it with an XML processing library such as dom4j.
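
If you would rather not add a library such as dom4j, the JDK's own javax.xml.xpath package can also query the Document. What follows is a minimal sketch under stated assumptions, not a recommended design: DomXPathSketch, parse and linkTargets are hypothetical names, the input is hand-written XHTML standing in for a cleaned page, and the elements are assumed to carry no XML namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration; it uses only the JDK.
class DomXPathSketch {

    // Builds a DOM from well-formed markup (a stand-in for HtmlCleaner output).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the href value of every <a> element, in document order.
    static List<String> linkTargets(Document doc) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("invalid XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><a href=\"a.html\">A</a>"
                + "<p><a href=\"b.html\">B</a></p></body></html>");
        for (String href : linkTargets(doc)) {
            System.out.println(href);
        }
    }
}
```

The same linkTargets call should work on a Document returned by a method like the one above, provided the assumption about namespaces holds for that document.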

How HtmlCleaner is implemented

You can see from the source code of HtmlCleaner that its design idea is very simple: it uses regular expressions to parse HTML pages.

Its data container type, TagNode, follows an idea similar to the one behind my oxmled library: XML is just a simple tree structure. However, the TagNode class is a little more complicated than my inode interface. My inode interface has very few methods because I don't care about XML comments.
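
To make the "simple tree structure" idea concrete, here is a minimal sketch of such a container. It is purely illustrative: SimpleNode and all of its methods are hypothetical names invented for this example, and it reproduces neither HtmlCleaner's TagNode nor the oxmled inode interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tree node: a name, attributes, text and child nodes.
class SimpleNode {
    private final String name;
    private final Map<String, String> attributes = new LinkedHashMap<>();
    private final List<SimpleNode> children = new ArrayList<>();
    private String text = "";

    SimpleNode(String name) {
        this.name = name;
    }

    // The three builder-style methods return this node so calls can be chained.
    SimpleNode attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    SimpleNode text(String value) {
        this.text = value;
        return this;
    }

    SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    String getName() {
        return name;
    }

    String getText() {
        return text;
    }

    // Returns the attribute value, or null if the attribute is not set.
    String getAttribute(String key) {
        return attributes.get(key);
    }

    // Collects, depth first, every node in this subtree with the given name.
    List<SimpleNode> findAll(String wanted) {
        List<SimpleNode> found = new ArrayList<>();
        collect(wanted, found);
        return found;
    }

    private void collect(String wanted, List<SimpleNode> found) {
        if (name.equals(wanted)) {
            found.add(this);
        }
        for (SimpleNode child : children) {
            child.collect(wanted, found);
        }
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("html").add(
                new SimpleNode("body")
                        .add(new SimpleNode("a").attr("href", "a.html").text("A"))
                        .add(new SimpleNode("p").add(
                                new SimpleNode("a").attr("href", "b.html").text("B"))));
        for (SimpleNode link : root.findAll("a")) {
            System.out.println(link.getAttribute("href") + " " + link.getText());
        }
    }
}
```

A container this small has no place to store comments, which is the kind of simplification the paragraph above describes.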

I have updated my oxmled library to build on HtmlCleaner. This way, you can use the oxmled library to directly manipulate the HTML page content parsed by HtmlCleaner. However, I have been very busy recently and have not finished tidying it up. I will upload the new oxmled version in a few days. The oxmled library is described at

http://blog.csdn.net/shendl/archive/2007/08/23/1755218.aspx

Project site: http://sourceforge.net/projects/oxmled/

Summary

The HtmlCleaner library is well written, which shows that a good design idea matters more than anything else. It is normal for open-source libraries to vary widely in quality. HtmlCleaner is a powerful tool that I have found, and I am now passing it on to all of you.

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
