Introduction
Today I would like to recommend the best web page parsing library I know of: HtmlCleaner. At the very least, it is currently the best HTML parsing library for Java.
I first came across HtmlCleaner at the beginning of the year. A job required me to parse HTML pages, so I searched online for HTML parsing libraries.
The HTML Parser library has a good reputation on the Internet. I tried it and found it very slow: it took several hundred milliseconds to process a large web page. Even worse, some web pages could not be parsed at all!
After a great deal of searching, I finally found the little-known HtmlCleaner. From the very first try, it was amazing!
HtmlCleaner
HtmlCleaner is extremely small and lean. The source code is only 300 KB, and the speed is amazing: an HTML page that takes HTML Parser several hundred milliseconds to process takes HtmlCleaner about 10 milliseconds.
In addition, in my tests on random web pages from the Internet, I have not found a single page that it failed to parse.
Open the Javadoc of HtmlCleaner and you are greeted by a long list of interfaces and classes. Don't worry: the only class we need to care about is HtmlCleaner.
The HtmlCleaner library is extremely easy to use. You only need to call a few methods of the HtmlCleaner class.
The typical process is as follows:
HtmlCleaner cleaner = new HtmlCleaner(...); // one of few constructors
cleaner.setXXX(...); // optionally, set cleaner's behaviour
cleaner.clean(); // calls cleaning process
The clean method parses the HTML page.
cleaner.writeXmlXXX(...); // writes resulting XML to string, file or any output stream
// cleaner.createDOM(); // creates DOM of resulting XML
The createDOM method of a DomSerializer instance returns an org.w3c.dom.Document object. Yes, this is the standard Java representation of an XML document, so you can then process it with any library that works with XML documents.
// cleaner.createJDom(); // creates JDom of resulting XML
You can also generate a JDOM object and process it with JDOM.
The following is a method I wrote using HtmlCleaner.
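import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.ParserConfigurationException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

// Returns the page at the given URL as a W3C DOM document, or null if fetching or conversion fails.
public Document convertHtmlToXml(URL url) {
    // Create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();
    // Take default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    TagNode tagNode = null;
    Document document = null;
    try {
        // Fetch the page and parse it into a TagNode tree
        tagNode = cleaner.clean(url);
        // Convert the TagNode tree into a W3C DOM document
        document = new DomSerializer(props, true).createDOM(tagNode);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return document;
}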
After obtaining the document, you can use XML processing libraries such as dom4j for processing.
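As a rough illustration of that last step, here is a minimal sketch that pulls every link out of an org.w3c.dom.Document using only the XPath API that ships with the JDK. The class name LinkExtractor and both method names are made up for this example. The parseWellFormed helper is only a stand-in so the sketch runs without HtmlCleaner on the classpath; in real use the Document would come from a method like the one above. It also assumes the elements are not in an XML namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of HtmlCleaner.
class LinkExtractor {

    // Collects the href attribute of every <a> element that has one, in document order.
    static List<String> extractLinks(Document document) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xpath.evaluate(
                    "//a[@href]", document, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                links.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Stand-in for the HtmlCleaner step: parses markup that is already
    // well-formed into a Document, so the example needs nothing but the JDK.
    static Document parseWellFormed(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><a href=\"/a\">A</a><a>no link</a>"
                + "<a href=\"/b\">B</a></body></html>");
        System.out.println(extractLinks(doc)); // prints [/a, /b]
    }
}
```

The same Document can just as well be handed to a third-party library; the JDK API is used here only to keep the sketch free of extra dependencies.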
HtmlCleaner Implementation Method
You can see from the source code of HtmlCleaner that its design idea is very simple: it uses regular expressions to parse HTML pages.
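To give a feel for what regular-expression-driven tag scanning looks like, here is a minimal sketch using java.util.regex. It is my own illustration and not code taken from HtmlCleaner; the class name RegexTagScanner and the pattern are assumptions made up for this example. The pattern is deliberately simplified: it ignores comments, CDATA sections and a '>' character inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not HtmlCleaner's real tokenizer.
class RegexTagScanner {

    // Group 1: optional slash of a closing tag. Group 2: the tag name.
    // Everything up to the next '>' is treated as attributes and skipped.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the tag names in document order, lower-cased,
    // with closing tags prefixed by '/'.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1) + matcher.group(2).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    public static void main(String[] args) {
        // The <B> tag is never closed, as often happens on real pages.
        System.out.println(scanTags("<P>Hello <B>world</p>")); // prints [p, b, /p]
    }
}
```

A scan like this only finds the tags. Deciding where the missing closing tags belong and building a tree out of the result is the part that a library has to get right.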
Its data container type, TagNode, follows an idea similar to the one behind my oxmled library: XML is, after all, just a simple tree structure. However, the TagNode class is a little more complicated than my INode interface. My INode interface has very few methods because I don't care about comments.
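The "simple tree structure" idea can be sketched in a few lines. The SimpleNode class below is hypothetical and made up for this example; it is neither HtmlCleaner's TagNode nor the interface from oxmled. It only shows the general shape: a name, attributes, child nodes, and a depth-first search over them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal tree node, for illustration only.
class SimpleNode {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name) {
        this.name = name;
    }

    // Appends a child and returns this node so that calls can be chained.
    SimpleNode add(SimpleNode child) {
        children.add(child);
        return this;
    }

    // Depth-first search for this node and all descendants with the given name.
    List<SimpleNode> findAll(String wanted) {
        List<SimpleNode> found = new ArrayList<>();
        collect(wanted, found);
        return found;
    }

    private void collect(String wanted, List<SimpleNode> found) {
        if (name.equals(wanted)) {
            found.add(this);
        }
        for (SimpleNode child : children) {
            child.collect(wanted, found);
        }
    }

    public static void main(String[] args) {
        SimpleNode body = new SimpleNode("body")
                .add(new SimpleNode("p").add(new SimpleNode("a")))
                .add(new SimpleNode("a"));
        System.out.println(body.findAll("a").size()); // prints 2
    }
}
```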
I have updated oxmled to build on HtmlCleaner, so that the oxmled library can directly manipulate the HTML page content parsed by HtmlCleaner. However, I have been very busy recently and have not finished tidying it up. I will upload the new oxmled version in a few days. The oxmled library is described at
http://blog.csdn.net/shendl/archive/2007/08/23/1755218.aspx
Project site: http://sourceforge.net/projects/oxmled/
Summary
The HtmlCleaner library is well written. It shows that a good design idea matters more than anything else. Open-source libraries vary widely in quality, which is only to be expected. HtmlCleaner is a powerful tool that I have found, and I now recommend it to everyone.