Use JTidy to convert HTML to XML

Source: Internet
Author: User
Tags: tidy

Recently I needed to extract information from web pages. My plan was to convert the HTML into well-formed XML first and then use dom4j for the subsequent analysis. I tried several ready-made libraries: JTidy, NekoHTML, HTML Parser, and Jericho.

JTidy's r8 snapshot, however, is still only a nightly build, and the previous release, r7, came out four years ago. Has the project been abandoned? Perhaps there are too few contributors, or perhaps it is already mature.

JTidy provides a syntax checker and tag repairer that can clean up all kinds of messy HTML files so that they conform to the XHTML standard.

Compared with r7, the r8 snapshot changes how some parameters are set. For character encoding in particular, setInputEncoding and setOutputEncoding now set the encoding of the input and output files. They accept any valid Java encoding name, which is much more convenient than before.
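Since the encoding is passed as a plain string, it can help to check the name before handing it to JTidy. The following is a minimal sketch using only the JDK. The class and method names are made up for illustration, and it assumes that the names JTidy accepts match the charsets known to java.nio.charset.Charset, which may not hold for every JTidy build.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of JTidy.
final class EncodingCheck {

    // Returns true if the running JVM knows a charset with this name or alias.
    static boolean isUsableEncoding(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "no-such-encoding"}) {
            System.out.println(name + " usable: " + isUsableEncoding(name));
        }
    }
}
```

A name that passes this check could then be given to setInputEncoding or setOutputEncoding.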

In general the parsing results are good. In a few places you still need to adjust the generated file by hand or write a bit of code to handle it, but that is not a big problem.
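One way to find the places that still need manual adjustment is to run the converted output through an XML parser before passing it on to dom4j. The sketch below uses the JDK's built-in DOM parser. The class and method names are illustrative, and the sample strings contain no DOCTYPE, so no external DTD is fetched.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JTidy or dom4j.
final class WellFormedCheck {

    // Returns true if the text parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<html><body><p>ok</p><br/></body></html>";
        String bad = "<html><body><p>broken<br></body></html>";
        System.out.println("good: " + isWellFormed(good));
        System.out.println("bad: " + isWellFormed(bad));
    }
}
```

Files that fail this check are the ones that need another pass, either by hand or with a small fix-up routine.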

Some common parameter settings:

setAltText(java.lang.String altText)
Default value for missing alt attributes
setBreakBeforeBR(boolean breakBeforeBR)
Output a line break before each <br/>
setCharEncoding(int charEncoding)
Deprecated
setConfigurationFromFile(java.lang.String filename)
Read configuration from a file
setConfigurationFromProps(java.util.Properties props)
Read configuration from a Properties object
setErrfile(java.lang.String errfile)
Error output file
setFixBackslash(boolean fixBackslash)
Replace \ with / in URLs
setForceOutput(boolean forceOutput)
Produce output even if errors were found
setHideComments(boolean hideComments)
Omit comments from the output
setInputEncoding(java.lang.String encoding)
Input encoding
setLogicalEmphasis(boolean logicalEmphasis)
Replace <i> with <em> and <b> with <strong>
setMessageListener(TidyMessageListener listener)
Register a TidyMessageListener
setOnlyErrors(boolean onlyErrors)
Output only the error report
setOutputEncoding(java.lang.String encoding)
Output encoding
setPrintBodyOnly(boolean bodyOnly)
Output only the content of the body element
setRepeatedAttributes(int repeatedAttributes)
How duplicate attributes are handled
setSpaces(int spaces)
Number of spaces used for each indent level
setTidyMark(boolean tidyMark)
Whether to add the Tidy generator mark
setTrimEmptyElements(boolean trimEmpty)
Drop empty elements from the output
setUpperCaseAttrs(boolean upperCaseAttrs)
Output attribute names in uppercase
setUpperCaseTags(boolean upperCaseTags)
Output tag names in uppercase
setWraplen(int wraplen)
Line length at which long lines are wrapped
setXHTML(boolean xhtml)
Output XHTML
setXmlOut(boolean xmlOut)
Output XML
setXmlPi(boolean xmlPi)
Add the XML declaration at the top of the output
setXmlSpace(boolean xmlSpace)
Add xml:space attributes where needed
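The same settings can also be collected in a java.util.Properties object instead of being set one call at a time. The sketch below builds such an object with the JDK only. The option keys follow HTML Tidy's configuration naming and are an assumption here, so check them against the option names your JTidy version recognizes. The class and method names are illustrative.

```java
import java.util.Properties;

// Hypothetical helper; it only builds the Properties object and does not call JTidy.
final class TidyConfigSketch {

    // Collects settings for XML output; the keys are assumed Tidy option names.
    static Properties xmlOutputConfig(String inputEncoding, String outputEncoding) {
        Properties props = new Properties();
        props.setProperty("input-encoding", inputEncoding);   // cf. setInputEncoding
        props.setProperty("output-encoding", outputEncoding); // cf. setOutputEncoding
        props.setProperty("output-xml", "yes");               // cf. setXmlOut(true)
        props.setProperty("add-xml-decl", "yes");             // cf. setXmlPi(true)
        props.setProperty("tidy-mark", "no");                 // cf. setTidyMark(false)
        props.setProperty("indent-spaces", "2");              // cf. setSpaces(2)
        props.setProperty("force-output", "yes");             // cf. setForceOutput(true)
        return props;
    }

    public static void main(String[] args) {
        Properties props = xmlOutputConfig("UTF-8", "UTF-8");
        props.list(System.out);
    }
}
```

With JTidy on the classpath, the resulting object could then be handed to setConfigurationFromProps.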

Usage is very simple. Define the input and output streams and call tidy.parse() to convert:

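import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.w3c.tidy.Tidy;

// The class name and the argument order are illustrative.
public class HtmlToXml {
    public static void main(String[] args) {
        String configFileName = args[0]; // JTidy configuration file
        String srcFileName = args[1];    // HTML input
        String outFileName = args[2];    // XML output

        Tidy tidy = new Tidy();
        // Configuration file containing the parameters listed above
        tidy.setConfigurationFromFile(configFileName);

        // try-with-resources closes both streams even if parsing fails
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(srcFileName));
             FileOutputStream out = new FileOutputStream(outFileName)) {
            tidy.parse(in, out);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}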
