Convert HTML to XML with JTidy

Source: Internet
Author: User

I recently needed to extract information from web pages. The plan was to first convert the HTML into well-formed XML, so that dom4j could be used for the subsequent analysis. I tried several ready-made libraries (JTidy, NekoHTML, HTML Parser, Jericho) and finally settled on JTidy.

One concern: the r8 snapshot is only a nightly build, and the preceding r7 release is more than four years old. Is the project really that deserted? Are there too few contributors, or is the code simply mature?

JTidy provides a syntax checker and a tag repairer that can clean up all kinds of messy HTML so that it conforms to the XHTML standard.

Compared with r7, the r8 snapshot changes how some parameters are set, especially for character encoding. The setInputEncoding and setOutputEncoding methods determine the character encoding of the input and output files, and they accept any valid Java encoding name, which is much better than before.
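Since the encoding setters take Java encoding names, it can help to check a name before passing it on. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and it assumes that a name known to java.nio.charset.Charset is also acceptable to JTidy, which is worth verifying against the JTidy version in use.

```java
import java.nio.charset.Charset;

class EncodingCheck {
    /** Returns true if the running JVM has a charset registered under this name. */
    public static boolean isUsableEncoding(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "no-such-charset"}) {
            System.out.println(name + " -> " + isUsableEncoding(name));
        }
        // A name that passes the check could then be handed to JTidy, e.g.
        // tidy.setInputEncoding("ISO-8859-1"); tidy.setOutputEncoding("UTF-8");
    }
}
```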

Overall the conversion results are good. In some places the generated files still need manual adjustment, or a small piece of code to post-process them, but this is not a big problem.
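The text does not say which adjustments were needed. As one hypothetical illustration of such post-processing code, the sketch below rewrites HTML named entities such as &nbsp; as numeric character references, because a plain XML parser without the XHTML DTD does not know those names. The class name and the small entity table are assumptions made for this example, not part of JTidy.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityFixer {
    // Small illustrative sample of HTML named entities and their code points
    private static final Map<String, Integer> HTML_ENTITIES =
            Map.of("nbsp", 160, "copy", 169, "reg", 174);

    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known HTML named entities with numeric references. */
    public static String toNumericEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // Entities not in the table (including XML's own &amp;, &lt;, ...) stay as they are
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericEntities("<p>a&nbsp;b &amp; &copy; 2006</p>"));
    }
}
```

A real clean-up step would need a fuller entity table, or a different strategy altogether, depending on what the generated files actually contain.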

  Some common parameter settings:

setAltText(java.lang.String altText)
Default value for missing alt attributes
setBreakBeforeBR(boolean breakBeforeBR)
Output a line break before each <br/>
setCharEncoding(int charencoding)
Deprecated
setConfigurationFromFile(java.lang.String filename)
Read configuration settings from a file
setConfigurationFromProps(java.util.Properties props)
Read configuration settings from a Properties object
setErrfile(java.lang.String errfile)
File that error output is written to
setFixBackslash(boolean fixBackslash)
Replace \ with / in URLs
setForceOutput(boolean forceOutput)
Force output even if the generated XML contains errors
setHideComments(boolean hideComments)
Do not include comments in the result
setInputEncoding(java.lang.String encoding)
Input encoding
setLogicalEmphasis(boolean logicalEmphasis)
Replace <i> with <em> and <b> with <strong>
setMessageListener(TidyMessageListener listener)
Register a TidyMessageListener
setOnlyErrors(boolean onlyErrors)
Output only the errors
setOutputEncoding(java.lang.String encoding)
Output encoding
setPrintBodyOnly(boolean bodyOnly)
Output only the contents of the body
setRepeatedAttributes(int repeatedAttributes)
How duplicate attributes are handled
setSpaces(int spaces)
Number of spaces used for each level of indentation
setTidyMark(boolean tidyMark)
Whether to add the Tidy generator mark to the output
setTrimEmptyElements(boolean trimEmpty)
Do not output empty elements
setUpperCaseAttrs(boolean upperCaseAttrs)
Output attribute names in upper case
setUpperCaseTags(boolean upperCaseTags)
Output tag names in upper case
setWraplen(int wraplen)
Line length at which output is wrapped
setXHTML(boolean xhtml)
Output XHTML
setXmlOut(boolean xmlOut)
Output XML
setXmlPi(boolean xmlPi)
Add the XML declaration at the top of the file
setXmlSpace(boolean xmlSpace)
Add the xml:space attribute where needed
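Several of these settings can also be collected in a java.util.Properties object and passed to setConfigurationFromProps. The sketch below only builds the Properties object with the JDK. The key names follow the usual Tidy configuration-file spelling as far as I recall it; treat them, the example values, and the class name as assumptions to check against the JTidy release in use.

```java
import java.util.Properties;

class TidyProps {
    /** Builds a hypothetical configuration aimed at clean XHTML output. */
    public static Properties xhtmlOutputProps() {
        Properties props = new Properties();
        props.setProperty("input-encoding", "ISO-8859-1"); // cf. setInputEncoding
        props.setProperty("output-encoding", "UTF-8");     // cf. setOutputEncoding
        props.setProperty("output-xhtml", "yes");          // cf. setXHTML
        props.setProperty("tidy-mark", "no");              // cf. setTidyMark
        props.setProperty("hide-comments", "yes");         // cf. setHideComments
        props.setProperty("force-output", "yes");          // cf. setForceOutput
        return props;
    }

    public static void main(String[] args) {
        // Print the settings; with JTidy on the classpath they could be applied
        // via tidy.setConfigurationFromProps(xhtmlOutputProps());
        xhtmlOutputProps().list(System.out);
    }
}
```

The same key and value pairs, written one per line, would form the configuration file read by setConfigurationFromFile.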

Usage is extremely simple: define the input and output streams, then call parse() on a Tidy instance to do the conversion:

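import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.w3c.tidy.Tidy;

// configFileName, srcFileName and outFileName are String variables defined elsewhere
Tidy tidy = new Tidy();
// Load the configuration file that contains the settings listed above
tidy.setConfigurationFromFile(configFileName);
// try-with-resources closes both streams even if parsing fails
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(srcFileName));
     FileOutputStream out = new FileOutputStream(outFileName)) {
    tidy.parse(in, out);
} catch (IOException e) {
    System.out.println(e);
}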
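Once the output is well-formed XML, any XML parser can read it. The article's plan was to use dom4j; to stay free of extra dependencies, the sketch below uses the JDK's built-in DOM parser instead and pulls the href values out of a small hand-written XHTML sample. The class name, method name, and sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LinkExtractor {
    /** Parses well-formed XHTML and returns the href of every <a> element. */
    public static List<String> extractLinks(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                if (anchor.hasAttribute("href")) {
                    hrefs.add(anchor.getAttribute("href"));
                }
            }
            return hrefs;
        } catch (Exception e) {
            // Input that is not well-formed ends up here
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html><head><title>t</title></head><body>"
                + "<p><a href=\"http://example.com/a\">A</a><br/>"
                + "<a href=\"/b\">B</a></p></body></html>";
        System.out.println(extractLinks(sample));
    }
}
```

Feeding the same method unclosed tags such as a bare <br> raises an exception, which is exactly why the HTML is run through JTidy first.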


