Recently I wanted to extract information from web pages. My plan was to convert the HTML into well-formed XML first and then use dom4j for the subsequent analysis. I tried several ready-made libraries: JTidy, NekoHTML, HTML Parser, and Jericho.
JTidy's R8 is still only a snapshot, distributed as nightly builds. The previous release, R7, came out four years ago. Has the project been abandoned? Perhaps there are too few contributors, or perhaps it is simply mature already.
JTidy provides an HTML syntax checker and a tag-completion tool that can repair all kinds of messy HTML files so that they conform to the XHTML standard.
Compared with R7, the R8 snapshot changes how some parameters are set. Character encoding in particular is now handled by setInputEncoding and setOutputEncoding, which set the encoding of the input and output files. They accept any valid Java encoding name, which is much better than before.
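Because the encoding is now passed by name, it can be useful to check the name before handing it to JTidy. The following is a minimal sketch using only the JDK. The class and method names are hypothetical, and the call to setInputEncoding appears only as a comment because it needs the JTidy jar on the classpath.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class EncodingCheck {
    /** Returns true if this JVM recognises the given encoding name. */
    static boolean isValidEncoding(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not allowed
        }
    }

    public static void main(String[] args) {
        String encoding = args.length > 0 ? args[0] : "UTF-8";
        if (isValidEncoding(encoding)) {
            // With JTidy available: tidy.setInputEncoding(encoding);
            System.out.println("Usable encoding: " + encoding);
        } else {
            System.out.println("Unknown encoding: " + encoding);
        }
    }
}
```

Which names are supported beyond the standard ones such as UTF-8 and ISO-8859-1 depends on the JVM in use.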
In general the parsing results are good, although in some places you still need to adjust the generated file by hand or write a bit of code to handle it. This is not a big problem.
Some common parameter settings:
setAltText(java.lang.String altText)
Set the default value for missing alt attributes
setBreakBeforeBR(boolean breakBeforeBR)
Output a line break before each <br/>
setCharEncoding(int charEncoding)
Deprecated
setConfigurationFromFile(java.lang.String filename)
Read configuration information from a file
setConfigurationFromProps(java.util.Properties props)
Read configuration information from a Properties object
setErrfile(java.lang.String errfile)
Error output file
setFixBackslash(boolean fixBackslash)
Replace \ with / in URLs
setForceOutput(boolean forceOutput)
Force output even if the generated XML contains errors
setHideComments(boolean hideComments)
Do not output comments in the result
setInputEncoding(java.lang.String encoding)
Input encoding
setLogicalEmphasis(boolean logicalEmphasis)
Replace i with em and b with strong
setMessageListener(TidyMessageListener listener)
Add a TidyMessageListener
setOnlyErrors(boolean onlyErrors)
Output only the errors
setOutputEncoding(java.lang.String encoding)
Output encoding
setPrintBodyOnly(boolean bodyOnly)
Output only the contents of the body
setRepeatedAttributes(int repeatedAttributes)
How to handle duplicate attributes
setSpaces(int spaces)
Number of spaces used for each level of indentation
setTidyMark(boolean tidyMark)
Whether to generate the Tidy mark
setTrimEmptyElements(boolean trimEmpty)
Do not output empty elements
setUpperCaseAttrs(boolean upperCaseAttrs)
Output attribute names in uppercase
setUpperCaseTags(boolean upperCaseTags)
Output tag names in uppercase
setWraplen(int wraplen)
Line length at which long lines are wrapped
setXHTML(boolean xhtml)
Output XHTML
setXmlOut(boolean xmlOut)
Output XML
setXmlPi(boolean xmlPi)
Output the XML declaration at the top of the file
setXmlSpace(boolean xmlSpace)
Add the xml:space attribute where needed
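Several of these settings can also be supplied together through setConfigurationFromProps. The sketch below builds such a Properties object with the JDK alone. The option keys (input-encoding, output-xhtml and so on) are assumptions taken from HTML Tidy's configuration names and should be checked against the JTidy build in use. The class name is hypothetical, and the JTidy call is left as a comment because it needs the JTidy jar.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class TidyConfigSketch {
    // Assumed option keys, written in "key: value" form.
    static final String CONFIG =
        "input-encoding: UTF-8\n" +
        "output-encoding: UTF-8\n" +
        "output-xhtml: yes\n" +
        "tidy-mark: no\n" +
        "indent-spaces: 2\n" +
        "wrap: 0\n";

    /** Parses "key: value" lines into a Properties object. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text)); // ':' is a valid key/value separator
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(CONFIG);
        // With JTidy available: tidy.setConfigurationFromProps(props);
        props.list(System.out);
    }
}
```

The same "key: value" lines could be saved to a file and passed to setConfigurationFromFile instead.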
Usage is very simple. Define the input and output streams and call tidy.parse() to convert:
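import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.w3c.tidy.Tidy;

// configFileName, srcFileName and outFileName are String paths defined elsewhere
Tidy tidy = new Tidy();
tidy.setConfigurationFromFile(configFileName); // configuration file holding the parameters listed above
// try-with-resources closes both streams even if parsing fails
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(srcFileName));
     FileOutputStream out = new FileOutputStream(outFileName)) {
    tidy.parse(in, out); // read the HTML from in and write the cleaned result to out
} catch (IOException e) {
    System.out.println(e);
}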
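Before handing the converted file to dom4j, it may be worth confirming that it really is well-formed XML. The sketch below does this with the JDK's built-in parser as a stand-in for dom4j. The class and method names are hypothetical, and the sample strings are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true if the stream parses as well-formed XML. */
    static boolean isWellFormed(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(in);
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    /** Convenience overload for in-memory text. */
    static boolean isWellFormed(String xml) {
        return isWellFormed(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><br/></body></html>"));
        System.out.println(isWellFormed("<html><body><br></body></html>"));
    }
}
```

Output that carries an XHTML DOCTYPE may cause the parser to try to fetch the referenced DTD, so the sample input here deliberately has none.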