March 01, 2004
This article does not set out to analyze HTML syntax and parse the data out of it ourselves. Doing so is difficult and has little practical value; put another way, we do not want to implement an HTML parser on our own. All we need is to extract the information we want from HTML. Unlike XML, a markup language with strict format requirements, HTML was not strictly specified when it was introduced. For example, tags in HTML do not have to appear in pairs, yet the browser is expected to display the intended content as faithfully as possible. After years of development, browsers have become increasingly tolerant, and many poorly formatted HTML files display satisfactorily. Extracting data precisely from HTML, however, is far more troublesome than displaying it. Now we have finally found a remedy for this headache.
Next, we introduce how to use Java to obtain the data contained in HTML easily and quickly. We rely on an existing, mature API, because there is no need to write our own HTML parser just for this purpose. The open-source project introduced here is HTML Parser, one of the more active projects on sourceforge.net; its latest version is the 1.4 release. As the HTML Parser website describes it, HTML Parser is a fast, real-time parser for existing HTML. In practice, you will be even more impressed by how thoughtfully HTML Parser handles HTML.
Since many readers are not familiar with this project, I will organize the article in three steps. First, I raise the problem that made me need to parse HTML. Second, I analyze the problem and consider how HTML Parser can do what I need. Finally, I solve the problem.
Raising the problem
While developing a content management project, we stored content in HTML format and provided a browser-based WYSIWYG HTML editor for writing it. Users often copy formatted content from other websites and publish it directly. The homepage displays a summary of each item, and the summary is produced simply by cutting the first few characters off the content. Because the cut length is fixed, the truncated content can contain incomplete markup. For example, when I use the first 400 characters as the summary, tables in the content are often cut off in the middle. Incomplete markup damages the layout of the whole page; the most common symptom we see is that the page is stretched wider. The source of the output page shows that these incomplete table tags are the cause.
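The effect of a fixed-length cut can be reproduced with a few lines of plain Java. This is only a minimal sketch: the class name TruncationDemo, the sample HTML, and the cut length of 30 are made up for illustration and are not taken from the project described here.

```java
// Hypothetical demo; names, sample HTML and cut length are illustrative only.
final class TruncationDemo {

    /** Fixed-length cut: keep at most the first max characters. */
    static String summary(String html, int max) {
        return html.length() <= max ? html : html.substring(0, max);
    }

    /** Counts non-overlapping, case-insensitive occurrences of token in text. */
    static int count(String text, String token) {
        String haystack = text.toLowerCase();
        String needle = token.toLowerCase();
        int n = 0;
        int i = haystack.indexOf(needle);
        while (i >= 0) {
            n++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return n;
    }

    public static void main(String[] args) {
        String html = "<p>Intro</p><table><tr><td>cell</td></tr></table><p>End</p>";
        String cut = summary(html, 30);
        // The cut falls inside the table, so the table is opened but never closed.
        System.out.println(cut);
        System.out.println("opened tables: " + count(cut, "<table"));
        System.out.println("closed tables: " + count(cut, "</table>"));
    }
}
```

With this input the summary contains one opening table tag and no closing one, which is the kind of fragment that breaks the layout of the surrounding page.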
Analyzing the problem
The solution is to make the cutting smarter, so that it automatically handles truncated tables and similar markup, the culprits behind the broken layout. Our previous method searched the cut content for table tags that were not properly closed. It fixed most of the display problems, but it handled only the simplest case: once a nested table is cut, it can do nothing. Handling nested tables ourselves raises many issues, because web designers use tables heavily to make page layouts attractive; in fact, this is the only way to lay out HTML. We would therefore have to handle many kinds of table nesting separately. Imagine writing ten thousand lines of code for a seemingly simple problem; your manager would certainly lose patience with you ^_^. Worse, those ten thousand lines might still not solve the problem. We should therefore adopt a mature, stable API instead. HTML Parser is an open-source project for parsing HTML text; it processes the format and data in HTML accurately and efficiently. Nearly 20 engineers from around the world work on the project.
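The limitation of the earlier approach can be illustrated with a short sketch. The code below is an assumed reconstruction of such a naive check, not the project's actual code: it only compares the position of the last opening table tag with the position of the last closing one. The class and method names are hypothetical.

```java
// Hypothetical sketch of a naive "last unclosed table" check.
final class NaiveTableCheck {

    /**
     * If the last <table opens after the last </table>, drop everything
     * from that opening tag onward; otherwise return the input unchanged.
     */
    static String dropLastOpenTable(String cut) {
        String lower = cut.toLowerCase();
        int open = lower.lastIndexOf("<table");
        int close = lower.lastIndexOf("</table>");
        return open > close ? cut.substring(0, open) : cut;
    }

    /** Number of <table openings minus the number of </table> closings. */
    static int unclosedTables(String html) {
        String lower = html.toLowerCase();
        return count(lower, "<table") - count(lower, "</table>");
    }

    private static int count(String text, String token) {
        int n = 0;
        int i = text.indexOf(token);
        while (i >= 0) {
            n++;
            i = text.indexOf(token, i + token.length());
        }
        return n;
    }

    public static void main(String[] args) {
        // Simple case: one table, cut in the middle. The check removes it.
        String simple = "<p>a</p><table><tr><td>x";
        // Nested case: the inner table is complete, the outer one is cut.
        String nested = "<table><tr><td><table><tr><td>x</td></tr></table></td>";
        System.out.println(unclosedTables(dropLastOpenTable(simple)));
        System.out.println(unclosedTables(dropLastOpenTable(nested)));
    }
}
```

In the nested case the last closing tag comes after the last opening tag, so the check sees nothing wrong and leaves the outer table unclosed. Covering every nesting pattern by hand is what quickly grows into the large amount of code described above.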
The HTML Parser project can be used in the following two areas:
1. Information Extraction
- Text extraction, such as searching HTML for useful information
- Link extraction, used to automatically wrap the link text on a page in link tags
- Resource extraction, such as handling images and sounds
- Link checking, to verify that the links in HTML are valid
- Page content monitoring
2. Information Conversion
- Link rewriting, to modify all hyperlinks on a page
- Web page copying, to save the content of a page locally
- Content checking, which can be used to filter unwanted words on a page
- HTML cleanup, to tidy up badly formatted HTML
- Conversion to XML format
HTML Parser does not provide dedicated handling for every application listed above, but it is fully capable of supporting them; in practice, you can build on this project to handle these problems.
Solving the problem
Next, I solve the page truncation problem described earlier. My approach is to truncate the HTML content by force and then pass the truncated content to HTML Parser to complete the missing tags. Some content may not be displayed in full, but at least the page layout is not damaged. This simple example shows the basic structure of HTML Parser and how it is used.
Download the HTML Parser package from the SourceForge website (see the References section at the end of the document for the download URL) and decompress it. In the decompressed directory structure (figure not reproduced here), the jar file we need is marked in red. Add this file to the project's class path; the other files can be ignored.
import org.apache.commons.lang.StringUtils; // assumed source of StringUtils.left
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.TableTag;

// content, MAX_COUNT, MAX_COUNT2 and ISO8859_1 are members of the enclosing class (not shown).

/**
 * Get the preview information of the HTML, where content is an attribute
 * of the object, that is, the HTML content to be processed.
 * @return the preview HTML
 */
public String getPreviewContent() throws Exception {
    // Take the first N characters
    String ct = StringUtils.left(content, MAX_COUNT);
    // First complete a tag that was cut in the middle, to avoid fragments such as <tab
    if (ct != null && content != null) {
        int idx2 = ct.lastIndexOf('>');
        int idx1 = ct.lastIndexOf('<');
        if ((idx2 == -1 && idx1 >= 0) || idx1 > idx2) {
            String ct2 = content.substring(ct.length());
            int idx3 = ct2.indexOf('>');
            if (idx3 != -1 && idx3 < (MAX_COUNT2 - MAX_COUNT)) {
                ct += content.substring(ct.length(), ct.length() + idx3 + 1);
            }
        }
    }
    // Pre-process: complete or drop an unfinished <object> element
    if (ct != null && content != null) {
        int idx2 = ct.toLowerCase().lastIndexOf("</object>");
        int idx1 = ct.toLowerCase().lastIndexOf("<object");
        if ((idx2 == -1 && idx1 >= 0) || idx1 > idx2) {
            String ct2 = content.substring(ct.length()).toLowerCase();
            int idx3 = ct2.indexOf("</object>");
            if (idx3 != -1)
                ct += content.substring(ct.length(), ct.length() + idx3 + 9);
            else
                ct = ct.substring(0, idx1);
        }
    }
    if (ct != null && content != null) {
        // Chinese text must be transcoded before it is passed in
        Parser parser = Parser.createParser(new String(ct.getBytes(), ISO8859_1));
        Node[] tables = parser.extractAllNodesThatAre(TableTag.class);
        if (tables != null && tables.length > 0) {
            TableTag tableTag = (TableTag) tables[0];
            // Convert the processed data back to GBK encoding
            ct = ct.substring(0, tableTag.getStartPosition())
                    + new String(tableTag.toHtml().getBytes(ISO8859_1));
        }
    }
    return ct;
}
The code above produces the HTML summary. The final block, which creates the Parser and extracts the table tags, is the part that uses HTML Parser. The Parser class is the entry point of HTML Parser. We can pass it HTML text, or pass it a URL directly, as shown below:
Parser parser = new Parser("http://www.javayou.com");
The tag classes, such as TableTag, are located in the org.htmlparser.tags package. The extractAllNodesThatAre method lets us process other kinds of tags just as easily: we pass in a different class depending on the tag we want to process. Each element of the returned array is an instance of the class passed in. Through this instance, you can access the start and end tags of the current tag and the text it contains, as well as its parent tag and all of its child tags. You can also use the toHtml method to clean up the HTML contained in the tag: HTML Parser automatically adds the closing tags that are missing, so the generated string contains complete markup. Displaying it on a page does not damage the layout, which is exactly the result we wanted.
To make the execution more intuitive, here is a small example, followed by its execution result:
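To make the idea of adding missing closing tags concrete, here is a minimal sketch that uses only the standard library. It is not how HTML Parser works internally; it is an assumed, simplified illustration that tracks a few tag names on a stack and appends whatever is still open. The class name TagCloser and its method are hypothetical, and the sketch assumes the cut did not split a tag in the middle.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not HTML Parser's algorithm.
final class TagCloser {

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    /** Appends closing tags for tracked elements that are still open at the end. */
    static String closeOpenTags(String html, String... trackedNames) {
        Set<String> tracked = new HashSet<>(Arrays.asList(trackedNames));
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!tracked.contains(name)) {
                continue; // ignore tags we were not asked to track
            }
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                open.push(name);
            } else if (name.equals(open.peek())) {
                open.pop(); // closes the innermost open tag
            }
        }
        StringBuilder out = new StringBuilder(html);
        for (String name : open) { // iterates from innermost to outermost
            out.append("</").append(name).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeOpenTags("<table><tr><td>cell", "table", "tr", "td"));
    }
}
```

A real parser has to cope with far more than this, such as attributes containing angle brackets, comments, scripts, and mismatched nesting, which is the reason for relying on HTML Parser rather than on code like this.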
public static void main(String[] args) throws Exception {
    // Incomplete HTML
    String html = "We are pests <Table> 1234567890 <Table> LK Hello China ";
    Parser parser = Parser.createParser(new String(html.getBytes(), "8859_1"));
    Node[] tables = parser.extractAllNodesThatAre(TableTag.class);
    for (int i = 0; i < tables.length; i++) {
        TableTag tableTag = (TableTag) tables[i];
        // Print the position of the end tag
        System.out.println("End pos:" + tableTag.getEndTag().getEndPosition());
        // Complete the unclosed tags and print the result
        System.out.println(new String(tableTag.toHtml().getBytes("8859_1")));
    }
}
This code finds all the table tags in an incomplete piece of HTML and prints the completed HTML. The execution result in the Eclipse environment was shown in a figure (not reproduced here).
To help apply HTML Parser to real business needs, the project ships several examples that implement the features listed earlier. The bin directory of the decompressed package contains batch commands for running these examples; pass a URL or the path of an HTML file to the command when running it.
The HTML Parser project only gives us a simple and robust API for analyzing HTML text. More ways of applying it are left for us to explore. I hope this article has opened the door to HTML Parser for you.
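Both listings wrap the text in new String(x.getBytes(), "8859_1") before parsing and reverse the step afterward. The sketch below only demonstrates that such a round trip through ISO-8859-1 is lossless, because every byte maps to exactly one character and back. It uses UTF-8 as a stand-in for the platform default or GBK encoding used in the article, which is an assumption made so that the example runs the same way everywhere; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; UTF-8 stands in for the article's platform encoding.
final class Latin1RoundTrip {

    /** Re-labels the encoded bytes as ISO-8859-1: one char per byte. */
    static String toLatin1View(String text, Charset nativeCharset) {
        return new String(text.getBytes(nativeCharset), StandardCharsets.ISO_8859_1);
    }

    /** Reverses toLatin1View and restores the original characters. */
    static String fromLatin1View(String view, Charset nativeCharset) {
        return new String(view.getBytes(StandardCharsets.ISO_8859_1), nativeCharset);
    }

    public static void main(String[] args) {
        String original = "<table>\u4e2d\u56fd</table>"; // two Chinese characters inside a table
        String view = toLatin1View(original, StandardCharsets.UTF_8);
        String restored = fromLatin1View(view, StandardCharsets.UTF_8);
        // ASCII markup is unchanged in the view; only the Chinese part is expanded to bytes.
        System.out.println(view.startsWith("<table>"));
        System.out.println(restored.equals(original));
    }
}
```

Note that character positions in the ISO-8859-1 view count bytes of the encoded text rather than characters of the original string, which matters when positions reported for the view are applied to the original text.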
References
- HTML Parser project homepage: http://htmlparser.sourceforge.net/
- HTML Parser on SourceForge: http://sourceforge.net/projects/htmlparser
- Address: http://www.ibm.com/developerworks/cn/java/l-html-parser/