The main difficulty in extracting content from an HTML page is finding a way to identify precisely the part we want. For example, the following HTML fragment displays a news headline:
<table border="0" width="11%" class="somestory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>
<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">Iraq war!</td>
</tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>
Looking at the code above, it is easy to see that the news headline is displayed in the middle table, whose class attribute is set to headline. If the HTML page is very complex, you can use an additional feature available for Microsoft IE 5.0 and later to view only the HTML source of the selected part of the page. Visit http://www.microsoft.com/windows/ie/webaccess/default.asp for details. In this example, we assume that this is the only table whose class attribute is set to headline. Now we want to create a regular expression that finds the headline table so that we can include it in our own page. First, write the code that sets up the regular expression object:
<%
Dim re, strhtml
Set re = New RegExp ' create a regular expression object
re.IgnoreCase = True
re.Global = False ' end the search after the first match
%>
Next, consider the region we want to extract. Here we want the entire table structure, including its closing tag and the headline text. The search should therefore begin at the opening table tag: re.Pattern = "<table.*(?=headline)". This regular expression matches the start of the table tag and returns everything between that point and "headline". Because "." does not match a line break, the tag and "headline" must be on the same line. The following code returns the matching HTML:
' Put all matching HTML code into the matches collection.
Set matches = re.Execute(strhtml)
' Display all matching HTML code
For Each item In matches
    Response.Write item.Value
Next
' Show one of the items
Response.Write matches.Item(0).Value
Run this code against the HTML snippet shown above. The regular expression returns the following match: <table border="0" width="11%" class=". The lookahead "(?=headline)" does not consume characters, so the value of the table's class attribute does not appear in the result. Getting the rest of the table is also quite simple: re.Pattern = "<table.*(?=headline)(.|\n)*?</table>". The "*" after "(.|\n)" matches zero or more arbitrary characters, line breaks included, while the "?" makes the "*" non-greedy, that is, it matches as few characters as possible before the next part of the expression can match. "</table>" is the closing tag of the table.
The "?" prevents the expression from returning code from other tables. For example, if the "?" is removed, running the expression against the HTML snippet above returns:
<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">Iraq war!</td>
</tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>
The returned content contains not only the headline table but also the someotherstory table, because the greedy "*" runs on to the last "</table>" in the input. This shows that the "?" is indispensable.
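To try the comparison directly, the two forms of the pattern can be run side by side. The following is only a minimal sketch: the html variable holds a shortened copy of the sample page (attributes trimmed for brevity, which is an assumption of this example), and Server.HTMLEncode is used so that the browser shows the extracted tags as text.

```vbscript
<%
Dim html, re

' Shortened copy of the sample page; the real source would be fetched elsewhere.
html = "<table class=""somestory"">" & vbCrLf & _
       "<tr><td>other content...</td></tr>" & vbCrLf & _
       "</table>" & vbCrLf & _
       "<table class=""headline"">" & vbCrLf & _
       "<tr><td>Iraq war!</td></tr>" & vbCrLf & _
       "</table>" & vbCrLf & _
       "<table class=""someotherstory"">" & vbCrLf & _
       "<tr><td>other content...</td></tr>" & vbCrLf & _
       "</table>"

Set re = New RegExp
re.IgnoreCase = True
re.Global = False

' Lazy quantifier: stops at the first </table> after "headline".
' Item(0) is safe here only because the sample is known to match.
re.Pattern = "<table.*(?=headline)(.|\n)*?</table>"
Response.Write "<pre>" & Server.HTMLEncode(re.Execute(html).Item(0).Value) & "</pre>"

Response.Write "<hr>"

' Greedy quantifier: runs on to the last </table> in the input.
re.Pattern = "<table.*(?=headline)(.|\n)*</table>"
Response.Write "<pre>" & Server.HTMLEncode(re.Execute(html).Item(0).Value) & "</pre>"
%>
```

The first block should show only the headline table, and the second should also include the someotherstory table.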
The assumptions in this example are quite idealized. In practice, the situation is often much more complicated. Writing the ASP code is particularly difficult when you have no control over how the source HTML is written. The most effective approach is to spend more time analyzing the HTML around the content to be extracted, and to test frequently to make sure the extracted content is exactly what you need. In addition, you should handle the case where the regular expression matches nothing on the source page. Content can change very quickly. Do not let your pages fail with embarrassing, avoidable errors just because someone else changed the format of their content.
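The no-match case mentioned above could be handled along these lines. This is a sketch under stated assumptions: WriteHeadline is a hypothetical helper name, and the fallback message is a placeholder that a real page would replace with whatever suits it.

```vbscript
<%
' Hypothetical helper: writes the extracted headline table, or a fallback
' message when the source page no longer matches the pattern.
Sub WriteHeadline(strhtml)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    re.Pattern = "<table.*(?=headline)(.|\n)*?</table>"
    Set matches = re.Execute(strhtml)
    If matches.Count = 0 Then
        ' The source format changed or the headline is missing:
        ' degrade gracefully instead of raising a runtime error.
        Response.Write "<p>Headline currently unavailable.</p>"
    Else
        Response.Write matches.Item(0).Value
    End If
End Sub

' Example call with an input that contains no headline table.
WriteHeadline "<table class=""somestory""><tr><td>other content...</td></tr></table>"
%>
```

Checking matches.Count before calling matches.Item(0) matters because indexing an empty collection raises a runtime error, which is exactly the kind of failure a format change on the source page would otherwise cause.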
This article is from the CSDN blog. Please credit the source when reproducing it: http://blog.csdn.net/wuhuiran/archive/2008/08/01/2750765.aspx