Using regular expressions to extract specific parts of an HTML page

The main problem in extracting content from an HTML page is finding a way to identify precisely the part of the content we want. For example, the following HTML fragment displays a news headline:
<table border="0" width="11%" class="somestory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>
<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">Iraq war!</td>
</tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>

Looking at the code above, it is easy to see that the news headline is displayed in the middle table, whose class attribute is set to headline. If the HTML page is very complex, you can use an add-on available for Microsoft IE 5.0 and later to view only the HTML code of the selected part of the page. Visit http://www.microsoft.com/windows/ie/webaccess/default.asp for details. In this example, we assume that this is the only table whose class attribute is set to headline. We now want to create a regular expression that finds the headline table so that we can include it in our own page. First, write the code that sets up regular expression support:
<%
Dim re, strHTML
Set re = New RegExp   ' create a regular expression object

re.IgnoreCase = True
re.Global = False     ' stop searching after the first match
%>
 

 

Next, consider the region we want to extract. Here we want the entire <table> structure, including the closing tag and the headline text. The search should therefore start at the opening <table> tag: re.Pattern = "<table.*(?=headline)". This regular expression matches the opening tag of the table and returns everything from the start of the tag up to "headline" (line breaks excluded, because "." does not match them). The following code returns the matching HTML:

' Put all matching HTML code into the matches collection
Set matches = re.Execute(strHTML)

' Display all matching HTML code
For Each item In matches
    Response.Write item.Value
Next

' Display a single item
Response.Write matches.Item(0).Value
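
To see the lookahead behave as described, here is a minimal sketch written in Python rather than the article's VBScript (a substitution made only so the example can be run outside ASP). Python's re module accepts the same (?=...) syntax. The html variable is a compacted copy of the sample page, and re.search is used as a stand-in for Global = False, since it stops at the first match.

```python
import re

# Compacted copy of the article's sample page (assumption: whitespace inside
# each table does not matter for this demonstration)
html = """\
<table border="0" width="11%" class="somestory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
<table border="0" width="11%" class="headline">
<tr><td width="100%"><p align="center">Iraq war!</td></tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
"""

# (?=headline) is a lookahead: the text must be present, but it is not
# included in the match. "." does not cross line breaks by default.
pattern = re.compile(r"<table.*(?=headline)", re.IGNORECASE)

match = pattern.search(html)  # first match only
prefix = match.group(0)
print(prefix)
```

The first table is skipped because nothing on its opening line is followed by "headline", so the match begins at the second table and stops just before the class value.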
 

Run this code against the HTML snippet shown above. The regular expression returns the following match: <table border="0" width="11%" class=". In the regular expression, "(?=headline)" is a lookahead that does not consume characters, so the value of the table's class attribute does not appear in the match. The pattern that gets the rest of the table is also quite simple: re.Pattern = "<table.*(?=headline)(.|\n)*?</table>". The "*" after "(.|\n)" matches zero or more arbitrary characters, including line breaks, while the "?" makes the "*" match as few characters as possible before the next part of the expression is found. </table> is the closing tag of the table.
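
The full pattern can be tried the same way. The following is a sketch in Python, not the article's VBScript, under the assumption that the sample page is held in a string named html (a name chosen for illustration). The pattern itself is unchanged from the article.

```python
import re

# Compacted copy of the article's sample page
html = """\
<table border="0" width="11%" class="somestory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
<table border="0" width="11%" class="headline">
<tr><td width="100%"><p align="center">Iraq war!</td></tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
"""

# (.|\n)*? matches any characters, line breaks included, but as few as
# possible, so the match ends at the first </table> after "headline"
pattern = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)

match = pattern.search(html)
headline_table = match.group(0)
print(headline_table)
```

Only the headline table is printed: the match starts at its opening tag and ends at its own closing tag.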

The "?" prevents the expression from returning code from other tables. For example, if the "?" is removed from the pattern, the content returned for the HTML snippet above is:

<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">Iraq war!</td>
</tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr>
<td width="100%">
<p align="center">other content...</td>
</tr>
</table>
 

 

The returned content contains not only the headline table but also the someotherstory table. This shows that the "?" is indispensable.
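
The difference between the two quantifiers can be checked by counting closing tags in each result. This is a small Python sketch (again a stand-in for the article's VBScript). The names lazy and greedy are illustrative only.

```python
import re

# Compacted copy of the article's sample page
html = """\
<table border="0" width="11%" class="somestory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
<table border="0" width="11%" class="headline">
<tr><td width="100%"><p align="center">Iraq war!</td></tr>
</table>
<table border="0" width="11%" class="someotherstory">
<tr><td width="100%"><p align="center">other content...</td></tr>
</table>
"""

# Same pattern with and without the "?" after the "*"
lazy = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)
greedy = re.compile(r"<table.*(?=headline)(.|\n)*</table>", re.IGNORECASE)

lazy_match = lazy.search(html).group(0)
greedy_match = greedy.search(html).group(0)

# The lazy form stops at the first </table>; the greedy form runs to the last
print(lazy_match.count("</table>"), greedy_match.count("</table>"))
```

On this sample the lazy match holds one table and the greedy match holds two, because the greedy "*" extends to the last </table> on the page.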

This example rests on some rather idealized assumptions. In practice, the situation is often much more complicated. Writing the ASP code is particularly difficult when you have no control over how the source HTML is written. The most effective approach is to spend extra time analyzing the HTML around the content to be extracted, and to test often to make sure the extracted content is exactly what you need. In addition, take care to handle the case where the regular expression matches nothing in the source HTML page. The content may be updated very frequently. Do not let your own pages suffer from embarrassing, avoidable errors merely because someone else changed the format of their content.
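
One way to guard against the no-match case is to check the result before using it and fall back to placeholder markup. The sketch below is in Python, not the article's VBScript. The function names extract_headline_table and render, and the fallback text, are hypothetical and not part of the original code.

```python
import re
from typing import Optional

# Pattern taken from the article
HEADLINE_TABLE = re.compile(r"<table.*(?=headline)(.|\n)*?</table>", re.IGNORECASE)


def extract_headline_table(html: str) -> Optional[str]:
    """Return the headline table, or None when the page no longer matches."""
    match = HEADLINE_TABLE.search(html)
    return match.group(0) if match else None


def render(html: str, fallback: str = "<p>Headline unavailable.</p>") -> str:
    """Return the extracted table, or the fallback markup if nothing matched."""
    table = extract_headline_table(html)
    return table if table is not None else fallback


# A page in the expected format, and one whose class name has changed
page_ok = '<table class="headline">\n<tr><td>Iraq war!</td></tr>\n</table>'
page_changed = '<table class="topstory">\n<tr><td>Iraq war!</td></tr>\n</table>'

print(render(page_ok))
print(render(page_changed))
```

In VBScript, the corresponding check would be to test matches.Count before reading matches.Item(0).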

This article is from the CSDN blog. When reproducing it, please cite the source: http://blog.csdn.net/wuhuiran/archive/2008/08/01/2750765.aspx
