Common ASP code for collecting (scraping) HTML content.

Source: Internet
Author: User
Tags: file, url

Let's start with the principle of collection.

A collection program has two main steps:

1. Fetch the content of the page to be collected
2. Extract the needed data from the fetched code

1. Fetch the content of the page to be collected

Two components are commonly used in ASP to fetch a page's content:

1. Use the ServerXMLHTTP component to fetch the data. The code is as follows:

Function GetBody(weburl)
    'Create the object
    Dim ObjXMLHTTP
    Set ObjXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'Request the file (False = synchronous request)
    ObjXMLHTTP.Open "GET", weburl, False
    ObjXMLHTTP.send
    'Wait until the response is complete
    While ObjXMLHTTP.readyState <> 4
        ObjXMLHTTP.waitForResponse 1000
    Wend
    'Return the result
    GetBody = ObjXMLHTTP.responseBody
    'Release the object
    Set ObjXMLHTTP = Nothing
End Function
Call method:
GetBody(file URL)

2. Or use the XMLHTTP component to fetch the data. The code is as follows:

Function GetBody(weburl)
    'Create the object
    Dim Retrieval
    Set Retrieval = CreateObject("Microsoft.XMLHTTP")
    With Retrieval
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .ResponseBody
    End With
    'Release the object
    Set Retrieval = Nothing
End Function

Call method:
GetBody(file URL)

The fetched data is a raw byte stream; it must be decoded with the page's character set before use. The code is as follows:

Function BytesToBstr(body, Cset)
    Dim objstream
    Set objstream = Server.CreateObject("ADODB.Stream")
    objstream.Type = 1        'adTypeBinary
    objstream.Mode = 3        'adModeReadWrite
    objstream.Open
    objstream.Write body
    objstream.Position = 0
    objstream.Type = 2        'adTypeText
    objstream.Charset = Cset  'the character set to decode with
    BytesToBstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
End Function

Call method: BytesToBstr(data to convert, charset) 'the charset is commonly GB2312 or UTF-8
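For comparison, the same decode step in Python, as a sketch: the raw byte stream returned by the fetch is decoded with the page's charset, mirroring what BytesToBstr does through ADODB.Stream.

```python
def bytes_to_str(body: bytes, charset: str) -> str:
    """Decode a raw byte stream with the page's charset,
    like BytesToBstr does via ADODB.Stream."""
    return body.decode(charset)

# GB2312-encoded bytes decoded with the matching charset
raw = "\u4e2d\u6587".encode("gb2312")  # the characters 中文
print(bytes_to_str(raw, "gb2312"))
```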
2. Extract the needed data from the fetched code
The methods I have mastered so far are as follows:
1. Use ASP's built-in Mid function to cut out the needed data. The code is as follows:

Function body(wstr, start, over)
    'Find the position of the unique start mark of the wanted data
    start = InStr(wstr, start)
    'Find the position of the matching unique end mark
    over = InStr(start, wstr, over)
    'Cut the text between the two positions (the start mark is included)
    body = Mid(wstr, start, over - start)
End Function

Call method: body(collected page content, start mark, end mark)
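The mark-and-cut technique translates directly to string slicing in Python. A sketch, kept behavior-compatible with the Mid version above (the start mark is included in the result):

```python
def body(wstr: str, start: str, over: str) -> str:
    """Cut out the text from the start mark up to (not including)
    the end mark, like the Mid/InStr version above."""
    s = wstr.find(start)        # position of the unique start mark
    e = wstr.find(over, s)      # position of the matching end mark
    return wstr[s:e]

html = "<title>webpage title</title>"
print(body(html, "<title>", "</title>"))
```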
2. Use a regular expression to obtain the needed data. The code is as follows:

Function body(wstr, start, over)
    Dim xiaoqi, Matches, Match
    Set xiaoqi = New RegExp                'create the regular expression object
    xiaoqi.IgnoreCase = True               'case-insensitive
    xiaoqi.Global = True                   'search the whole text
    xiaoqi.Pattern = start & ".+?" & over  'non-greedy match between the marks
    Set Matches = xiaoqi.Execute(wstr)
    Set xiaoqi = Nothing
    body = ""
    For Each Match In Matches
        body = body & Match.Value          'concatenate each match
    Next
End Function

Call method: body(collected page content, start mark, end mark)
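The same non-greedy start…end pattern with Python's re module, as a sketch. Here the marks are assumed to be literal text, so they are escaped before being joined into the pattern (the VBScript version above treats them as raw pattern text):

```python
import re

def body(wstr: str, start: str, over: str) -> str:
    """Concatenate every non-greedy match between the start and
    end marks, marks included, like the RegExp version above."""
    pattern = re.escape(start) + ".+?" + re.escape(over)
    return "".join(re.findall(pattern, wstr, re.IGNORECASE | re.DOTALL))

html = "<b>one</b> x <b>two</b>"
print(body(html, "<b>", "</b>"))
```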
The idea of the collection program is as follows:

1. Get the address of every page in the website's page list
The paging addresses of most dynamic websites follow a rule, for example:

Dynamic pages
Page 1: index.asp?page=1
Page 2: index.asp?page=2
Page 3: index.asp?page=3
.....

Static pages
Page 1: page_1.htm
Page 2: page_2.htm
Page 3: page_3.htm
.....

To get the address of every page in the website's page list, you only need to replace the changing part of the address with a variable, for example: page_<%=page%>.htm
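Generating the page-list addresses by substituting the changing part is a one-line loop. A sketch in Python, using the two hypothetical URL patterns from the examples above:

```python
def page_urls(template: str, pages: int) -> list:
    """Fill the changing page number into a URL template such as
    'index.asp?page={}' or 'page_{}.htm'."""
    return [template.format(n) for n in range(1, pages + 1)]

print(page_urls("index.asp?page={}", 3))
print(page_urls("page_{}.htm", 3))
```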

2. Fetch the page content of the collected website
3. Extract the URL links of the content pages from the page-list code
On most pages the content-page links also follow a fixed rule, for example:

<a href="url1">link 1</a><br>
<a href="url2">link 2</a><br>
<a href="url3">link 3</a><br>

Use code like the following to obtain the set of URL links. The code is as follows:

Set xiaoqi = New RegExp
xiaoqi.IgnoreCase = True
xiaoqi.Global = True
xiaoqi.Pattern = """.+?"""   'matches quoted strings, e.g. the "url" in href="url"
Set Matches = xiaoqi.Execute(page list content)
Set xiaoqi = Nothing
url = ""
For Each Match In Matches
    url = url & Match.Value
Next
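The same quoted-string extraction in Python, as a sketch over the sample list markup above. Matching every quoted string is the crude approach the snippet above takes; a stricter pattern would anchor on href= to avoid picking up other attributes:

```python
import re

page_list = (
    '<a href="url1">link 1</a><br>'
    '<a href="url2">link 2</a><br>'
    '<a href="url3">link 3</a><br>'
)

# Non-greedy match of anything between double quotes,
# like Pattern = """.+?""" in the VBScript snippet above
urls = re.findall(r'"(.+?)"', page_list)
print(urls)
```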

4. Fetch the content of each collected content page, and cut out the needed data from it according to the "extraction marks".

Because the pages are dynamically generated, most content pages share the same HTML tags, so we can extract each part of the content using these regular tags.
For example:
Every page has a title, <title>webpage title</title>. The value between <title> and </title> can be obtained with the Mid-based cutting function written above, or with a regular expression.
Example: body("<title>webpage title</title>", "<title>", "</title>")
