Commonly used ASP code for collecting HTML content, with a detailed look at regular-expression collection

Source: Internet
Author: User
Tags: regular expression
How collection works:

The collection process has two main steps:

First, fetch the content of the page to be collected
Second, extract the required data from the fetched code

First, fetch the content of the page to be collected

These are the methods I commonly use in ASP to fetch the content of the page to be collected:

1. Fetch the data with the ServerXMLHTTP component
The code is as follows:

Function GetBody(weburl)
    ' Create the object
    Dim objXmlHttp
    Set objXmlHttp = Server.CreateObject("Msxml2.ServerXMLHTTP")
    ' Request the file asynchronously (third argument True)
    objXmlHttp.open "GET", weburl, True
    objXmlHttp.send
    ' Wait until the response is complete; waitForResponse takes a timeout in seconds
    While objXmlHttp.readyState <> 4
        objXmlHttp.waitForResponse 1
    Wend
    ' Get the result as raw bytes
    GetBody = objXmlHttp.responseBody
    ' Release the object
    Set objXmlHttp = Nothing
End Function
Usage:
GetBody(URL of the file)

2. Or fetch the data with the XMLHTTP component
The code is as follows:

Function GetBody(weburl)
    ' Create the object
    Dim retrieval
    Set retrieval = CreateObject("Microsoft.XMLHTTP")
    With retrieval
        ' Synchronous request (third argument False), no user name or password
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .responseBody
    End With
    ' Release the object
    Set retrieval = Nothing
End Function

Usage:
GetBody(URL of the file)

The data fetched this way is raw bytes, so it must be converted to the correct character encoding before it can be used.
The code is as follows:

Function BytesToBstr(body, cset)
    Dim objStream
    Set objStream = Server.CreateObject("ADODB.Stream")
    objStream.Type = 1          ' Binary mode
    objStream.Mode = 3          ' Read/write
    objStream.Open
    objStream.Write body
    objStream.Position = 0
    objStream.Type = 2          ' Switch to text mode
    objStream.Charset = cset    ' Decode the bytes with this encoding
    BytesToBstr = objStream.ReadText
    objStream.Close
    Set objStream = Nothing
End Function

Usage: BytesToBstr(data to convert, encoding)   ' the encoding is usually GB2312 or UTF-8
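
The encoding passed to the conversion function has to match what the server actually sent. As a minimal sketch, assuming the server states the encoding in its Content-Type response header, the helper below pulls the charset out of that header value. GuessCharset and its defaultCharset parameter are hypothetical names introduced here for illustration; they are not part of the original code.

```vbscript
' Hypothetical helper: extract the charset from a Content-Type header value
' such as "text/html; charset=gb2312". Returns defaultCharset if none is found.
Function GuessCharset(contentType, defaultCharset)
    Dim pos, value, semi
    GuessCharset = defaultCharset
    pos = InStr(1, contentType, "charset=", vbTextCompare)
    If pos = 0 Then Exit Function
    ' Keep everything after "charset="
    value = Mid(contentType, pos + Len("charset="))
    ' Cut off any further header parameter
    semi = InStr(value, ";")
    If semi > 0 Then value = Left(value, semi - 1)
    ' Strip optional quotes and surrounding spaces
    value = Trim(Replace(value, """", ""))
    If Len(value) > 0 Then GuessCharset = value
End Function
```

Inside GetBody, before the object is released, the header could be read with objXmlHttp.getResponseHeader("Content-Type") and passed to this helper. Pages that declare their encoding only in a meta tag would still need the encoding supplied by hand.
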
Second, extract the required data from the fetched code
I currently use the following methods:
1. Use ASP's built-in Mid function to cut out the required data
The code is as follows:

Function Body(wstr, start, over)
    Dim startPos, endPos
    Body = ""
    ' Find the unique start tag of the data that needs to be processed
    startPos = InStr(1, wstr, start, vbTextCompare)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(start)
    ' Find the unique end tag that corresponds to the start tag
    endPos = InStr(startPos, wstr, over, vbTextCompare)
    If endPos = 0 Then Exit Function
    ' Return the text between the two tags
    Body = Mid(wstr, startPos, endPos - startPos)
End Function

Usage: Body(content of the collected page, start tag, end tag)
2. Use a regular expression to get the required data
The code is as follows:

Function Body(wstr, start, over)
    Dim xiaoqi, matches, match
    Set xiaoqi = New RegExp                 ' Create the RegExp object
    xiaoqi.IgnoreCase = True                ' Ignore case
    xiaoqi.Global = True                    ' Search the whole text
    xiaoqi.Pattern = start & ".+?" & over   ' Start tag, shortest possible content, end tag
    Set matches = xiaoqi.Execute(wstr)      ' Run the search
    Set xiaoqi = Nothing
    Body = ""
    For Each match In matches
        Body = Body & match.Value           ' Append every match
    Next
End Function

Usage: Body(content of the collected page, start tag, end tag)
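
Because the start and end tags are inserted directly into the pattern, any regular-expression metacharacter they contain (for example ?, ( or .) changes what is matched. As a minimal sketch, the helper below escapes those characters so a tag is matched literally. EscapeRegex is a hypothetical name introduced here; it is not part of the original code.

```vbscript
' Hypothetical helper: escape regex metacharacters so the text is matched literally.
Function EscapeRegex(text)
    Dim re
    Set re = New RegExp
    re.Global = True
    ' Characters that have a special meaning in VBScript regular expressions
    re.Pattern = "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])"
    ' Put a backslash in front of every such character
    EscapeRegex = re.Replace(text, "\$1")
    Set re = Nothing
End Function
```

The pattern could then be built as EscapeRegex(start) & "[\s\S]+?" & EscapeRegex(over). Using [\s\S] in place of the dot is a further assumption about what is wanted: the dot does not match line breaks in VBScript regular expressions, so a block that spans several lines is only found with [\s\S].
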
Detailed approach to the collection procedure:
1. Get the addresses of the website's paginated list pages
The paging addresses of most dynamic websites follow a pattern, for example:
Dynamic page
First page: index.asp?page=1
Second page: index.asp?page=2
Third page: index.asp?page=3
.....

Static page
First page: page_1.htm
Second page: page_2.htm
Third page: page_3.htm
.....
To get the addresses of the site's list pages, replace the part of the address that changes from page to page with a variable, for example: page_<%=page%>.htm
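
The substitution described above can be sketched as a loop over the page numbers. BuildPageUrls, the {page} placeholder and the example.com address are assumptions made for illustration; the sketch also assumes firstPage is not greater than lastPage.

```vbscript
' Hypothetical helper: build the list-page addresses for pages firstPage..lastPage.
' urlTemplate contains the placeholder {page} where the page number goes.
Function BuildPageUrls(urlTemplate, firstPage, lastPage)
    Dim urls(), page, i
    ReDim urls(lastPage - firstPage)
    i = 0
    For page = firstPage To lastPage
        urls(i) = Replace(urlTemplate, "{page}", CStr(page))
        i = i + 1
    Next
    BuildPageUrls = urls
End Function

' Usage: write out the addresses of list pages 1 to 3
Dim pageUrl
For Each pageUrl In BuildPageUrls("http://www.example.com/page_{page}.htm", 1, 3)
    Response.Write Server.HTMLEncode(pageUrl) & "<br>"
Next
```

The same helper covers the dynamic case with a template such as index.asp?page={page}.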

2. Fetch the content of each list page of the website being collected
3. Extract the URLs of the content pages from the list-page code
The links to content pages in most list pages also follow a fixed pattern, for example:
<a href="url1">Link 1</a><br>
<a href="url2">Link 2</a><br>
<a href="url3">Link 3</a><br>

You can collect the URLs of these links with the following code
The code is as follows:

Dim xiaoqi, matches, match, url
Set xiaoqi = New RegExp
xiaoqi.IgnoreCase = True
xiaoqi.Global = True
xiaoqi.Pattern = """.+?"""                      ' Any text enclosed in double quotes
Set matches = xiaoqi.Execute(pageListContent)   ' pageListContent holds the list-page code
Set xiaoqi = Nothing
url = ""
For Each match In matches
    url = url & match.Value
Next
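
The pattern above matches every quoted string on the page, not only link addresses, and the matches keep their quotes. As a minimal sketch, assuming the links look like the sample above (an a tag whose href value is in double quotes), the variant below captures only the href values and returns them as an array. ExtractLinks is a hypothetical name introduced here.

```vbscript
' Hypothetical helper: return the href values of all <a> tags as an array.
Function ExtractLinks(html)
    Dim re, matches, match, links(), i
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    ' Capture the text between the double quotes that follow href=
    re.Pattern = "<a\s[^>]*href\s*=\s*""([^""]*)"""
    Set matches = re.Execute(html)
    Set re = Nothing
    If matches.Count = 0 Then
        ExtractLinks = Array()          ' No links found: return an empty array
        Exit Function
    End If
    ReDim links(matches.Count - 1)
    i = 0
    For Each match In matches
        links(i) = match.SubMatches(0)  ' First capture group: the URL itself
        i = i + 1
    Next
    ExtractLinks = links
End Function
```

Returning an array rather than one concatenated string lets the next step loop over the addresses with For Each. Links written with single quotes or without quotes are not covered by this pattern.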

4. Fetch each content page and, using the "extraction tags", cut the required data out of the collected page content

Because the pages are generated dynamically, most content pages share the same HTML tags, and we can extract the parts we need based on these regular tags.
For example:
Every page has a title, <title>page title</title>. The Mid-based function written above can be used to get the value between <title> and </title>, or it can be obtained with a regular expression.
Example: Body("<title>page title</title>", "<title>", "</title>")
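
For the regular-expression route, a capture group returns only the text between the tags. The function below is a minimal sketch of that idea; ExtractTitle is a hypothetical name introduced here, and trimming the result is a design choice rather than something the original code does.

```vbscript
' Hypothetical helper: return the text between <title> and </title>,
' or an empty string if the page has no title.
Function ExtractTitle(html)
    Dim re, matches
    ExtractTitle = ""
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False               ' The first title is enough
    ' [\s\S]*? also matches line breaks, which the dot does not
    re.Pattern = "<title[^>]*>([\s\S]*?)</title>"
    Set matches = re.Execute(html)
    If matches.Count > 0 Then ExtractTitle = Trim(matches(0).SubMatches(0))
    Set re = Nothing
End Function
```

Called as ExtractTitle("<title> page title </title>"), it returns the string page title without the surrounding spaces.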