Collection Principle:
The main steps of a collection program are as follows:
1. Get the content of the collected page
2. Extract the required data from the retrieved code
1. Get the content of the collected page
In ASP, the content of a collected page is commonly obtained in one of two ways:
1. Use the ServerXMLHTTP component to obtain the data
Function getbody(weburl)
'--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Dim objxmlhttp
    Set objxmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'Request the page; False as the third argument makes the call synchronous
    objxmlhttp.Open "GET", weburl, False
    objxmlhttp.Send
    'After a synchronous Send, readyState is normally already 4, so this loop is only a safeguard
    While objxmlhttp.readyState <> 4
        objxmlhttp.waitForResponse 1000
    Wend
    'Return the response as raw bytes
    getbody = objxmlhttp.responseBody
    'Release the object
    Set objxmlhttp = Nothing
'--------------- Xiao zhenkai (Xiao Qi)
End Function
Call it as: getbody(page URL)
2. Or use the XMLHTTP component to obtain the data
Function getbody(weburl)
'--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Dim retrieval
    Set retrieval = CreateObject("Microsoft.XMLHTTP")
    With retrieval
        'Synchronous GET request with empty user name and password
        .Open "GET", weburl, False, "", ""
        .Send
        'Return the response as raw bytes
        getbody = .responseBody
    End With
    'Release the object
    Set retrieval = Nothing
'--------------- Xiao zhenkai (Xiao Qi)
End Function
Call it as: getbody(page URL)
The retrieved content is raw bytes; it can be used only after it has been converted to a string in the page's character encoding.
Function bytestobstr(body, cset)
'--------------- Xiao zhenkai (Xiao Qi)
    Dim objstream
    Set objstream = Server.CreateObject("ADODB.Stream")
    objstream.Type = 1        '1 = binary
    objstream.Mode = 3        '3 = read/write
    objstream.Open
    objstream.Write body
    'Rewind, then switch the stream to text mode with the given charset
    objstream.Position = 0
    objstream.Type = 2        '2 = text
    objstream.Charset = cset
    bytestobstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
'--------------- Xiao zhenkai (Xiao Qi)
End Function
Call it as: bytestobstr(data to be converted, encoding) 'the encoding is usually gb2312 or UTF-8
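The fetch and conversion steps can also be combined into a single helper. The following is a minimal sketch rather than the author's code: the name fetchtext is hypothetical, and returning an empty string for any response other than HTTP 200 with a non-empty body is an assumption. It is meant to be placed inside an ASP script block.

```vbscript
' Hypothetical helper: fetch a page and return it as text in the given charset.
' Assumption: anything other than HTTP 200 with a non-empty body yields "".
Function fetchtext(weburl, cset)
    Dim http, stream
    fetchtext = ""
    Set http = Server.CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "GET", weburl, False      ' False = synchronous request
    http.Send
    If http.Status = 200 Then
        If LenB(http.responseBody) > 0 Then
            Set stream = Server.CreateObject("ADODB.Stream")
            stream.Type = 1             ' 1 = binary
            stream.Mode = 3             ' 3 = read/write
            stream.Open
            stream.Write http.responseBody
            stream.Position = 0         ' rewind before switching to text mode
            stream.Type = 2             ' 2 = text
            stream.Charset = cset
            fetchtext = stream.ReadText
            stream.Close
            Set stream = Nothing
        End If
    End If
    Set http = Nothing
End Function
```

Call it as: fetchtext("http://example.com/index.asp?page=1", "gb2312"), where the address is only a placeholder.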
2. Extract the required data from the retrieved code
1. Use the Mid function built into ASP (VBScript) to cut out the required data
Function body(wstr, start, over)
'--------------- Xiao zhenkai (Xiao Qi)
    'newstring is a helper that is not shown in this section; it is assumed to return
    'the position of a mark within wstr, as InStr does
    start = newstring(wstr, start)
    'Position of the unique start mark of the data to be extracted
    over = newstring(wstr, over)
    'Position of the unique end mark that corresponds to the start mark
    body = Mid(wstr, start, over - start)
    'Take the text from the start mark up to, but not including, the end mark
'--------------- Xiao zhenkai (Xiao Qi)
End Function
Call it as: body(content of the collected page, start mark, end mark)
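The body function relies on newstring, which is not defined in this section. The following is a minimal sketch of such a helper, assuming it only needs to return the 1-based position of a mark within the text. The case-insensitive comparison is an assumption, not the author's code.

```vbscript
' Hypothetical implementation of newstring (assumed behaviour, see above).
Function newstring(wstr, strng)
    ' InStr returns the 1-based position of strng in wstr, or 0 if it is not found.
    ' vbTextCompare makes the search case-insensitive (an assumption).
    newstring = InStr(1, wstr, strng, vbTextCompare)
End Function
```

With this helper, body("<b>hello</b>", "<b>", "</b>") returns "<b>hello": the start mark is included and the end mark is not. If either mark is missing, InStr returns 0 and Mid raises a runtime error, so real code would check both positions before calling Mid.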
2. Use regular expressions to obtain the required data
Function body(wstr, start, over)
'--------------- Xiao zhenkai (Xiao Qi)
    Dim xiaoqi, matches, match
    Set xiaoqi = New RegExp            'Create the RegExp object
    xiaoqi.IgnoreCase = True           'Case-insensitive
    xiaoqi.Global = True               'Search the whole text
    xiaoqi.Pattern = "" & start & ".+?" & over & ""   'Regular expression: start mark, shortest possible text, end mark
    Set matches = xiaoqi.Execute(wstr) 'Run the search
    Set xiaoqi = Nothing
    body = ""
    For Each match In matches
        body = body & match.Value      'Append each match in turn
    Next
'--------------- Xiao zhenkai (Xiao Qi)
End Function
Call it as: body(content of the collected page, start mark, end mark)
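Two details of the regular-expression approach deserve care. Marks that contain characters such as ?, ( or . are interpreted as regex syntax, and in VBScript's RegExp the dot does not match line breaks, so ".+?" cannot span several lines. The following sketch addresses both. The names escaperegex and bodymulti are hypothetical, and using [\s\S] in place of the dot is a swapped-in technique, not the author's code.

```vbscript
' Hypothetical helper: put a backslash in front of every regex metacharacter.
Function escaperegex(s)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "([\\\^\$\.\|\?\*\+\(\)\[\]\{\}])"
    ' In the replacement text the backslash is literal and $1 is the matched character
    escaperegex = re.Replace(s, "\$1")
    Set re = Nothing
End Function

' Hypothetical variant of body: marks are treated literally and
' the text between them may span line breaks.
Function bodymulti(wstr, start, over)
    Dim re, matches, m
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    ' [\s\S] matches any character, including line breaks
    re.Pattern = escaperegex(start) & "[\s\S]+?" & escaperegex(over)
    Set matches = re.Execute(wstr)
    bodymulti = ""
    For Each m In matches
        bodymulti = bodymulti & m.Value
    Next
    Set re = Nothing
End Function
```

Call it as: bodymulti(content of the collected page, start mark, end mark). As with the regex version above, each match includes both marks.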
The design of a collection program is as follows:
1. Get the address of each page in the website's paginated list
The paging addresses of most dynamic websites follow a pattern, for example:
Dynamic pages
Page 1: index.asp?page=1
Page 2: index.asp?page=2
Page 3: index.asp?page=3
.....
Static pages
Page 1: page_1.htm
Page 2: page_2.htm
Page 3: page_3.htm
.....
To get the address of each list page, replace the part of the address that changes from page to page with a variable, for example page_<%= page %>.htm
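The substitution described above can be sketched as a simple loop. The host name, the page count and the variable names below are placeholders for illustration, and the code is meant to run inside an ASP script block.

```vbscript
' Build the list-page addresses for pages 1 to 3 (placeholder range).
Dim page, dynamicurl, staticurl
For page = 1 To 3
    ' Dynamic form: the page number is a query-string value
    dynamicurl = "http://example.com/index.asp?page=" & page
    ' Static form: the page number is part of the file name
    staticurl = "http://example.com/page_" & page & ".htm"
    Response.Write Server.HTMLEncode(dynamicurl) & "<br>"
    Response.Write Server.HTMLEncode(staticurl) & "<br>"
Next
```

Each generated address can then be passed to getbody to retrieve that list page.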
2. Retrieve the page content of the collected website
3. Extract from the code of the page list the URLs of the content pages to be collected
The links to content pages also follow a fixed pattern on most list pages, for example:
<a href="url1">link 1</a><br>
<a href="url2">link 2</a><br>
<a href="url3">link 3</a><br>
Use the following code to obtain the set of URLs
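'--------------- Xiao zhenkai (Xiao Qi)
Dim xiaoqi, matches, match, url
Set xiaoqi = New RegExp
xiaoqi.IgnoreCase = True
xiaoqi.Global = True
'Match each double-quoted string, such as "url1"
xiaoqi.Pattern = """.+?"""
'listcontent is assumed to hold the content of the page list
Set matches = xiaoqi.Execute(listcontent)
Set xiaoqi = Nothing
url = ""
For Each match In matches
    'Each match still includes its surrounding quotes
    url = url & match.Value
Next
'--------------- Xiao zhenkai (Xiao Qi)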
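The sample links above carry their target in a double-quoted href attribute. The following is a minimal sketch of an extraction that returns only those targets, assuming every link is written that way. The function name getlinks and the line-break separator are hypothetical choices.

```vbscript
' Hypothetical helper: return the href values found in html, one per line.
Function getlinks(html)
    Dim re, matches, m
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    ' Capture the text between the quotes of href="..."
    re.Pattern = "href\s*=\s*""([^""]+)"""
    Set matches = re.Execute(html)
    getlinks = ""
    For Each m In matches
        ' SubMatches(0) is the first capture group: the URL without its quotes
        getlinks = getlinks & m.SubMatches(0) & vbCrLf
    Next
    Set re = Nothing
End Function
```

Split(getlinks(html), vbCrLf) then gives an array of URLs. Because every URL is followed by a line break, the last array element is empty and should be skipped.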
4. Retrieve each content page and extract the required data from it according to the "extraction marks".
Because the pages are generated dynamically, most content pages share the same HTML tags, so the content of each part can be extracted according to these regular tags. For example:
Each page has a title in the form <title>webpage title</title>. Use the Mid-based function written above, or the regular expression version, to obtain the value between <title> and </title>.
Example: body("<title>webpage title</title>", "<title>", "</title>")