Data Collection (from blueidea)

Source: Internet
Author: User
Tags: file, URL

Collection Principle:

A collection (scraping) program has two main steps:

1. Retrieve the content of the page to be collected
2. Extract the required data from the retrieved code

1. Retrieve the content of the page to be collected

In ASP, the following components are commonly used to retrieve page content:

1. Use the MSXML2.ServerXMLHTTP component to fetch the data

Function GetBody(weburl)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Dim objXmlHttp
    Set objXmlHttp = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'Request the file asynchronously; the loop below waits for completion
    objXmlHttp.Open "GET", weburl, True
    objXmlHttp.Send
    While objXmlHttp.readyState <> 4
        objXmlHttp.waitForResponse 1000
    Wend
    'Return the result
    GetBody = objXmlHttp.responseBody
    'Release the object
    Set objXmlHttp = Nothing
End Function

Call method: GetBody(file URL)

2. Use the Microsoft.XMLHTTP component to fetch the data

Function GetBody(weburl)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Set Retrieval = CreateObject("Microsoft.XMLHTTP")
    With Retrieval
        'Request the file synchronously
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .responseBody
    End With
    'Release the object
    Set Retrieval = Nothing
End Function

Call method: GetBody(file URL)

The retrieved data is a binary byte stream; it must be converted to a string with the correct character encoding before it can be used:

Function BytesToBstr(body, cset)
    '--------------- Xiao zhenkai (Xiao Qi)
    Dim objStream
    Set objStream = Server.CreateObject("ADODB.Stream")
    objStream.Type = 1        '1 = adTypeBinary
    objStream.Mode = 3        '3 = adModeReadWrite
    objStream.Open
    objStream.Write body
    objStream.Position = 0
    objStream.Type = 2        '2 = adTypeText
    objStream.Charset = cset
    BytesToBstr = objStream.ReadText
    objStream.Close
    Set objStream = Nothing
End Function

Call method: BytesToBstr(data to convert, charset) 'the charset is usually "GB2312" or "UTF-8"
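In practice the two helpers are chained. A minimal sketch, assuming a GB2312-encoded target page (the URL is a placeholder):

```vbscript
'Fetch a page and convert the byte stream to a usable string
Dim wcode
wcode = BytesToBstr(GetBody("http://example.com/index.asp?page=1"), "GB2312")
Response.Write wcode
```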

2. Extract the required data from the retrieved code

1. Use ASP's built-in Mid function to cut out the required data

Function Body(wstr, start, over)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Position of the unique start mark of the target data
    start = NewString(wstr, start)
    'Position of the matching unique end mark
    over = NewString(wstr, over)
    'Cut out the text between the two positions
    Body = Mid(wstr, start, over - start)
End Function

'Helper used above: case-insensitive InStr that falls back to the string length
Function NewString(wstr, strng)
    NewString = InStr(LCase(wstr), LCase(strng))
    If NewString <= 0 Then NewString = Len(wstr)
End Function

Call method: Body(content of the collected page, start mark, end mark)

2. Use regular expressions to obtain the required data

Function Body(wstr, start, over)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create and configure the RegExp object
    Set Xiaoqi = New RegExp
    Xiaoqi.IgnoreCase = True    'case-insensitive
    Xiaoqi.Global = True        'search the whole string
    Xiaoqi.Pattern = start & ".+?" & over
    Set Matches = Xiaoqi.Execute(wstr)
    Set Xiaoqi = Nothing
    Body = ""
    'Concatenate every match
    For Each Match In Matches
        Body = Body & Match.Value
    Next
End Function

Call method: Body (content of the collected page, start mark, end mark)

The overall flow of the collection program is as follows:

1. Get the address of every page in the website's list pages
The paging addresses of most dynamic websites follow a pattern, for example:

Dynamic pages:
Page 1: index.asp?page=1
Page 2: index.asp?page=2
Page 3: index.asp?page=3
...

Static pages:
Page 1: page_1.htm
Page 2: page_2.htm
Page 3: page_3.htm
...

To get the address of each list page, simply replace the changing part of the address with a variable, for example page_<%=page%>.htm.
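The list-page addresses can then be generated in a loop. A sketch, where totalPages and the URL pattern are assumptions:

```vbscript
Dim page, listUrl
For page = 1 To totalPages
    'Substitute the changing part of the address with the loop variable
    listUrl = "page_" & page & ".htm"
    'listUrl = "index.asp?page=" & page   'dynamic-page variant
    '... fetch and process listUrl here ...
Next
```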

2. Retrieve the page content of the collected website

3. From the code of each list page, extract the URLs of the content pages to be collected
On most pages, the links to content pages also follow a fixed pattern, for example:

<a href="url1">link 1</a><br>
<a href="url2">link 2</a><br>
<a href="url3">link 3</a><br>

Use the following code to obtain the set of URL links:

'--------------- Xiao zhenkai (Xiao Qi)
'listPageContent holds the HTML of the list page
Set Xiaoqi = New RegExp
Xiaoqi.IgnoreCase = True
Xiaoqi.Global = True
'Match every double-quoted value, i.e. the href URLs
Xiaoqi.Pattern = """.+?"""
Set Matches = Xiaoqi.Execute(listPageContent)
Set Xiaoqi = Nothing
URL = ""
For Each Match In Matches
    URL = URL & Match.Value
Next
'--------------- Xiao zhenkai (Xiao Qi)

4. Retrieve each collected content page and, based on the "extraction marks", cut the target data out of the page.

Because the content pages are dynamically generated, most of them share the same HTML tags, so we can extract each part of the content according to these regular tags. For example:

Every page has a title between <title> and </title>; use the Mid-based truncation function written above (or the regular-expression version) to obtain the value between those two marks.
Example: Body(content of the collected page, "<title>", "</title>")
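Putting the pieces together, a minimal end-to-end sketch using the functions defined above (the URL and charset are placeholder assumptions):

```vbscript
Dim html, title
'1. Fetch the content page and convert its encoding
html = BytesToBstr(GetBody("http://example.com/article_1.htm"), "GB2312")
'2. Cut out the text between the start mark and the end mark
title = Body(html, "<title>", "</title>")
Response.Write title
```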
