Data Collection (from blueidea)

Source: Internet
Author: User
Tags: file, URL

Collection Principle:

A collection (scraping) program has two main steps:

1. Retrieve the content of the page to be collected
2. Extract the required data from the retrieved code

1. Retrieve the content of the page to be collected

In ASP, the following components are commonly used to retrieve page content:

1. Use the MSXML2.ServerXMLHTTP component to fetch the data

Function GetBody(weburl)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Dim objXmlHttp
    Set objXmlHttp = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'Request the file asynchronously; the loop below waits for completion
    objXmlHttp.Open "GET", weburl, True
    objXmlHttp.Send
    While objXmlHttp.readyState <> 4
        objXmlHttp.waitForResponse 1000
    Wend
    'Return the result
    GetBody = objXmlHttp.responseBody
    'Release the object
    Set objXmlHttp = Nothing
End Function

Call method: GetBody(file URL)

2. Use the Microsoft.XMLHTTP component to fetch the data

Function GetBody(weburl)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create the object
    Set Retrieval = CreateObject("Microsoft.XMLHTTP")
    With Retrieval
        'Request the file synchronously
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .responseBody
    End With
    'Release the object
    Set Retrieval = Nothing
End Function

Call method: GetBody(file URL)

The retrieved data is a binary byte stream; it must be converted to a string with the correct character encoding before it can be used:

Function BytesToBstr(body, cset)
    '--------------- Xiao zhenkai (Xiao Qi)
    Dim objStream
    Set objStream = Server.CreateObject("ADODB.Stream")
    objStream.Type = 1        '1 = adTypeBinary
    objStream.Mode = 3        '3 = adModeReadWrite
    objStream.Open
    objStream.Write body
    objStream.Position = 0
    objStream.Type = 2        '2 = adTypeText
    objStream.Charset = cset
    BytesToBstr = objStream.ReadText
    objStream.Close
    Set objStream = Nothing
End Function

Call method: BytesToBstr(data to convert, charset) 'the charset is usually "GB2312" or "UTF-8"
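In practice the two helpers are chained. A minimal sketch, assuming a GB2312-encoded target page (the URL is a placeholder):

```vbscript
'Fetch a page and convert the byte stream to a usable string
Dim wcode
wcode = BytesToBstr(GetBody("http://example.com/index.asp?page=1"), "GB2312")
Response.Write wcode
```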

2. Extract the required data from the retrieved code

1. Use ASP's built-in Mid function to cut out the required data

Function Body(wstr, start, over)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Position of the unique start mark of the target data
    start = NewString(wstr, start)
    'Position of the matching unique end mark
    over = NewString(wstr, over)
    'Cut out the text between the two positions
    Body = Mid(wstr, start, over - start)
End Function

'Helper used above: case-insensitive InStr that falls back to the string length
Function NewString(wstr, strng)
    NewString = InStr(LCase(wstr), LCase(strng))
    If NewString <= 0 Then NewString = Len(wstr)
End Function

Call method: Body(content of the collected page, start mark, end mark)

2. Use regular expressions to obtain the required data

Function Body(wstr, start, over)
    '--------------- Xiao zhenkai (Xiao Qi)
    'Create and configure the RegExp object
    Set Xiaoqi = New RegExp
    Xiaoqi.IgnoreCase = True    'case-insensitive
    Xiaoqi.Global = True        'search the whole string
    Xiaoqi.Pattern = start & ".+?" & over
    Set Matches = Xiaoqi.Execute(wstr)
    Set Xiaoqi = Nothing
    Body = ""
    'Concatenate every match
    For Each Match In Matches
        Body = Body & Match.Value
    Next
End Function

Call method: Body (content of the collected page, start mark, end mark)

The overall flow of the collection program is as follows:

1. Get the address of every page in the website's list pages
The paging addresses of most dynamic websites follow a pattern, for example:

Dynamic pages:
Page 1: index.asp?page=1
Page 2: index.asp?page=2
Page 3: index.asp?page=3
...

Static pages:
Page 1: page_1.htm
Page 2: page_2.htm
Page 3: page_3.htm
...

To get the address of each list page, simply replace the changing part of the address with a variable, for example page_<%=page%>.htm.
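The list-page addresses can then be generated in a loop. A sketch, where totalPages and the URL pattern are assumptions:

```vbscript
Dim page, listUrl
For page = 1 To totalPages
    'Substitute the changing part of the address with the loop variable
    listUrl = "page_" & page & ".htm"
    'listUrl = "index.asp?page=" & page   'dynamic-page variant
    '... fetch and process listUrl here ...
Next
```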

2. Retrieve the page content of the collected website

3. From the code of each list page, extract the URLs of the content pages to be collected
On most pages, the links to content pages also follow a fixed pattern, for example:

<a href="url1">link 1</a><br>
<a href="url2">link 2</a><br>
<a href="url3">link 3</a><br>

Use the following code to obtain the set of URL links:

'--------------- Xiao zhenkai (Xiao Qi)
'listPageContent holds the HTML of the list page
Set Xiaoqi = New RegExp
Xiaoqi.IgnoreCase = True
Xiaoqi.Global = True
'Match every double-quoted value, i.e. the href URLs
Xiaoqi.Pattern = """.+?"""
Set Matches = Xiaoqi.Execute(listPageContent)
Set Xiaoqi = Nothing
URL = ""
For Each Match In Matches
    URL = URL & Match.Value
Next
'--------------- Xiao zhenkai (Xiao Qi)

4. Retrieve each collected content page and, based on the "extraction marks", cut the target data out of the page.

Because the content pages are dynamically generated, most of them share the same HTML tags, so we can extract each part of the content according to these regular tags. For example:

Every page has a title between <title> and </title>; use the Mid-based truncation function written above (or the regular-expression version) to obtain the value between those two marks.
Example: Body(content of the collected page, "<title>", "</title>")
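Putting the pieces together, a minimal end-to-end sketch using the functions defined above (the URL and charset are placeholder assumptions):

```vbscript
Dim html, title
'1. Fetch the content page and convert its encoding
html = BytesToBstr(GetBody("http://example.com/article_1.htm"), "GB2312")
'2. Cut out the text between the start mark and the end mark
title = Body(html, "<title>", "</title>")
Response.Write title
```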
