Static Page Generation, Strategy 2: Data Acquisition (Scraper/Collection Programs)

Source: Internet
Author: User

Acquisition principle:
The main steps of an acquisition (scraper) program are as follows:
    I. Get the content of the target page
    II. Extract the required data from the retrieved code

I. Get the content of the target page
    The methods I currently use in ASP to obtain the content of a page:
    1. Use the ServerXMLHTTP component to get the data

Function GetBody(weburl)
'----------------- Xiaoqi
    ' Create the object
    Dim objxmlhttp
    Set objxmlhttp = Server.CreateObject("MSXML2.ServerXMLHTTP")
    ' Request the file synchronously
    objxmlhttp.Open "GET", weburl, False
    objxmlhttp.Send
    While objxmlhttp.readyState <> 4
        objxmlhttp.waitForResponse 1000
    Wend
    ' Get the result
    GetBody = objxmlhttp.responseBody
    ' Release the object
    Set objxmlhttp = Nothing
'----------------- Xiaoqi
End Function
    Call method: GetBody(URL of the file)
    2. Use the XMLHTTP component to get the data

Function GetBody(weburl)
'----------------- Xiaoqi
    ' Create the object
    Dim Retrieval
    Set Retrieval = CreateObject("Microsoft.XMLHTTP")
    With Retrieval
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .responseBody
    End With
    ' Release the object
    Set Retrieval = Nothing
'----------------- Xiaoqi
End Function
Call method: GetBody(URL of the file)
The data retrieved this way is a raw byte stream; its encoding must be converted before it can be used.

Function BytesToBstr(body, cset)
'----------------- Xiaoqi
    Dim objstream
    Set objstream = Server.CreateObject("ADODB.Stream")
    objstream.Type = 1            ' adTypeBinary
    objstream.Mode = 3            ' adModeReadWrite
    objstream.Open
    objstream.Write body          ' write the raw bytes
    objstream.Position = 0        ' rewind to the start
    objstream.Type = 2            ' switch to adTypeText
    objstream.Charset = cset      ' target character set
    BytesToBstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
'----------------- Xiaoqi
End Function
Call method: BytesToBstr(data to be converted, encoding) ' the encoding is usually GB2312 or UTF-8
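The two functions are used together: GetBody fetches the bytes and BytesToBstr decodes them. A minimal usage sketch, where the URL and the GB2312 encoding are just assumptions for illustration:

Dim sHtml
' Fetch the page as bytes, then decode it with the site's character set
sHtml = BytesToBstr(GetBody("http://www.example.com/index.asp?page=1"), "GB2312")
Response.Write Server.HTMLEncode(Left(sHtml, 200))   ' show the first 200 characters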
II. Extract the required data from the retrieved code
1. Use the ASP built-in Mid function to cut out the required data

Function Body(wstr, start, over)
'----------------- Xiaoqi
    ' Locate the unique start tag of the data to be processed
    start = InStr(wstr, start)
    ' Locate the matching unique end tag
    over = InStr(wstr, over)
    ' Return the slice of the page between the two positions
    Body = Mid(wstr, start, over - start)
'----------------- Xiaoqi
End Function
Call method: Body(content of the captured page, start tag, end tag)
2. Use a regular expression to extract the required data

Function Body(wstr, start, over)
'----------------- Xiaoqi
    Dim Xiaoqi, Matches, Match
    Set Xiaoqi = New RegExp                 ' create the RegExp object
    Xiaoqi.IgnoreCase = True                ' ignore case
    Xiaoqi.Global = True                    ' search the full text
    Xiaoqi.Pattern = start & ".+?" & over   ' regular expression: start tag, shortest possible content, end tag
    Set Matches = Xiaoqi.Execute(wstr)      ' run the match
    Set Xiaoqi = Nothing
    Body = ""
    For Each Match In Matches
        Body = Body & Match.Value           ' concatenate every match
    Next
'----------------- Xiaoqi
End Function
    Call method: Body(content of the captured page, start tag, end tag)
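A minimal usage sketch (the HTML fragment is made up): unlike the Mid version above, this regular-expression version returns every match, with the start and end tags included.

Dim Html, Result
Html = "<li>item one</li><li>item two</li>"   ' hypothetical page fragment
Result = Body(Html, "<li>", "</li>")          ' returns "<li>item one</li><li>item two</li>"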
    Acquisition procedure:
    1. Get the addresses of the target site's paging-list pages
        At present, most dynamically generated sites' paging addresses follow a pattern, for example:
Dynamic page
First page: index.asp?page=1
Second page: index.asp?page=2
Third page: index.asp?page=3
...

Static page
First page: page_1.htm
Second page: page_2.htm
Third page: page_3.htm
.....
To get every paging-list page address of the target site, replace the part of the address that changes from page to page with a variable, e.g. "page_" & page & ".htm", and loop over the page numbers, as in the sketch below.
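A minimal sketch of that loop; the site address, the number of list pages, and the GB2312 encoding are assumptions for illustration:

Dim page, PageUrl, PageHtml
For page = 1 To 10                                            ' assumed number of list pages
    PageUrl = "http://www.example.com/page_" & page & ".htm"  ' build each list-page address
    PageHtml = BytesToBstr(GetBody(PageUrl), "GB2312")        ' fetch and decode the list page
    ' ... extract the content-page links from PageHtml (step 3 below) ...
Next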
2. Get the content of each paging-list page of the target site
3. Extract the content-page URL links from the paging-list code
In most paging-list pages the content-page links also follow a fixed pattern, for example:

<a href="url1">Link 1</a><br>
<a href="url2">Link 2</a><br>
<a href="url3">Link 3</a><br>
You can collect the URL links with the following code:

'----------------- Xiaoqi
Dim Xiaoqi, Matches, Match, Url
Set Xiaoqi = New RegExp
Xiaoqi.IgnoreCase = True
Xiaoqi.Global = True
Xiaoqi.Pattern = """.+?"""                      ' matches every quoted value, i.e. the href addresses
Set Matches = Xiaoqi.Execute(PageHtml)          ' PageHtml holds the paging-list page content
Set Xiaoqi = Nothing
Url = ""
For Each Match In Matches
    Url = Url & Match.Value
Next
'----------------- Xiaoqi
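If it is more convenient to visit each link as it is found, the matches can also be walked one at a time instead of being concatenated. A sketch; stripping the surrounding quotes and the GB2312 encoding are assumptions:

Dim ContentUrl, ContentHtml
For Each Match In Matches
    ContentUrl = Mid(Match.Value, 2, Len(Match.Value) - 2)    ' drop the leading and trailing quote
    ContentHtml = BytesToBstr(GetBody(ContentUrl), "GB2312")  ' fetch and decode the content page
    ' ... hand ContentHtml to step 4 below ...
Next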
4. Get the content of each content page and, using the "extraction tags", cut the required data out of the collected page.
Because the content pages are generated dynamically, most of them share the same HTML tags, and we can extract the parts we need based on these regular tags. For example:
Every page has a title, <title>Page title</title>; the Mid-based function written above can be used to get the value between <title> and </title>, or it can be obtained with the regular-expression version.
Example:

Body("<title>Page title</title>", "<title>", "</title>")
