Using XMLHTTP to write web capture programs ("thief" / collector scripts)

A version of this article with syntax coloring: http://gwx.showus.net/blog/article.asp?id=229

Writing original content is hard work; if you repost this article, please credit the original link: http://gwx.showus.net/blog/article.asp?id=229

Web capture program? Web crawler? "Little thief" program? Whatever you call it, this kind of program is very widely used. This article does not discuss the copyright or ethical issues raised by using such a program, only how to implement one in the ASP + VBScript environment :-)

Prerequisites: besides general ASP + VBScript knowledge, you also need to understand the XMLHTTP object and the regular expression object. The XMLHTTP object is the centerpiece of the currently popular Ajax, and once you have learned regular expressions you no longer need to worry about handling complex strings.

A regular expression testing tool is useful when writing and debugging regular expressions.

Contents
1. Fetch a remote web page and save it locally
   Improvement: handling garbled text
2. Also download the images (and other files) of the remote web page
   Improvement: detecting real URLs
   Improvement: avoiding duplicate downloads
3. A practical example (using *** as the example)
   Analyzing the list page
   Tips for content pages
   Analyzing the previous-page and next-page links in a content page
4. Advanced topic: converting between UTF-8 and GB2312
5. More advanced topics: crawling after logging in, client forgery
6. Your own collection program

1. Fetch a remote web page and save it locally
' Debugging helper; it is called many times to show intermediate results
Dim InDebug : InDebug = True
Sub D(sMsg)
    If InDebug = False Then Exit Sub
    Response.Write("<div style='color:#003399; border:solid 1px #003399; background:#EEF7FF; margin:1px; font-size:12px; padding:4px;'>")
    Response.Write(sMsg & "</div>")
    Response.Flush()
End Sub

' Sub: Save2File
' Purpose: save text or a byte stream as a file
' Parameters: sContent    the content to save
'             sFile       the file to save to, for example "files/abc.htm"
'             bText       whether the content is text
'             bOverWrite  whether to overwrite an existing file
Sub Save2File(sContent, sFile, bText, bOverWrite)
    Call D("Save2File: " & sFile & " * is text: " & bText)
    Dim SaveOption, TypeOption, oAds
    ' 2 = adSaveCreateOverWrite, 1 = adSaveCreateNotExist
    If (bOverWrite = True) Then SaveOption = 2 Else SaveOption = 1
    ' 2 = adTypeText, 1 = adTypeBinary
    If (bText = True) Then TypeOption = 2 Else TypeOption = 1
    Set oAds = Server.CreateObject("ADODB.Stream")
    With oAds
        .Type = TypeOption
        .Open
        If (bText = True) Then .WriteText sContent Else .Write sContent
        .SaveToFile Server.MapPath(sFile), SaveOption
        .Cancel()
        .Close()
    End With
    Set oAds = Nothing
End Sub

' The key function
' Function: MyHttpGet
' Purpose: fetch a remote file (a web page, an image, etc.) and return its content
' Parameters: sUrl   the URL of the remote file
'             bText  whether the file is text (a web page); pass bText=False to download a remote image
' Returns: the fetched content
Function MyHttpGet(sUrl, bText)
    Call D("<font color=red>MyHttpGet:</font> " & sUrl & " * is text: " & bText)
    Dim oXML
    ' Set oXML = Server.CreateObject("Microsoft.XMLHTTP")
    Set oXML = Server.CreateObject("MSXML2.ServerXMLHTTP") ' Server version of the XMLHTTP component
    ' To understand the code below, see the MSXML2.ServerXMLHTTP reference on MSDN
    With oXML
        .Open "GET", sUrl, False
        .Send
        While .ReadyState <> 4 ' Wait until the download is complete
            .waitForResponse 1000
        Wend
        If bText = True Then
            MyHttpGet = Bytes2BStr(.responseBody)
        Else
            MyHttpGet = .responseBody
        End If
    End With
    Set oXML = Nothing
End Function

Improvement: handling garbled text
Reading Chinese content returned by the server directly produces garbled text. The Bytes2BStr function called in MyHttpGet exists to read double-byte text (for example, Chinese) in the returned file correctly.
' MyHttpGet helper: handles double-byte text
Function Bytes2BStr(vIn)
    Dim strReturn, i, ThisCharCode, NextCharCode
    strReturn = ""
    For i = 1 To LenB(vIn)
        ThisCharCode = AscB(MidB(vIn, i, 1))
        If ThisCharCode < &H80 Then
            ' Single-byte (ASCII) character
            strReturn = strReturn & Chr(ThisCharCode)
        Else
            ' Double-byte character: combine this byte with the next one
            NextCharCode = AscB(MidB(vIn, i + 1, 1))
            strReturn = strReturn & Chr(CLng(ThisCharCode) * &H100 + CInt(NextCharCode))
            i = i + 1
        End If
    Next
    Bytes2BStr = strReturn
End Function

The job of Bytes2BStr can also be done with the ADODB.Stream component, using the function below. Although this function lets you specify the character set, it does not convert between encodings: if you pass "UTF-8" as the sCset parameter to read a GB2312-encoded web page, the result is still garbled.
' CharsetHelper correctly reads content encoded in sCset (such as "GB2312", "UTF-8", etc.)
Function CharsetHelper(arrBytes, sCset)
    Call D("CharsetHelper: " & sCset)
    Dim oAdos
    Set oAdos = CreateObject("ADODB.Stream")
    With oAdos
        .Type = 1        ' adTypeBinary
        .Mode = 3        ' adModeReadWrite
        .Open
        .Write arrBytes
        .Position = 0
        .Type = 2        ' adTypeText
        .Charset = sCset
        CharsetHelper = .ReadText
        .Close
    End With
    Set oAdos = Nothing
End Function
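
To see why the charset passed in must match the real encoding of the bytes, the following sketch round-trips a short string through ADODB.Stream. It is only an illustration and not part of the original program: TextToBytes and BytesToText are hypothetical helper names, and the sketch assumes the "gb2312" charset is installed on the machine.

```vbscript
' Encode a string into a byte array using the given charset (hypothetical demo helper).
Function TextToBytes(sText, sCharset)
    Dim oStream
    Set oStream = CreateObject("ADODB.Stream")
    With oStream
        .Type = 2            ' adTypeText
        .Charset = sCharset
        .Open
        .WriteText sText
        .Position = 0        ' Type may only be changed at position 0
        .Type = 1            ' adTypeBinary
        TextToBytes = .Read
        .Close
    End With
    Set oStream = Nothing
End Function

' Decode a byte array into a string, interpreting the bytes in the given charset.
' This follows the same steps as CharsetHelper, without the debug call.
Function BytesToText(arrBytes, sCharset)
    Dim oStream
    Set oStream = CreateObject("ADODB.Stream")
    With oStream
        .Type = 1            ' adTypeBinary
        .Mode = 3            ' adModeReadWrite
        .Open
        .Write arrBytes
        .Position = 0
        .Type = 2            ' adTypeText
        .Charset = sCharset
        BytesToText = .ReadText
        .Close
    End With
    Set oStream = Nothing
End Function
```

Encoding the two characters ChrW(20013) & ChrW(25991) with "gb2312" and decoding the bytes with "gb2312" returns the original string. Decoding the same bytes with "utf-8" does not, which is the garbling described above.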

2. Also download the images (and other files) of the remote web page
' Function: ProcessRemoteUrl
' Purpose: replace the remote files referenced in the string with local files, and save the remote files
' Parameters: strContent  the string to process, that is, the content of the remote web page file
'             sSavePath   the local path in which to save the remote files; a relative path without a trailing "/"
'             sPreceding  the URL prefix after the change, such as ...

Improvement: detecting real URLs
Pages often use relative links, such as yun_qi_img/abc.jpg" /> and <a href="/upload/abc.gif" ... To handle these relative links, we can first use the following function to convert the relative links in the page to absolute links.
' Function: DetectUrl
' Purpose: replace relative paths to remote files in the string with absolute paths beginning with http://
' Parameters: sContent  the text of the web page containing the relative paths to process
'             sUrl      the URL of the remote web page itself, used to resolve the relative paths
' Returns: the page text with relative links replaced by absolute links
Function DetectUrl(sContent, sUrl)
    Call D("DetectUrl: " & sUrl)

    ' Parse the URL
    Dim re, sMatch
    Set re = New RegExp
    re.Multiline = True
    re.IgnoreCase = True
    re.Global = True

    ' The directory part is optional, so a URL such as http://localhost/sample.asp also matches
    re.Pattern = "(http://[-a-z0-9.]+)/(?:[-a-z0-9+&@#%~_|!:,.;/]*/)?"
    Dim sHost, sPath
    ' Example: sUrl = http://localhost/get/sample.asp
    Set sMatch = re.Execute(sUrl)
    ' sHost = http://localhost
    sHost = sMatch(0).SubMatches(0)
    ' sPath = http://localhost/get/
    sPath = sMatch(0).Value

    re.Pattern = "(src|href)=""?((?!http://)[-a-z0-9+&@#%=~_|!:,.;/]+)""?"
    Dim RemoteFile, RemoteFileUrl
    Set RemoteFile = re.Execute(sContent)

    ' RemoteFile: the collection of regular expression Match objects
    ' RemoteFileUrl: one Match object, shaped like src="upload/a.jpg"
    Dim sAbsoluteUrl
    For Each RemoteFileUrl In RemoteFile
        ' A path starting with "/" is relative to the host; any other path is relative to the page's directory
        If Left(RemoteFileUrl.SubMatches(1), 1) = "/" Then
            sAbsoluteUrl = sHost
        Else
            sAbsoluteUrl = sPath
        End If
        sAbsoluteUrl = RemoteFileUrl.SubMatches(0) & "=""" & sAbsoluteUrl & RemoteFileUrl.SubMatches(1) & """"
        sContent = Replace(sContent, RemoteFileUrl.Value, sAbsoluteUrl)
    Next

    DetectUrl = sContent
End Function

Improvement: avoiding duplicate downloads
Some images, such as spacer.gif, appear many times in a web page and would be downloaded repeatedly. One way to avoid this is to keep an arrUrls array that holds the URLs of the files already collected. Before each download, check the array to see whether the file has been collected, and collect only the files that have not been.
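
The check described above can be sketched as follows. Note that this sketch swaps the arrUrls array for a Scripting.Dictionary, so the lookup does not need a loop; oSeenUrls and IsNewUrl are hypothetical names, not taken from the original program.

```vbscript
' URLs that have already been collected (a Dictionary stands in for the arrUrls array)
Dim oSeenUrls
Set oSeenUrls = CreateObject("Scripting.Dictionary")

' Returns True the first time a URL is seen and records it;
' returns False for every later call with the same URL.
Function IsNewUrl(sUrl)
    If oSeenUrls.Exists(sUrl) Then
        IsNewUrl = False
    Else
        oSeenUrls.Add sUrl, True
        IsNewUrl = True
    End If
End Function
```

A download loop would then call MyHttpGet and Save2File only when IsNewUrl(sUrl) returns True. The comparison is an exact string match, so the same file reached through differently written URLs still counts as new.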

3. A practical example (using *** as the example)
*** is the site I visit most often, and it is fast, so I will use it as the example. No malice intended :-)

Analyzing the list page
Tips for content pages
Analyzing the previous-page and next-page links in a content page
On second thought, I will not write up this part for now, so as not to be looked down on :-), and it also saves me some typing. It simply comes down to fetching the remote web page and then using regular expressions to extract specific content such as the title, the author and the body. I have a few small tips:

First, the parts of the page before and after the content you want to analyze contain a lot of noise. You can strip them off as follows:
' Keep only part of the content for analysis; EditPlus can be used to count the characters
' Remove roughly the first 7600 and the last 5000 characters
sPageW = Left(sPageW, Len(sPageW) - 5000)
sPageW = Mid(sPageW, 7600)

Second, you may not want to leave a record of continuous, rapid browsing on the other party's server. The small routine below helps with that:
' Sub: Sleep
' Purpose: pause the program here for a few seconds
' Parameter: iSeconds  number of seconds to pause
Sub Sleep(iSeconds)
    D Timer() & " <font color=blue>sleep for " & iSeconds & " seconds</font>"
    Dim t : t = Timer()
    ' Busy-wait until the time has elapsed (note: Timer() restarts from 0 at midnight)
    While (Timer() < t + iSeconds)
        ' Do nothing
    Wend
    D Timer() & " <font color=blue>sleep for " & iSeconds & " seconds ok</font>"
End Sub

' Example call: pause for a random time of less than 3 seconds
Sleep(Fix(Rnd() * 3))
Third, use a regular expression testing tool to write regular expressions more efficiently.
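
The extraction step mentioned above (fetch the page, then pull out the title, author and so on with a regular expression) might look like the following sketch. ExtractFirst is a hypothetical helper, and the <title> pattern is only an example; a real page needs patterns written for its own markup.

```vbscript
' Return the first capture group of the first match of sPattern in sText,
' or an empty string when nothing matches.
' sPattern must contain at least one capture group, written as ( ... ).
Function ExtractFirst(sText, sPattern)
    Dim re, oMatches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    re.Pattern = sPattern
    Set oMatches = re.Execute(sText)
    If oMatches.Count > 0 Then
        ExtractFirst = oMatches(0).SubMatches(0)
    Else
        ExtractFirst = ""
    End If
End Function
```

For example, ExtractFirst(sPage, "<title>([\s\S]*?)</title>") returns the text between the title tags. The pattern uses [\s\S] rather than a dot because the dot does not match line breaks in VBScript regular expressions.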

4. Advanced topic: converting between UTF-8 and GB2312
This problem is more complicated. Limited by my ability and energy, I have not fully solved it, and most of the material on the Internet is either not entirely correct or incomplete. I recommend a C implementation of UTF-8 and GB2312 conversion for your reference; it is fully functional and does not depend on Windows API functions.
I am trying to implement it in ASP + VBScript, and have some still-immature observations:

Strings are represented as Unicode inside the operating system, so conversion between UTF-8 and GB2312 has to go through Unicode.
UTF-8 is a variant of Unicode, so the conversion between the two is relatively simple; see the figure in the original article.
GB2312 and Unicode encodings appear to be unrelated. To convert without relying on the operating system's internal functions, you need a mapping table that lists the correspondence between GB2312 and Unicode codes one by one; it contains approximately 7480x2 items.
In an ASP file, to have strings read in a given encoding (such as GB2312) by default, you need to set ASP's CodePage to the corresponding code page (936 for GB2312).
There are still some small but important problems in the encoding conversion that I have not figured out :-(
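
As an illustration of why the Unicode to UTF-8 step is the simple one, here is a minimal sketch that encodes a single Unicode code point as UTF-8 bytes using the standard one-, two- and three-byte bit layouts. CodePointToUtf8Hex and HexByte are hypothetical names; the sketch assumes nCode is a number from 0 to 65535 and does not treat the surrogate range specially.

```vbscript
' Encode one Unicode code point (0 to 65535) as UTF-8 and return the bytes
' as a hex string, for example "E4 B8 AD".
Function CodePointToUtf8Hex(nCode)
    If nCode < &H80 Then
        ' 1 byte: 0xxxxxxx
        CodePointToUtf8Hex = HexByte(nCode)
    ElseIf nCode < &H800 Then
        ' 2 bytes: 110xxxxx 10xxxxxx
        CodePointToUtf8Hex = HexByte(&HC0 Or (nCode \ &H40)) & " " & _
                             HexByte(&H80 Or (nCode And &H3F))
    Else
        ' 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        CodePointToUtf8Hex = HexByte(&HE0 Or (nCode \ &H1000)) & " " & _
                             HexByte(&H80 Or ((nCode \ &H40) And &H3F)) & " " & _
                             HexByte(&H80 Or (nCode And &H3F))
    End If
End Function

' Format a byte value (0 to 255) as two hex digits.
Function HexByte(nByte)
    HexByte = Right("0" & Hex(nByte), 2)
End Function
```

For instance, CodePointToUtf8Hex(20013) (the character U+4E2D) gives "E4 B8 AD". The GB2312 side has no such formula, which is why it needs the mapping table described above.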
5. More advanced topics: crawling after logging in, client forgery, etc.
The XMLHTTP object can talk to an HTTP server with the POST or GET method, and it can set and read HTTP headers. Learn the HTTP protocol and a few more of the XMLHTTP object's methods and properties, and you can use it to simulate a browser and automate all kinds of repetitive work that people used to do by hand.
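
A rough sketch of that idea: log in with a POST, then request a page while presenting the session cookie and browser-like headers. This is an assumption-laden outline, not a tested recipe. FetchAfterLogin and CookiePairFromHeader are hypothetical names, the User-Agent string is only an example, the URLs and form body are supplied by the caller, and only a single Set-Cookie header is handled.

```vbscript
' Keep only the "name=value" part of a Set-Cookie header value.
Function CookiePairFromHeader(sSetCookie)
    Dim nPos
    nPos = InStr(sSetCookie, ";")
    If nPos > 0 Then
        CookiePairFromHeader = Trim(Left(sSetCookie, nPos - 1))
    Else
        CookiePairFromHeader = Trim(sSetCookie)
    End If
End Function

' Post sFormBody (already URL-encoded, e.g. "user=me&pass=secret") to sLoginUrl,
' then fetch sPageUrl with the returned cookie and return the page text.
Function FetchAfterLogin(sLoginUrl, sFormBody, sPageUrl)
    Const USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" ' example value
    Dim oHttp, sCookie
    Set oHttp = CreateObject("MSXML2.ServerXMLHTTP")

    ' Step 1: submit the login form
    oHttp.Open "POST", sLoginUrl, False
    oHttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    oHttp.setRequestHeader "User-Agent", USER_AGENT
    oHttp.Send sFormBody
    ' Appending "" turns a missing header into an empty string
    sCookie = CookiePairFromHeader(oHttp.getResponseHeader("Set-Cookie") & "")

    ' Step 2: request the protected page, presenting the cookie and a Referer
    oHttp.Open "GET", sPageUrl, False
    oHttp.setRequestHeader "User-Agent", USER_AGENT
    oHttp.setRequestHeader "Referer", sLoginUrl
    If sCookie <> "" Then oHttp.setRequestHeader "Cookie", sCookie
    oHttp.Send

    FetchAfterLogin = oHttp.responseText
    Set oHttp = Nothing
End Function
```

Whether a given site accepts this depends on its login mechanism; sites that set several cookies or use hidden form tokens need more work than this sketch shows.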

6. Your own collection program
This article aims to discuss how to implement a collection program in the ASP + VBScript environment. If what you need is a ready-made web capture program, the following may be useful to you.

LocoySpider (Locomotive web content collector)
A content collector written in C# + .NET. An important feature is that it does not save the collected content to a database itself; instead it submits the content through a customizable POST to another page, such as the "new content" page of a content management system. Free.
BeeCollector (Little Bee Collector)
A content collector written in PHP + MySQL.
Wind Content Management System
This powerful content management system comes with an ASP web content collector.
