Syntax-highlighted version: http://gwx.showus.net/blog/article.asp?id=229
This is original work that took real effort; if you reprint it, please cite the original link: http://gwx.showus.net/blog/article.asp?id=229
Web collection programs, web crawlers, scraper scripts: this type of application is widely used. This article does not discuss the copyright or ethical issues raised by using such programs; it covers only how to implement them in an ASP + VBScript environment.
Prerequisites: in addition to general ASP + VBScript knowledge, you need to understand the xmlhttp object and the regular expression object. The xmlhttp object is the star of the currently popular Ajax, and once you have learned regular expressions you no longer need to worry about processing complex strings.
A regular expression testing tool is very useful when writing and debugging regular expressions.
Contents
Capture a remote webpage and save it locally
Improvement: handling garbled characters
Download images (and other files) from the remote webpage at the same time
Improvement: detect real URLs
Improvement: avoid repeated downloads
Practical example (using *** as an example)
Analyzing the list page
Content page techniques
Analyzing the previous and next page links on the content page
Advanced topic: converting between UTF-8 and GB2312
More advanced topics: fetching after login, spoofing the client
Existing collection programs
1. Capture a remote webpage and save it locally
' Debug helper: called throughout the code below to show intermediate results
Dim inDebug: inDebug = True
Sub D(Str)
    If inDebug = False Then Exit Sub
    Response.Write("<div style='color:#003399; border:solid 1px #003399; background:#EEF7FF; margin:1px; font-size:12px; padding:4px;'>")
    Response.Write(Str & "</div>")
    Response.Flush()
End Sub
' Procedure: Save2File
' Purpose: saves text or a byte stream to a file
' Parameters: sContent    the content to save
'             sFile       the target file, such as "files/abc.htm"
'             bText       whether the content is text
'             bOverWrite  whether to overwrite an existing file
Sub Save2File(sContent, sFile, bText, bOverWrite)
    Call D("Save2File: " & sFile & " * text: " & bText)
    Dim SaveOption, TypeOption, Ads
    ' SaveToFile options: 2 = overwrite, 1 = create only if the file does not exist
    If (bOverWrite = True) Then SaveOption = 2 Else SaveOption = 1
    ' Stream types: 2 = text, 1 = binary
    If (bText = True) Then TypeOption = 2 Else TypeOption = 1
    Set Ads = Server.CreateObject("Adodb.Stream")
    With Ads
        .Type = TypeOption
        .Open
        If (bText = True) Then .WriteText sContent Else .Write sContent
        .SaveToFile Server.MapPath(sFile), SaveOption
        .Cancel()
        .Close()
    End With
    Set Ads = Nothing
End Sub
The key function
' Function: myHttpGet
' Purpose: fetches a remote file (webpage or image)
' Parameters: sUrl   URL of the remote file
'             bText  True for text (a webpage); pass False to download a remote image
' Returns: the fetched content
Function myHttpGet(sUrl, bText)
    Call D("<font color=red>myHttpGet:</font> " & sUrl & " * text: " & bText)
    Dim oXml
    'Set oXml = Server.CreateObject("Microsoft.XMLHTTP")
    Set oXml = Server.CreateObject("MSXML2.ServerXMLHTTP") ' server version of the xmlhttp component
    ' To understand the following, refer to MSXML2.ServerXMLHTTP in MSDN.
    With oXml
        .Open "GET", sUrl, False
        .Send
        While .readyState <> 4 ' wait for the download to complete
            .WaitForResponse 1000
        Wend
        If bText = True Then
            myHttpGet = bytes2BSTR(.responseBody)
        Else
            myHttpGet = .responseBody
        End If
    End With
    Set oXml = Nothing
End Function
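As a rough usage sketch, the two routines above can be combined as follows. The URLs and file names are placeholders, and a "files" folder is assumed to exist next to the ASP page; this is an illustration, not part of the original listing.

```vbscript
' Placeholder URLs and paths; "files" is assumed to be an existing, writable folder.
Dim sHtml
sHtml = myHttpGet("http://example.com/index.htm", True)   ' fetch a page as text
Call Save2File(sHtml, "files/index.htm", True, True)      ' save it, overwriting any old copy

' For an image, pass bText = False to both routines so the bytes are kept as-is.
Call Save2File(myHttpGet("http://example.com/logo.gif", False), "files/logo.gif", False, True)
```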
Improvement: handling garbled characters
Garbled characters may appear when you read Chinese content returned by the server directly. The bytes2BSTR function called in myHttpGet is used to read double-byte text (such as Chinese) in the server's response correctly.
' myHttpGet helper: handles double-byte text
Function bytes2BSTR(vIn)
    Dim strReturn, i, ThisCharCode, NextCharCode
    strReturn = ""
    For i = 1 To LenB(vIn)
        ThisCharCode = AscB(MidB(vIn, i, 1))
        If ThisCharCode < &H80 Then
            ' Single-byte (ASCII) character
            strReturn = strReturn & Chr(ThisCharCode)
        Else
            ' Double-byte character: combine this byte with the next one
            NextCharCode = AscB(MidB(vIn, i + 1, 1))
            strReturn = strReturn & Chr(CLng(ThisCharCode) * &H100 + CInt(NextCharCode))
            i = i + 1
        End If
    Next
    bytes2BSTR = strReturn
End Function
The job of bytes2BSTR can also be done with the Adodb.Stream component, using the function below. Although this function lets you specify the character set (Charset), it does not convert between encodings: you cannot pass "UTF-8" as sCset to read a webpage encoded in GB2312.
' CharsetHelper: correctly reads content encoded in sCset (such as "GB2312", "UTF-8", etc.)
Function CharsetHelper(arrBytes, sCset)
    Call D("CharsetHelper: " & sCset)
    Dim oAdos
    Set oAdos = CreateObject("Adodb.Stream")
    With oAdos
        .Type = 1          ' binary
        .Mode = 3          ' adModeReadWrite
        .Open
        .Write arrBytes
        .Position = 0
        .Type = 2          ' text
        .Charset = sCset
        CharsetHelper = .ReadText
        .Close
    End With
    Set oAdos = Nothing
End Function
2. Download images (and other files) from the remote webpage at the same time
' Function: ProcessRemoteUrl
' Purpose: replaces remote file URLs in a string with local ones and saves the remote files
' Parameters: sContent    the string to process, that is, the text of the remote webpage
'             sSavePath   local path where the remote files are saved; do not end it with a slash (/)
'             sPreceding  URL prefix to use after the change, such as http://somehost/upload/
' Returns: the webpage text with remote paths replaced by local paths
Function ProcessRemoteUrl(sContent, sSavePath, sPreceding)
    Call D("ProcessRemoteUrl")
    Dim re, RemoteFile, RemoteFileUrl
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = True
    ' In the following regular expression, SubMatches(4) = full file name, SubMatches(5) = file extension
    re.Pattern = "((http):(?:\/\/){1}(?:(?:\w)+[.])+(net|com|cn|org|cc|tv|[0-9]{1,4})(\S*\/)((?:\S)+[.]{1}(gif|jpg|jpeg|png|bmp)))"
    Set RemoteFile = re.Execute(sContent)
    Dim SaveFileName
    ' RemoteFile     collection of regular expression Match objects
    ' RemoteFileUrl  a single regular expression Match object
    For Each RemoteFileUrl In RemoteFile
        SaveFileName = RemoteFileUrl.SubMatches(4)
        Call Save2File(myHttpGet(RemoteFileUrl.Value, False), sSavePath & "/" & SaveFileName, False, True)
        sContent = Replace(sContent, RemoteFileUrl.Value, sPreceding & SaveFileName)
    Next
    ProcessRemoteUrl = sContent
End Function
Improvement: detect real URLs
The ProcessRemoteUrl function above cannot handle relative links such as <a href="/upload/abc.gif" ... . To process these links, first use the following function to convert the relative links in the webpage to absolute links.
' Function: DetectUrl
' Purpose: replaces relative paths of remote files in a string with absolute paths starting with http://
' Parameters: sContent  the webpage text containing the relative paths to process
'             sUrl      URL of the remote webpage being processed, used to resolve relative paths
' Returns: the webpage text with relative links replaced by absolute links
Function DetectUrl(sContent, sUrl)
    Call D("DetectUrl: " & sUrl)
    ' Analyze the URL
    Dim re, sMatch, RemoteFile, RemoteFileUrl
    Set re = New RegExp
    re.Multiline = True
    re.IgnoreCase = True
    re.Global = True
    re.Pattern = "(http://[-A-Z0-9.]+)/[-A-Z0-9+@#%~_|!:,.;/]+/"
    Dim sHost, sPath
    ' Example: http://localhost/get/sample.asp
    Set sMatch = re.Execute(sUrl)
    ' sHost: http://localhost
    sHost = sMatch(0).SubMatches(0)
    ' sPath: http://localhost/get/
    sPath = sMatch(0).Value
    re.Pattern = "(src|href)=""?((?!http://)[-A-Z0-9+@#%=~_|!:,.;/]+)""?"
    Set RemoteFile = re.Execute(sContent)
    ' RemoteFile     collection of regular expression Match objects
    ' RemoteFileUrl  a single regular expression Match object, such as src="Upload/a.jpg"
    Dim sAbsoluteUrl
    For Each RemoteFileUrl In RemoteFile
        If Left(RemoteFileUrl.SubMatches(1), 1) = "/" Then
            ' Root-relative link: prepend the host only
            sAbsoluteUrl = sHost
        Else
            ' Directory-relative link: prepend the page's directory
            sAbsoluteUrl = sPath
        End If
        sAbsoluteUrl = RemoteFileUrl.SubMatches(0) & "=""" & sAbsoluteUrl & RemoteFileUrl.SubMatches(1) & """"
        sContent = Replace(sContent, RemoteFileUrl.Value, sAbsoluteUrl)
    Next
    DetectUrl = sContent
End Function
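One plausible order for calling the routines defined so far is sketched below. The URL, folder names and prefix are placeholders, and the "files/img" folder is assumed to exist; the point is only that DetectUrl should run before ProcessRemoteUrl, so that relative image links are already absolute when the download pattern is applied.

```vbscript
' Sketch only: placeholder URL and paths; "files" and "files/img" are assumed to exist.
Dim sUrl, sPage
sUrl = "http://example.com/news/article.asp"

sPage = myHttpGet(sUrl, True)                           ' 1. fetch the page as text
sPage = DetectUrl(sPage, sUrl)                          ' 2. turn relative links into absolute ones
sPage = ProcessRemoteUrl(sPage, "files/img", "img/")    ' 3. download images and rewrite their URLs
Call Save2File(sPage, "files/article.htm", True, True)  ' 4. save the rewritten page
```

Here the prefix "img/" is relative to the saved page, so the rewritten links point at the files stored under "files/img".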
Improvement: avoid repeated downloads
Some images, such as spacer.gif, appear many times in a webpage and would be downloaded repeatedly. One way to avoid this is to keep an arrUrls array holding the URLs of the files already collected, and to traverse the array before each download to check whether the file has been collected. Only files that are not yet in the array are then downloaded.
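A minimal sketch of this check follows. It swaps the array for a Scripting.Dictionary, which avoids the manual traversal; the names dictUrls and IsNewUrl are hypothetical and do not appear in the listings above.

```vbscript
' Hypothetical helper: remembers which URLs have already been collected.
' A Scripting.Dictionary is used here in place of the arrUrls array.
Dim dictUrls
Set dictUrls = CreateObject("Scripting.Dictionary")

' Returns True the first time a URL is seen and False on every repeat.
Function IsNewUrl(sUrl)
    If dictUrls.Exists(sUrl) Then
        IsNewUrl = False
    Else
        dictUrls.Add sUrl, True
        IsNewUrl = True
    End If
End Function
```

Inside the For Each loop of ProcessRemoteUrl, the download could then be guarded with something like `If IsNewUrl(RemoteFileUrl.Value) Then Call Save2File(...)`, while the Replace call still runs for every match.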
3. Practical example (using *** as an example)
*** is the site I visit most often, and its connection speed is good, so let's use it as the example. No malice intended :-)
Analyzing the list page
Content page techniques
Analyzing the previous and next page links on the content page
After some thought, I decided not to write this part out in detail, so that nobody looks down on me for it :-), and it also saves me the time of writing more. It comes down to fetching remote webpages and then using regular expressions to analyze and extract specific content, such as the title, author, and body. I have a few small tips:
First, the content at the beginning and end of the page source interferes heavily with the analysis. You can remove it first as follows:
' Keep only part of the content for analysis. You can count the characters in EditPlus.
' Remove the first 7600 and the last 5000 characters
sPageW = Left(sPageW, Len(sPageW) - 5000)
sPageW = Mid(sPageW, 7601)
Second, you may not want to leave a run of rapid, consecutive requests in the other party's server logs. The following small procedure helps.
' Procedure: Sleep
' Purpose: pauses the program here for a few seconds
' Parameter: iSeconds  the number of seconds to pause
Sub Sleep(iSeconds)
    D Timer() & " <font color=blue>Sleep For " & iSeconds & " Seconds</font>"
    Dim t: t = Timer()
    ' Busy-wait until the requested time has passed
    While (Timer() < t + iSeconds)
        ' Do nothing
    Wend
    D Timer() & " <font color=blue>Sleep For " & iSeconds & " Seconds OK</font>"
End Sub
' Call example: pause for a random whole number of seconds, less than 3
Randomize
Sleep Fix(Rnd() * 3)
Third, use a regular expression testing tool to write regular expressions more efficiently.
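The extraction step described in this section (fetch a page, then pull out a specific field with a regular expression) can be sketched as follows for the page title. The function name ExtractTitle is hypothetical, and the pattern assumes a conventional <title> element; real pages usually need a pattern tuned to their own markup.

```vbscript
' Hypothetical helper: returns the text of the first <title> element, or "" if there is none.
Function ExtractTitle(sHtml)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    ' [\s\S]*? matches any characters, including line breaks, as few as possible
    re.Pattern = "<title[^>]*>([\s\S]*?)</title>"
    Set matches = re.Execute(sHtml)
    If matches.Count > 0 Then
        ExtractTitle = Trim(matches(0).SubMatches(0))
    Else
        ExtractTitle = ""
    End If
End Function
```

The same shape (one capturing group around the wanted text, read back through SubMatches(0)) applies to the author, the body, or the previous and next page links.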
4. Advanced topic: converting between UTF-8 and GB2312
This problem is complicated, and given the limits of my ability and energy I have not solved it completely. Most of the material I found on the Internet is either not entirely correct or not comprehensive. I recommend a C-language implementation of UTF-8 and GB2312 conversion for reference; it is fully functional and does not rely on Windows API functions.
I am trying to implement it in ASP + VBScript. Some preliminary notes:
Strings inside the operating system are Unicode, so conversion between UTF-8 and GB2312 has to go through Unicode as an intermediary.
UTF-8 is an encoding form of Unicode, so converting between the two is relatively simple.
The encodings of GB2312 and Unicode appear to be unrelated. Converting without relying on the operating system's internal functions requires a mapping table that gives the one-to-one correspondence between GB2312 and Unicode codes; this table contains about 7480x2 entries.
In an ASP file, to have strings read by default in a particular encoding (such as GB2312), set the ASP CodePage to the corresponding code page (for GB2312, CodePage = 936).
There are still some small but important issues in encoding conversion that I have not figured out :-(
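To illustrate why the UTF-8 to Unicode step is the simple one, here is a minimal decoding sketch. It is not the recommended C implementation and not a complete converter: the function name Utf8BytesToString is hypothetical, the input is assumed to be a plain VBScript array of byte values (0 to 255) rather than the byte string returned by responseBody, and only 1- to 3-byte sequences are handled, with no validation of malformed or truncated input.

```vbscript
' Hypothetical helper: decodes an array of UTF-8 byte values (0-255) into a VBScript string.
' Handles 1- to 3-byte sequences only; input is assumed to be well-formed.
Function Utf8BytesToString(arrBytes)
    Dim i, b, code, sOut
    sOut = ""
    i = LBound(arrBytes)
    Do While i <= UBound(arrBytes)
        b = arrBytes(i)
        If b < &H80 Then
            ' 0xxxxxxx: one byte
            code = b
            i = i + 1
        ElseIf (b And &HE0) = &HC0 Then
            ' 110xxxxx 10xxxxxx: two bytes
            code = CLng(b And &H1F) * 64 + (arrBytes(i + 1) And &H3F)
            i = i + 2
        ElseIf (b And &HF0) = &HE0 Then
            ' 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            code = CLng(b And &HF) * 4096 + CLng(arrBytes(i + 1) And &H3F) * 64 + (arrBytes(i + 2) And &H3F)
            i = i + 3
        Else
            ' 4-byte sequences and invalid lead bytes are not handled in this sketch
            code = 63 ' "?"
            i = i + 1
        End If
        sOut = sOut & ChrW(code)
    Loop
    Utf8BytesToString = sOut
End Function
```

For example, the bytes E4 B8 AD decode to the single code point U+4E2D, so `AscW(Utf8BytesToString(Array(&HE4, &HB8, &HAD)))` gives &H4E2D. Going from Unicode to GB2312 has no such bit formula, which is why the mapping table mentioned above is needed.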
5. More advanced topics: fetching after login, spoofing the client, and so on
The xmlhttp object can interact with HTTP servers using the POST or GET method, and it can set and read HTTP headers. Once you understand the HTTP protocol and know more of the xmlhttp object's methods and properties, you can use it to simulate a browser and carry out all kinds of repetitive work that previously had to be done by hand.
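A rough sketch of this idea with MSXML2.ServerXMLHTTP follows. Every URL, form field name and header value is a placeholder, and real sites differ in how they issue session cookies (several Set-Cookie headers, redirects, hidden form fields), so treat this as a starting point under those assumptions rather than a working login.

```vbscript
' Sketch only: all URLs, form fields and header values below are placeholders.
Dim oHttp, sCookie, p
Set oHttp = CreateObject("MSXML2.ServerXMLHTTP")

' 1. Log in with a POST request, sending browser-like headers
oHttp.Open "POST", "http://example.com/login.asp", False
oHttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
oHttp.setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0)"
oHttp.Send "username=demo&password=secret"

' 2. Read the session cookie and keep only the name=value part before the first ";"
sCookie = oHttp.getResponseHeader("Set-Cookie") & ""
p = InStr(sCookie, ";")
If p > 0 Then sCookie = Left(sCookie, p - 1)

' 3. Fetch a page that requires login, presenting the cookie and a Referer
oHttp.Open "GET", "http://example.com/member/page.asp", False
If Len(sCookie) > 0 Then oHttp.setRequestHeader "Cookie", sCookie
oHttp.setRequestHeader "Referer", "http://example.com/login.asp"
oHttp.Send

' oHttp.status holds the HTTP status code; oHttp.responseBody holds the raw bytes,
' which can be passed to bytes2BSTR or CharsetHelper as in myHttpGet.
Set oHttp = Nothing
```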
6. Existing collection programs
This article aims to discuss how to implement a collection program in the ASP + VBScript environment. If what you need is a ready-made webpage collection program, the following may be useful to you.
LocoySpider webpage content collector
A C# + .NET content collector. An important feature is that it does not save the collected content to a database; instead it submits the content to other webpages through a custom POST, such as the new-content page of a content management system. Free of charge.
BeeCollector (beebot collector)
A PHP + MySQL content collector.
Fengxun Content Management System
This powerful content management system includes an ASP webpage content collector.