Build Your Own Search Engine---Baidu Chapter _ Thief/Collection Programs

Source: Internet
Author: User
Tags: error handling
Want your own search engine? With the data collection method described here, you can have one right away. The steps below show how.

First, get to know Baidu Search

Baidu Search, the world's largest Chinese-language search engine, was listed on NASDAQ in the United States on August 5, 2005. It currently has the highest usage rate of any search engine among users in China, and it offers search for web pages, news, pictures, music, maps and more.

1. Baidu web search query parameters

Required parameters

☆wd--the query keyword (Keyword)
☆pn--which page of results to display (Page Number)
☆cl--the search type (Class); cl=3 means web search

Optional parameters
☆rn--the number of results shown per page (Record Number), between 10 and 100; the default is rn=10
☆ie--the text encoding of the query input (Input Encoding); the default is ie=gb2312, i.e. Simplified Chinese
☆tn--the source site from which the search request was submitted
A few useful tn values:
tn=baidulocal selects Baidu site search. The results it returns are very clean, with no advertising. For example, search for "Happy" with Baidu site search and see how refreshing the returned page looks.
tn=baiducnnic is the one to try if you want to put Baidu inside a frame; it was customized by Baidu for CNNIC.

☆si--restricts the search to one domain name. For example, to search within Sina, use si=sina.com.cn. This parameter only takes effect when it is combined with the ct parameter.

☆ct--the value of this parameter is typically a string of digits, presumably an authentication code for the search request

The si and ct parameters are used in combination. For example, to search for "ideal" within sina.com.cn, use: http://www.baidu.com/baidu?ie=utf-8&am ... n&cl=3&word= ideal

☆bs--the previous search keyword (Before Search), presumably used for related searches
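
The parameters above can be assembled into a query URL. The following is a minimal sketch in Python (not the ASP used in this article); the function name `build_baidu_url` and its argument names are made up for illustration, and `pn` is simply passed through as given. Only the parameters described above are used.

```python
from urllib.parse import urlencode


def build_baidu_url(keyword, pn=0, rn=10, tn=None):
    """Build a web search URL from the parameters described in the article."""
    params = {
        "wd": keyword,    # query keyword
        "pn": pn,         # page of results to display, passed through unchanged
        "cl": 3,          # cl=3 selects web search
        "rn": rn,         # results per page, 10 to 100
        "ie": "gb2312",   # input encoding of the keyword
    }
    if tn is not None:
        params["tn"] = tn  # optional source-site identifier, e.g. "baidulocal"
    # Percent-encode the values as GB2312 bytes so they agree with ie=gb2312.
    return "http://www.baidu.com/s?" + urlencode(params, encoding="gb2312")


print(build_baidu_url("ideal"))
print(build_baidu_url("ideal", tn="baidulocal"))
```

Encoding the keyword as GB2312 matters for Chinese keywords: the bytes sent in `wd` have to match the encoding announced in `ie`.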

2. Baidu search results page structure

From top to bottom, in source-code order:

Search box
Fixed-ranking "hot zone" on the right side
Search results
Paging area
Related searches
Bottom search box
Copyright area

Of these, the "search results" and "paging area" parts are the valid data we need. Looking at the page source, each part is delimited by a unique identifying string, so the content can be extracted by cutting between those identifiers. See the code below for the details.
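
As a rough illustration of cutting content out between two identifying strings, here is a small Python sketch. The function name `cut_between` and the marker strings in the usage lines are hypothetical and are not Baidu's real markers.

```python
def cut_between(content, start, end, include_markers=False):
    """Return the text between `start` and `end`, or '' if either is missing."""
    begin = content.find(start)
    if begin == -1:
        return ""
    # Look for the end marker only after the start marker.
    stop = content.find(end, begin + len(start))
    if stop == -1:
        return ""
    if include_markers:
        return content[begin:stop + len(end)]
    return content[begin + len(start):stop]


page = "<top><!--results-->hit 1, hit 2<!--paging-->1 2 3<!--end-->"
print(cut_between(page, "<!--results-->", "<!--paging-->"))
print(cut_between(page, "<!--paging-->", "<!--end-->"))
```

Returning an empty string when a marker is missing makes it easy to notice that the page layout has changed.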

Second, the core function--use ASP's XMLHTTP component

The core of a data collection program, commonly known as a thief program, is the XMLHTTP component. Collecting data with XMLHTTP is well-worn ground and there is plenty of material about it online. The typical collection code looks like this:

Set Http = Server.CreateObject("MSXML2.XMLHTTP")
Http.Open "GET", Url, False ' Open XMLHTTP (False = synchronous request)
Http.Send ' Send the request
If Http.ReadyState <> 4 Then
    Exit Function
End If
GetHttpPage = BytesToBstr(Http.ResponseBody, "GB2312") ' Take the response (a byte stream) and convert it to a string
Set Http = Nothing ' Release XMLHTTP

For a detailed application, see the complete code below.
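
For readers who want to see the same fetch-then-decode idea outside ASP, here is a hedged Python sketch. The names `bytes_to_text` and `get_http_page` are illustrative stand-ins, not part of the article's code, and the 10-second timeout is an arbitrary assumption.

```python
from urllib.request import urlopen


def bytes_to_text(body, charset="gb2312"):
    """Decode a raw response body, replacing undecodable bytes instead of failing."""
    return body.decode(charset, errors="replace")


def get_http_page(url, charset="gb2312", timeout=10):
    """Fetch a page synchronously and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return bytes_to_text(response.read(), charset)


# The decoding step on its own, without any network access:
raw = "百度".encode("gb2312")
print(bytes_to_text(raw))
```

The decoding step is kept separate from the fetch because the response arrives as bytes, and reading it with the wrong character set would garble the Chinese text.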

Third, complete code (filename: searchi_bd.asp)

<%
Option Explicit
Dim wd, pn
wd = Request("wd")
pn = Request.QueryString("pn")
' Start error handling
On Error Resume Next
If Err.Number <> 0 Then
    Response.Clear
    ' Display error message to user
    Response.Write "<p align='center'><font size=3>Error, please reopen Baidu search.</font></p>"
End If
%>
<HTML>
<HEAD>
<TITLE>Baidu Search--<%=Server.HTMLEncode(wd)%></TITLE>
</HEAD>
<style type=text/css>
<!--
body,td{font-family:arial}
td{font-size:9pt; line-height:18px}
.cred{color:#FF0000}
-->
</style>

<body leftmargin="0" topmargin="3" marginwidth="0" marginheight="0">
<table align="center" width="98%" cellspacing="0" cellpadding="0" border="0" bgcolor="#ffffff">
<tr>
<form name="f1" method="post" action="searchi_bd.asp">
<td width=150 height=50>
Your logo.
</td>
<td align="left">
<input name=wd title="Enter a keyword, then search..." value="<%=Server.HTMLEncode(wd)%>">
<input type="submit" value="Baidu Search">
</td></form></tr>
</table>
<%
Dim strURL, strTmp_bd, strInfo, strPage, strPageSum_bd, strQtime_bd
Dim bNoResult_bd, regEx, patrn
' Baidu query string
strURL = "http://www.baidu.com/s?ie=gb2312&wd=" &wd&am ... &pn& "&cl=3"
' Start collecting
strTmp_bd = GetHttpPage(strURL)
' The marker strings used below must match the text of Baidu's actual results page
If InStr(strTmp_bd, "not found and your query") <> 0 Then
    bNoResult_bd = 1
End If

' Extract the contents of the "search results" section
strInfo = Strcut(strTmp_bd, "<div id=scriptdiv></div>", "<br clear=all>", 2)
patrn = "</td></tr></table><br>"
Set regEx = New RegExp ' Create a regular expression
regEx.Pattern = patrn ' Set the pattern
regEx.IgnoreCase = True
regEx.Global = False
strInfo = regEx.Replace(strInfo, "")

' Extract the contents of the paging section
strPage = Strcut(strTmp_bd, "<br clear=all>", "<br>", 2)
' Point the paging links at this script (case-insensitive match)
strPage = Replace(strPage, "href=s?", "href=searchi_bd.asp?", 1, -1, vbTextCompare)
' Number of results and time taken
strPageSum_bd = Strcut(strTmp_bd, "Find relevant pages about", "article", 2)
If Not IsNumeric(strPageSum_bd) Then
    strPageSum_bd = Strcut(strTmp_bd, "Find the relevant page", "article", 2)
End If
strQtime_bd = Strcut(strTmp_bd, "spents", "Seconds", 2)
' strTmp_bd is a string, so clear it by assignment (Set ... = Nothing is only for objects)
strTmp_bd = ""

%>
<!--T1-start-->
<table cellspacing=0 cellpadding=0 border=0 width=98% align="center">
<tr valign=center align=middle height=18>
<td width=1 bgcolor=#999999></td>
<td nowrap style="font-weight:bold; color:#ffffff; background-color:#0033cc" width=64>Internet</td>
<td align=right bgcolor=#eeeeee><nobr>Found about <b><%=strPageSum_bd%></b> web pages matching <b><%=wd%></b>, took <b><%=strQtime_bd%></b> seconds</nobr></td>
</tr>
<tr><td bgcolor=#999999 colspan=3 height=2></td></tr></table>

<%
If wd = "" Then
    Response.Write "<p align='center'><font size=-1>Hello, please enter a keyword in the search box.</font></p>"
ElseIf bNoResult_bd = 1 Then
    Response.Write "<p align='center'><font size=-1>Sorry, nothing was found that matches your query. Please try a different keyword.</font></p>"
Else
%>
<table width="98%" align="center" cellspacing="0" cellpadding="0" border="0">
<tr>
<td style="line-height:160%" bgcolor="#ffffff" width="75%" valign=top><br>
<%=strInfo%>
</td>
<td width="25%" valign=top><br>This is your space to play!
</td>
</tr>
</table>
<table width="98%" align="center" cellspacing="0" cellpadding="4" border="0">
<tr>
<td align="center">
<br><font size=3><%=strPage%></font>
</td>
</tr>
</table>
<%
End If
' strInfo is a string, so clear it by assignment
strInfo = ""

%>
<hr size="1" width="760" color="#0000ff">

<div align="center"><font size=-1>
For program updates, please visit <span class="cred">(Knowledge Sharing Forum)</span></font>
</div>
</BODY>
</HTML>

<%
' Collection function: fetch a URL and return the page as a string
Function GetHttpPage(url)
    On Error Resume Next
    Dim http
    Set http = Server.CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", url, False
    http.Send
    If http.ReadyState <> 4 Then
        Set http = Nothing
        Exit Function
    End If
    GetHttpPage = BytesToBstr(http.ResponseBody, "GB2312")
    Set http = Nothing
    If Err.Number <> 0 Then
        Response.Write "<div align='center'><b>The server failed to fetch the file contents</b></div>"
        Err.Clear
    End If
End Function

' Convert a byte stream to a string in the given character set
Function BytesToBstr(body, cset)
    Dim objStream
    Set objStream = Server.CreateObject("ADODB.Stream")
    objStream.Type = 1 ' Binary
    objStream.Mode = 3 ' Read/write
    objStream.Open
    objStream.Write body
    objStream.Position = 0
    objStream.Type = 2 ' Text
    objStream.Charset = cset
    BytesToBstr = objStream.ReadText
    objStream.Close
    Set objStream = Nothing
End Function

' Cut a substring. cutType 1: include the start and end strings; cutType 2: exclude them
Function Strcut(strContent, startStr, endStr, cutType)
    Dim s1, s2
    On Error Resume Next
    Select Case cutType
    Case 1
        s1 = InStr(strContent, startStr)
        s2 = InStr(s1, strContent, endStr) + Len(endStr)
    Case 2
        s1 = InStr(strContent, startStr) + Len(startStr)
        s2 = InStr(s1, strContent, endStr)
    End Select
    If Err Then
        Strcut = "<p align='center'><font size=-1>Error cutting the string.</font></p>"
        Err.Clear
        Exit Function
    Else
        Strcut = Mid(strContent, s1, s2 - s1)
    End If
End Function

%>


Copy the code above into Notepad and save it as searchi_bd.asp, and it is ready to use. If you want to use a different filename, change the filename in the following line of code to match yours.

strPage = Replace(strPage, "href=s?", "href=searchi_bd.asp?", 1, -1, vbTextCompare)
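
The effect of that line can be seen in isolation with a short Python sketch. The function name `rewrite_paging_links` and the sample link are assumptions made for illustration; the idea is only that paging links pointing at Baidu's `s?` are redirected to your own script.

```python
import re


def rewrite_paging_links(html, script_name="searchi_bd.asp"):
    """Point paging links of the form href=s?... at the local script instead."""
    # A function replacement avoids any special handling of characters in script_name.
    return re.sub(r"href=s\?", lambda match: "href=" + script_name + "?",
                  html, flags=re.IGNORECASE)


sample = "<a href=s?wd=ideal&pn=10>2</a>"
print(rewrite_paging_links(sample))
print(rewrite_paging_links(sample, "mysearch.asp"))
```

If you rename the script, only the `script_name` value changes; the rest of the link, including the `wd` and `pn` values, is carried over untouched.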

A few notes:

1. Baidu search has essentially no anti-collection measures. The main thing to watch is that Baidu changes the source code of its results page from time to time, so check Baidu's results page regularly and, when the code changes, update the few identifying strings accordingly. On anti-collection, Baidu is more generous than Google: so far, frequent queries to Baidu have not been found to cause temporary blocking of the source site's IP, whereas this often happens when querying Google. How to solve that will be discussed in the next article.

2. Collection consumes a lot of resources, and a search thief program is no exception, so the program releases variables and objects as early as possible. If your hosting space has limited resources, it is recommended that you do not do this.

3. Some people may not want to keep any of Baidu's functional links in their own search thief, such as Baidu snapshots and site search. For them, the download package includes a compact version without any Baidu links, to use as needed. Its code is not listed in this article; otherwise it is the same as the full version.
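
Note 1 suggests watching for changes in Baidu's page source. One simple way to notice such a change, sketched here in Python with a hypothetical helper named `missing_markers`, is to check that every identifying string the program relies on still appears in the fetched page.

```python
def missing_markers(page, markers):
    """Return the identifying strings that no longer appear in the fetched page."""
    return [marker for marker in markers if marker not in page]


# The two markers below are the ones used by the listing in this article.
markers = ["<div id=scriptdiv></div>", "<br clear=all>"]
page = "...<br clear=all>..."  # stand-in for a fetched results page
print(missing_markers(page, markers))
```

A non-empty result means at least one identifier needs to be updated before the extraction code can be trusted again.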
