Using ASP to Crawl Remote Web Pages into a Local Database

Source: Internet
Author: User

I was given the task of adding some content from a Chinese brand website to our own site. The source pages are lists of article links; clicking a link opens a page that displays the article's full content. Based on this pattern, I combined regular expressions, XMLHTTP, server-side JScript, and ADO to write a small program that crawls the content into a local database. Once the data has been crawled down, it is much easier to load it from there into the production database. First, create an Access database with the following structure:

Id       AutoNumber   identifier, primary key
Oldid    Number       the article's id on the source site
Title    Text         article title
Content  Memo         article body

The actual implementation code is as follows:

<%@ LANGUAGE="JSCRIPT" CODEPAGE="936" %>
<!--METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}"-->
<%
// Open the database
var objConnection;
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>
<script language="JScript" runat="Server">
// Fetch the remote list page
function GetData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}

// Use regular expressions to extract the qualifying links
function GetLinks(str)
{
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
    var a = str.match(re);    // first collect every <a>...</a> fragment
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        var r = /qy\.php\?id=(\d+)/ig;
        if (!r.test(a[i])) continue;
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1;    // the article id
        temp = a[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
        t2 = RegExp.$1;    // the article title
        if (t1 == t2) continue;
        SaveArticle(t1, t2, GetContent(t1));
    }
}
// Take the id extracted from a link and fetch the corresponding article
function GetContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)<\\/span>", "gi");
    var a = str.match(re);
    return (RegExp.$1);
}
// Save one article into the local Access database
function SaveArticle(oldid, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery = "SELECT Oldid, Title, Content FROM Articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("Oldid") = oldid;
    oRst("Title") = title;
    oRst("Content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " crawled successfully" + "<br>");
}
</script>
<HTML>
<HEAD>
<TITLE> Crawl Articles </TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"> </HEAD>
<BODY>
<%= GetLinks(GetData()) %>
</BODY>
</HTML>
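The link-extraction step above is the heart of the program. As a sketch of the same logic in standalone JavaScript (runnable in Node.js, outside ASP), here is the id/title extraction applied to a hypothetical sample of the list-page HTML; the sample markup and the `getLinks` name are invented for illustration, not taken from the real site:

```javascript
// Hypothetical fragment of a list page, shaped like the markup the
// ASP regexes expect: article links plus one unrelated link.
const sampleHtml =
  '<a href="qy.php?id=12"><font color="#000000">First article</font></a>' +
  '<a href="qy.php?id=34"><font color="#000000">Second article</font></a>' +
  '<a href="other.php?id=99">unrelated link</a>';

function getLinks(str) {
  const results = [];
  // First collect every <a>...</a> fragment, as the ASP version does.
  const anchors = str.match(/<a[^<>]+?>((.|\n)*?)<\/a>/gi) || [];
  for (const a of anchors) {
    // Keep only links that point at the article-detail page.
    const idMatch = a.match(/qy\.php\?id=(\d+)/i);
    if (!idMatch) continue;
    // Pull the title text out of its <font> wrapper.
    const titleMatch = a.match(/<font[^<>]+?color="#000000">(.*?)<\/font>/i);
    if (!titleMatch) continue;
    results.push({ id: idMatch[1], title: titleMatch[1] });
  }
  return results;
}

console.log(getLinks(sampleHtml));
// [ { id: '12', title: 'First article' }, { id: '34', title: 'Second article' } ]
```

In a full port, each `{ id, title }` pair would then drive the detail-page fetch and the database insert, mirroring `GetContent` and `SaveArticle` above.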

The next step is to import the contents of this Access database into the server database. One wrinkle remains: the original articles are categorized, so the categories have to be assigned by hand during the import. The regular expressions used to parse the links are already quite fiddly (though rigorous); parsing the categories the same way would be worse still, because the category text is buried inside the nested <td> tags of the list page and is hard to locate reliably. Even if such an expression could be written, the program would lose flexibility and become hard to maintain, so for now the crawl stops at this step.
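To illustrate the brittleness described above, extracting a category that sits in a bare table cell forces the regex to rely on the cell's position in the row. The HTML fragment here is hypothetical, invented purely for illustration:

```javascript
// Hypothetical list-page row: the category lives in the third <td>,
// which has no distinguishing attribute to anchor a pattern on.
const row =
  '<tr><td>1</td>' +
  '<td><a href="qy.php?id=12">title</a></td>' +
  '<td>Food &amp; Beverage</td></tr>';

// Match the text of the third <td> by counting cells -- any layout
// change (an extra column, an attribute on a cell) breaks this.
const m = row.match(/<td>[^<]*<\/td><td>.*?<\/td><td>([^<]*)<\/td>/i);
const category = m ? m[1] : null;
console.log(category); // "Food &amp; Beverage"
```

A position-dependent pattern like this is exactly the kind of fragile, hard-to-maintain expression the paragraph above chooses to avoid.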
