Using ASP to Crawl Remote Web Pages into a Local Database

Source: Internet
Author: User
I received a task: add some content from a Chinese brand website to our own site. The pages follow a pattern: some are lists of article links, and clicking a link opens a page showing the article's full text. Based on this pattern, I combined regular expressions, XMLHTTP, server-side JScript, and ADO into a small program that crawls the content into a local database. Once everything is crawled, importing the data from the local database into the server's database is straightforward. First, create an Access database with the following structure:

ID — AutoNumber, primary key
oldid — Number, the article's ID on the original site
title — Text, the article title
content — Memo, the article body
The actual implementation code is as follows:

<%@ LANGUAGE="JSCRIPT" CODEPAGE="936" %>
<!--METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00aa006d2ea4}"-->
<%
// Open the database
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>

<script language="JScript" runat="Server">

// Remotely fetch the page that lists the article links
function GetData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}

// Use a regular expression to extract the qualifying links
function GetLinks(str)
{
    // First collect every <a>...</a> fragment, then examine each one
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
    var a = str.match(re);
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        var r = /qy\.php\?id=(\d+)/ig;

        if (!r.test(a[i])) continue;
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1;
        temp = a[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
        t2 = RegExp.$1;
        if (t1 == t2) continue;
        SaveArticle(t1, t2, GetContent(t1));
    }
}
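The extraction logic above can be tried out in isolation. The sketch below runs the same two-stage regex approach over a made-up fragment of list-page markup (the real layout on chinamp.org may differ): first collect every anchor, then keep only those whose href carries a `qy.php?id=N` parameter.

```javascript
// Hypothetical sample of the list-page markup; not the real chinamp.org HTML.
var sample =
  '<a href="qy.php?id=12"><font color="#000000">First article</font></a>' +
  '<a href="other.php?id=99">not an article link</a>' +
  '<a href="qy.php?id=34"><font color="#000000">Second article</font></a>';

// Stage 1: collect all <a>...</a> fragments.
var anchors = sample.match(/<a[^<>]+?>((.|\n)*?)<\/a>/gi);

// Stage 2: keep only article links, capturing the id and the title.
var articles = [];
for (var i = 0; i < anchors.length; i++) {
  var idMatch = /qy\.php\?id=(\d+)/i.exec(anchors[i]);
  if (!idMatch) continue;                 // skip links to other pages
  var titleMatch = /<font[^<>]+?color="#000000">(.*?)<\/font>/i.exec(anchors[i]);
  if (!titleMatch) continue;              // skip links without a styled title
  articles.push({ id: idMatch[1], title: titleMatch[1] });
}
console.log(articles); // two entries: id 12 / "First article", id 34 / "Second article"
```

Note the sketch uses `exec` and numbered capture groups rather than the `RegExp.$1` static property, which is JScript/IE-specific; the filtering logic is otherwise the same.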

// Take the ID from the extracted link, then fetch the corresponding article by that ID
function GetContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)<\\/span>", "gi");
    var a = str.match(re);
    return (RegExp.$1);
}
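The body-extraction regex relies on the detail page wrapping the article text in a span with a distinctive inline style. The snippet below demonstrates that pattern on a hypothetical page fragment (the real detail-page markup is assumed, not verified): spans with other styles are skipped and only the `font-size:10.8pt` span is captured.

```javascript
// Hypothetical detail-page fragment; not the real chinamp.org HTML.
var page =
  '<html><body>' +
  '<span id="nav" style="font-size:9pt">menu</span>' +
  '<span style="font-size:10.8pt">Article body text here.</span>' +
  '</body></html>';

// Capture the text of the span whose inline style marks the article body.
var m = /<span[^<>]+?style="font-size:10\.8pt">(.*?)<\/span>/i.exec(page);
var body = m ? m[1] : null;
console.log(body); // → "Article body text here."
```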

// Save one article into the local Access database
function SaveArticle(oldid, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery = "SELECT oldid, title, content FROM articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("oldid") = oldid;
    oRst("title") = title;
    oRst("content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " crawled successfully" + "<br>");
}

</script>
<HTML>
<HEAD>
<TITLE>Crawl Articles</TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"></HEAD>
<BODY>
<%=GetLinks(GetData())%>
</BODY>

</HTML>

The next step is to import the contents of this Access database into the server's database. One wrinkle remains: the original articles are categorized, so the categories have to be assigned by hand during import. The regular expression for parsing the links was already troublesome to write (though it turned out quite rigorous); parsing the categories the same way would be worse. The category text sits inside a <td>, and the page contains a great many <td> tags, so locating the right one would be tedious. Even if it worked, the program would lose flexibility and become hard to maintain. So for now the program stops at this step.
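The <td> problem is easy to demonstrate. In the sketch below (a hypothetical page fragment, not the real chinamp.org markup), a plain cell regex matches every cell equally, so the only way to pick out the category is a hard-coded position, which breaks the moment the page layout changes.

```javascript
// Hypothetical row with several <td> cells; only one holds the category,
// and nothing in the markup distinguishes it from the others.
var page =
  '<table><tr>' +
  '<td>2004-05-01</td>' +
  '<td>Widgets</td>' +   // the category we actually want
  '<td>Editor</td>' +
  '</tr></table>';

// A naive cell regex matches every cell...
var cells = page.match(/<td>(.*?)<\/td>/gi);
console.log(cells.length); // → 3

// ...so selecting the category comes down to a fragile positional pick.
var category = cells[1].replace(/<\/?td>/gi, '');
console.log(category); // → "Widgets"
```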


