I was given the task of adding content from a Chinese brand website to our own site. Some of its pages are lists of links to articles, and clicking a link opens a page that displays the full article. Following this pattern, I combined regular expressions, XMLHTTP, server-side JScript, and ADO to write a small program that crawls the content into a local database. Once the content has been crawled, moving the data from one database to another is more convenient. First create an Access database, structured as follows:
Field    Type        Description
ID       AutoNumber  Identity, primary key
oldid    Number      ID of the old (source) record
title    Text        Title
content  Memo        Content
The actual implementation code is as follows
<%@ LANGUAGE="JSCRIPT" CODEPAGE="936" %>
<!--METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}"-->
<%
// Open the database
try
{
    var strconnectionstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objconnection = Server.CreateObject("ADODB.Connection");
    objconnection.Open(strconnectionstring);
}
catch (e)
{
    // Report the error and stop processing the page
    Response.Write(e.description);
    Response.End();
}
%>
<script language="JScript" runat="server">
// Fetch the remote list page
function GetData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    // The third argument (false) makes the request synchronous
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}
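// Use regular expressions to extract the links that qualify
function getlinks(str)
{
    var re = new RegExp("<a[^<>]+?\>((.|\n)*?)<\/a>", "gi");
    var a = str.match(re); // first pass: collect every anchor element
    if (a == null) return; // the page contains no links
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        // No g flag here, so test() always starts at the beginning of the string
        var r = /qy\.php\?id=(\d+)/i;
        if (!r.test(a[i])) continue; // not an article link
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1; // article ID
        temp = a[i].match(/<font[^<>]+?color=\"#000000\"\>(.*?)<\/font>/ig);
        t2 = RegExp.$1; // article title
        // If the title pattern did not match, RegExp.$1 still holds the ID
        if (t1 == t2) continue;
        Savearticle(t1, t2, getcontent(t1));
    }
}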
// Use the ID extracted from the link to fetch the corresponding article
function getcontent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)<\/span>", "gi");
    var a = str.match(re);
    // RegExp.$1 holds the first captured group of the last successful match
    return (RegExp.$1);
}
// Save the article to the database
function Savearticle(oldid, title, content)
{
    var orst = Server.CreateObject("ADODB.Recordset");
    var squery;
    squery = "Select oldid,title,content from Articles";
    orst.Open(squery, objconnection, adOpenStatic, adLockPessimistic);
    orst.AddNew();
    orst("oldid") = oldid;
    orst("title") = title;
    orst("content") = content;
    orst.Update();
    orst.Close();
    Response.Write(title + " crawled successfully" + "<br>");
}
</script>
<HTML>
<HEAD>
<TITLE> Crawl Articles </TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"> </HEAD>
<BODY>
<%=getlinks(GetData())%>
</BODY>
</HTML>
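The link-extraction step can also be tried outside ASP. The following is only a minimal sketch, not the code from the listing: it reads the capture groups from the result of exec() instead of the static RegExp.$1, so a failed title match shows up as null rather than as a stale value. The function name extractLinks and the sample markup are made up for illustration; the patterns assume the same link shape the listing targets.

```javascript
// Extract { id, title } pairs from a list page.
// Assumed link shape (taken from the patterns in the listing):
//   <a href="qy.php?id=N"><font ... color="#000000">Title</font></a>
function extractLinks(html) {
    var links = [];
    var anchorRe = /<a[^<>]+?>([\s\S]*?)<\/a>/gi;
    var idRe = /qy\.php\?id=(\d+)/i;
    var titleRe = /<font[^<>]+?color="#000000">(.*?)<\/font>/i;
    var anchor;
    // exec() with the g flag walks through the anchors one at a time.
    while ((anchor = anchorRe.exec(html)) !== null) {
        var idMatch = idRe.exec(anchor[0]);
        var titleMatch = titleRe.exec(anchor[0]);
        // Skip anchors that are not article links or that carry no title.
        if (idMatch === null || titleMatch === null) continue;
        links.push({ id: idMatch[1], title: titleMatch[1] });
    }
    return links;
}

// Made-up sample markup, for illustration only.
var sample =
    '<a href="index.php">Home</a>\n' +
    '<a href="qy.php?id=12"><font size="2" color="#000000">First article</font></a>\n' +
    '<a href="qy.php?id=34"><img src="x.gif"></a>';

console.log(JSON.stringify(extractLinks(sample)));
```

Only the second anchor has both an article ID and a title, so it is the only one returned; the other two are skipped.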
The next step is to import the contents of this Access database into the server's database. One problem remains: the original articles are classified, so the categories have to be assigned by hand during the import. The regular expressions for parsing the links were already troublesome to write, although they are quite strict. Parsing the category with regular expressions would be harder still, because the category text sits inside a <td>, and the page contains many <td> tags, so locating the one that holds the category text would be very troublesome. Even if it were written, the program would lose flexibility and become hard to maintain, so for now this is as far as the program goes.
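To make the difficulty concrete, here is a minimal sketch of what parsing the category might involve. It assumes, hypothetically, that the category text sits in the <td> immediately after a cell containing a fixed label. The real page layout is not shown in this article, so the label "Category:", the sample markup, and the helper names cellTexts and categoryAfterLabel are all illustrative.

```javascript
// Collect the text of every <td> cell, in document order.
function cellTexts(html) {
    var texts = [];
    var cellRe = /<td[^<>]*>([\s\S]*?)<\/td>/gi;
    var cell;
    while ((cell = cellRe.exec(html)) !== null) {
        // Drop nested tags, then trim surrounding whitespace.
        texts.push(cell[1].replace(/<[^<>]+>/g, "").replace(/^\s+|\s+$/g, ""));
    }
    return texts;
}

// Return the text of the cell that follows the cell equal to `label`,
// or null when no such pair exists.
function categoryAfterLabel(html, label) {
    var texts = cellTexts(html);
    for (var i = 0; i + 1 < texts.length; i++) {
        if (texts[i] === label) return texts[i + 1];
    }
    return null;
}

// Hypothetical markup; the real page may look nothing like this.
var page =
    '<table><tr><td>Home</td><td><b>News</b></td></tr>' +
    '<tr><td class="k">Category:</td><td> Brands </td></tr></table>';

console.log(categoryAfterLabel(page, "Category:"));
console.log(categoryAfterLabel(page, "Missing:"));
```

Even in this simplified form the lookup depends on the exact order of the cells, and a nested table would confuse the <td> pattern, which is the kind of fragility that makes the approach hard to maintain.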