I was given the task of adding content from a Chinese brand website to our own site. Some of the source pages are lists of article links; clicking a link opens a page that displays the article's full content. Based on this pattern, I combined regular expressions, XMLHTTP, server-side JScript, and ADO to write a small program that crawls the content into a local database. Once it has been crawled down, importing the data from that database into the server database is straightforward. First, create an Access database with the following structure:
| Field | Type | Description |
| --- | --- | --- |
| Id | AutoNumber | Identity, primary key |
| Oldid | Number | ID in the old data |
| Title | Text | Article title |
| Content | Memo | Article content |
The actual implementation code is as follows:
<%@ LANGUAGE="JSCRIPT" CODEPAGE="936" %>
<!--METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}"-->
<%
// Open the database
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>
<script language="JScript" runat="Server">
// Fetch the list page from the remote site
function GetData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}
// Use a regular expression to extract the qualifying links
function GetLinks(str)
{
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
    var a = str.match(re);
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        var r = /qy\.php\?id=(\d+)/ig;
        if (!r.test(a[i])) continue;
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1;
        temp = a[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
        t2 = RegExp.$1;
        if (t1 == t2) continue;
        SaveArticle(t1, t2, GetContent(t1));
    }
}
// Take the ID from the extracted link and fetch the corresponding article by that ID
function GetContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)<\\/span>", "gi");
    var a = str.match(re);
    return (RegExp.$1);
}
// Save the article to the database
function SaveArticle(oldid, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery = "SELECT Oldid,Title,Content FROM Articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("Oldid") = oldid;
    oRst("Title") = title;
    oRst("Content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " crawled successfully" + "<br>");
}
</script>
<HTML>
<HEAD>
<TITLE> Crawl Articles </TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"></HEAD>
<BODY>
<%=GetLinks(GetData())%>
</BODY>
</HTML>
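The heart of the listing is the regular-expression extraction and the legacy `RegExp.$1` static capture it relies on. Below is a minimal standalone sketch of that logic in plain JavaScript, runnable outside ASP (no XMLHTTP or ADO). The sample HTML is made up for illustration; the real list page at mppro2.php would supply similar markup.

```javascript
// Sample markup standing in for the fetched list page (hypothetical content)
var sample =
    '<a href="qy.php?id=17"><font size="2" color="#000000">Brand A</font></a>' +
    '<a href="about.php">About</a>' +
    '<a href="qy.php?id=42"><font size="2" color="#000000">Brand B</font></a>';

// Same anchor-matching pattern as in GetLinks above
var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
var links = sample.match(re);

var results = [];
for (var i = 0; i < links.length; i++) {
    var r = /qy\.php\?id=(\d+)/ig;          // fresh regex each pass, so lastIndex is clean
    if (!r.test(links[i])) continue;         // skip links that point elsewhere
    links[i].match(/qy\.php\?id=(\d+)/ig);
    var id = RegExp.$1;                      // legacy static capture, as used in the article
    links[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
    var title = RegExp.$1;
    results.push({ oldid: id, title: title });
}
// results holds the (id, title) pairs; the ASP version would now call
// GetContent(id) and SaveArticle(id, title, content) for each of them.
```

Note that `RegExp.$1` is a deprecated, global side channel: any intervening match overwrites it, which is why each capture is read immediately after its `match()` call.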
The next step is to import the contents of this Access database into the server database. One problem remains: the original articles are categorized, so they have to be classified manually during the import. The regular expression for parsing the links was already troublesome to write, though quite rigorous; parsing the categories with regular expressions as well would be far worse, because the category text is buried among the page's <td> tags and is hard to pin down. Even if such an expression could be written, the program would lose flexibility and become difficult to maintain, so for now I stop at this step.
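One low-tech way to handle the manual classification during the import is a hand-maintained lookup from the old article IDs to their categories, filled in by eye while browsing the source site. This is only a sketch; the map contents and function name below are hypothetical, not part of the original program.

```javascript
// Hypothetical id-to-category map, maintained by hand for the import step
var categoryByOldid = {
    "17": "Electronics",
    "42": "Food"
};

function categorize(oldid) {
    // Keys are strings because the ids were captured as text by the regex
    return categoryByOldid[String(oldid)] || "Uncategorized";
}
```

Anything not listed falls into a default bucket, which keeps the import running even when the map is incomplete.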