I was given the task of adding content from a Chinese brand website to our own site. Some of the source pages are lists of article links; clicking a link opens a page that displays the article's full content. Based on this pattern, I combined regular expressions, XMLHTTP, server-side JScript, and ADO to write a small program that crawls the content into a local database. Once it has been crawled down, importing the data from that database into the server database is straightforward. First, create an Access database with the following structure:
| Field | Type | Description |
| --- | --- | --- |
| Id | AutoNumber | Identity, primary key |
| Oldid | Number | ID in the old data |
| Title | Text | Article title |
| Content | Memo | Article content |
The actual implementation code is as follows:
<%@ LANGUAGE="JSCRIPT" CODEPAGE="936" %>
<!--METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}"-->
<%
// Open the database
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>
<script language="JScript" runat="Server">
// Fetch the list page from the remote site
function GetData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}
// Use a regular expression to extract the qualifying links
function GetLinks(str)
{
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
    var a = str.match(re);
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        var r = /qy\.php\?id=(\d+)/ig;
        if (!r.test(a[i])) continue;
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1;
        temp = a[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
        t2 = RegExp.$1;
        if (t1 == t2) continue;
        SaveArticle(t1, t2, GetContent(t1));
    }
}
// Take the ID from the extracted link and fetch the corresponding article by that ID
function GetContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)<\\/span>", "gi");
    var a = str.match(re);
    return (RegExp.$1);
}
// Save the article to the database
function SaveArticle(oldid, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery = "SELECT Oldid,Title,Content FROM Articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("Oldid") = oldid;
    oRst("Title") = title;
    oRst("Content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " crawled successfully" + "<br>");
}
</script>
<HTML>
<HEAD>
<TITLE> Crawl Articles </TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"></HEAD>
<BODY>
<%=GetLinks(GetData())%>
</BODY>
</HTML>
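The heart of the listing is the regular-expression extraction and the legacy `RegExp.$1` static capture it relies on. Below is a minimal standalone sketch of that logic in plain JavaScript, runnable outside ASP (no XMLHTTP or ADO). The sample HTML is made up for illustration; the real list page at mppro2.php would supply similar markup.

```javascript
// Sample markup standing in for the fetched list page (hypothetical content)
var sample =
    '<a href="qy.php?id=17"><font size="2" color="#000000">Brand A</font></a>' +
    '<a href="about.php">About</a>' +
    '<a href="qy.php?id=42"><font size="2" color="#000000">Brand B</font></a>';

// Same anchor-matching pattern as in GetLinks above
var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
var links = sample.match(re);

var results = [];
for (var i = 0; i < links.length; i++) {
    var r = /qy\.php\?id=(\d+)/ig;          // fresh regex each pass, so lastIndex is clean
    if (!r.test(links[i])) continue;         // skip links that point elsewhere
    links[i].match(/qy\.php\?id=(\d+)/ig);
    var id = RegExp.$1;                      // legacy static capture, as used in the article
    links[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
    var title = RegExp.$1;
    results.push({ oldid: id, title: title });
}
// results holds the (id, title) pairs; the ASP version would now call
// GetContent(id) and SaveArticle(id, title, content) for each of them.
```

Note that `RegExp.$1` is a deprecated, global side channel: any intervening match overwrites it, which is why each capture is read immediately after its `match()` call.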
The next step is to import the contents of this Access database into the server database. One problem remains: the original articles are categorized, so they have to be classified manually during the import. The regular expression for parsing the links was already troublesome to write, though quite rigorous; parsing the categories with regular expressions as well would be far worse, because the category text is buried among the page's <td> tags and is hard to pin down. Even if such an expression could be written, the program would lose flexibility and become difficult to maintain, so for now I stop at this step.
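One low-tech way to handle the manual classification during the import is a hand-maintained lookup from the old article IDs to their categories, filled in by eye while browsing the source site. This is only a sketch; the map contents and function name below are hypothetical, not part of the original program.

```javascript
// Hypothetical id-to-category map, maintained by hand for the import step
var categoryByOldid = {
    "17": "Electronics",
    "42": "Food"
};

function categorize(oldid) {
    // Keys are strings because the ids were captured as text by the regex
    return categoryByOldid[String(oldid)] || "Uncategorized";
}
```

Anything not listed falls into a default bucket, which keeps the import running even when the map is incomplete.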