I was given the task of adding some content from a Chinese famous-brand website to our own site. The address is:
http://www.chinamp.org/mppro2.php
This page contains a list of links to articles; clicking a link opens the detail page for that article. Based on this pattern, and using regular expressions, XMLHTTP, server-side JScript, and ADO, I wrote a small program that captures the content into a local database. Once the data is in a local database, exporting it to the server database is easy. First, create an Access database containing a table (named articles in the code below) with the following structure:
Field     Type         Description
ID        AutoNumber   ID, primary key
oldid     Number       ID of the article in the old (source) data
title     Text         Title
content   Memo         Content
The specific implementation code is as follows:
<%@ Language="JScript" CodePage="936" %>
<!-- METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}" -->
<%
// Open the database
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>
<script language="JScript" runat="server">
// Fetch the list page from the remote site
function getData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}
// Use regular expressions to extract the matching links
function getLinks(str)
{
    // First pass: collect every <a ...>...</a> element
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)<\\/a>", "gi");
    var a = str.match(re);
    if (a == null) return; // no links on the page
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        // Skip links that do not point to an article page
        temp = /qy\.php\?id=(\d+)/i.exec(a[i]);
        if (temp == null) continue;
        t1 = temp[1]; // article ID
        // The link text is wrapped in a black <font> tag
        temp = /<font[^<>]+?color="#000000">(.*?)<\/font>/i.exec(a[i]);
        if (temp == null) continue; // no title found
        t2 = temp[1]; // article title
        saveArticle(t1, t2, getContent(t1));
    }
}
// Fetch the article that corresponds to the ID extracted from the link
function getContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    // The article body is inside a <span> with a 10.8pt font size
    var re = new RegExp("<span[^<>]+?style=\"font-size:\\s*10\\.8pt\">(.*?)<\\/span>", "i");
    var m = re.exec(str);
    return (m == null) ? "" : m[1];
}
// Save the article to the database
function saveArticle(oldid, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery;
    sQuery = "SELECT oldid, title, content FROM articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("oldid") = oldid;
    oRst("title") = title;
    oRst("content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " captured successfully" + "<br>");
}
</script>
<html>
<head>
<title>Capture articles</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
</head>
<body>
<% getLinks(getData()); %>
</body>
</html>
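The link-extraction step can be tried on its own, outside ASP. The following is a minimal sketch in plain JavaScript (for Node.js, not server-side JScript). It assumes the list page contains anchors shaped like those in the made-up `samplePage` string; the real markup may differ. The names `samplePage` and `extractLinks` are illustrative and not part of the original program.

```javascript
// Hypothetical sample of the list page markup; the real page may differ.
const samplePage = [
  '<a href="qy.php?id=12"><font face="Arial" color="#000000">First brand</font></a>',
  '<a href="index.php">Home</a>',
  '<a href="qy.php?id=34"><font size="2" color="#000000">Second brand</font></a>',
].join("\n");

// Return { id, title } for every anchor that points at qy.php?id=N
// and wraps its text in a black <font> tag.
function extractLinks(html) {
  // First pass: collect every <a ...>...</a> element.
  const anchors = html.match(/<a[^<>]+?>[\s\S]*?<\/a>/gi) || [];
  const links = [];
  for (const anchor of anchors) {
    const idMatch = /qy\.php\?id=(\d+)/i.exec(anchor);
    if (!idMatch) continue; // not an article link
    const titleMatch = /<font[^<>]+?color="#000000">(.*?)<\/font>/i.exec(anchor);
    if (!titleMatch) continue; // no title found
    links.push({ id: idMatch[1], title: titleMatch[1] });
  }
  return links;
}

console.log(extractLinks(samplePage));
```

Reading the capture groups from the array returned by `exec` avoids depending on the global `RegExp.$1`, which keeps the value of the previous successful match when a later match fails.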
The next step is to import the contents of this Access database into the database on the server. One thing is still missing, though: the original articles are categorized, and the program does not capture the categories, so they have to be assigned by hand during the import. The regular expression for the links was already hard to write, although it is at least fairly strict. Parsing the category with a regular expression would be much more troublesome: the category text sits inside a <TD>, the page contains many <TD> tags, and locating the one that holds the category is difficult. Even if such an expression could be written, the program would lose flexibility and become hard to maintain, so for now only this step has been done.
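One possible way around the <TD> problem is to scan the page once with a single alternation that matches either a category cell or an article link, and to attach the most recently seen category to each link. The sketch below (plain JavaScript for Node.js) only illustrates the idea. It assumes the category cells can be told apart by some attribute, here a made-up `class="cat"`, which the real page may not provide; `categorizedPage` and `linksWithCategory` are hypothetical names.

```javascript
// Hypothetical markup: category cells are ASSUMED to carry class="cat".
const categorizedPage = [
  '<td class="cat">Food</td>',
  '<td><a href="qy.php?id=1">Brand A</a></td>',
  '<td><a href="qy.php?id=2">Brand B</a></td>',
  '<td class="cat">Clothing</td>',
  '<td><a href="qy.php?id=3">Brand C</a></td>',
].join("\n");

function linksWithCategory(html) {
  // Alternative 1 captures a category name, alternative 2 an article ID.
  const re = /<td[^<>]*class="cat"[^<>]*>([^<]*)<\/td>|qy\.php\?id=(\d+)/gi;
  const result = [];
  let category = null; // most recent category seen so far
  let m;
  while ((m = re.exec(html)) !== null) {
    if (m[1] !== undefined) {
      category = m[1].trim();
    } else {
      result.push({ id: m[2], category: category });
    }
  }
  return result;
}

console.log(linksWithCategory(categorizedPage));
```

Because the scan runs in document order, no attempt is made to locate one specific <TD>; the approach only works if category cells really are distinguishable from the other cells.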
How can a regular expression be used to extract data from a table in a web page?
<table[^>]*>[\s\S]*?<a[\s\S]*?href=("(?<href>[^"]*)"|'(?<href>[^']*)'|(?<href>[^>\s]*))[^>]*?>(?<title>[\s\S]*?)</a>[\s\S]*?</table>
Test:
using System;
using System.Text.RegularExpressions;

string content = @"<table border=""0"" width=""11%"" class=""headline"">
<tr>
<td width=""100%"">
<p align=""center"">This is the first table</td>
<td>
<a href=""http://www.163.com"" target=_blank>Netease</a></td>
</tr>
</table>";
// The href group appears in three alternatives: double-quoted,
// single-quoted, and unquoted attribute values.
Regex htmlRegex = new Regex(
    @"<table[^>]*>[\s\S]*?<a[\s\S]*?href=(""(?<href>[^""]*)""|'(?<"
    + @"href>[^']*)'|(?<href>[^>\s]*))[^>]*?>(?<title>[\s\S]*?)</a>["
    + @"\s\S]*?</table>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
// content = htmlRegex.Replace(content, "");
MatchCollection mc = htmlRegex.Matches(content);
string[] div = new string[mc.Count];
for (int i = 0; i < mc.Count; i++)
{
    // int n = Int32.Parse(mc[i].Groups["content"].Value);
    Console.WriteLine(mc[i].Groups[0].Value); // the whole table
    Console.WriteLine(mc[i].Groups["href"].Value); // URL
    Console.WriteLine(mc[i].Groups["title"].Value); // link text
    // Console.WriteLine();
    // div[i] = mc[i].Groups["content"].Value;
}
Output:
<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">This is the first table</td>
<td>
<a href="http://www.163.com" target=_blank>Netease</a></td>
</tr>
</table>
http://www.163.com
Netease
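For comparison, the same pattern can be sketched in JavaScript (Node.js). .NET lets the three alternatives share the group name href; JavaScript engines accept a repeated group name only in recent versions, so this sketch gives each alternative its own name and takes whichever one matched. The names `hrefDq`, `hrefSq`, `hrefBare`, `sampleTable`, and `extractTableLinks` are illustrative assumptions, not part of the original code. As in the C# version, only the first link in each table is captured.

```javascript
// Same markup as the C# test above.
const sampleTable = `<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">This is the first table</td>
<td>
<a href="http://www.163.com" target=_blank>Netease</a></td>
</tr>
</table>`;

// One named group per quoting style: double-quoted, single-quoted, unquoted.
const tableRegex =
  /<table[^>]*>[\s\S]*?<a[\s\S]*?href=(?:"(?<hrefDq>[^"]*)"|'(?<hrefSq>[^']*)'|(?<hrefBare>[^>\s]*))[^>]*?>(?<title>[\s\S]*?)<\/a>[\s\S]*?<\/table>/gi;

function extractTableLinks(html) {
  const rows = [];
  for (const m of html.matchAll(tableRegex)) {
    const g = m.groups;
    // Exactly one of the three href groups is defined for each match.
    const href = [g.hrefDq, g.hrefSq, g.hrefBare].find((v) => v !== undefined);
    rows.push({ href: href, title: g.title.trim() });
  }
  return rows;
}

console.log(extractTableLinks(sampleTable));
```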