Crawling web pages into a local database with ASP

The task is to pull some content from a Chinese famous-brand website into our own site. The address is as follows:

http://www.chinamp.org/mppro2.php

This page contains a list of links to articles; clicking a link opens the detail page for that article. Based on this pattern, a small program combining regular expressions, XMLHTTP, JScript server-side script, and ADO can capture the content into a local database. Once the data is captured locally, it is easy to export it to the server database. First, create an Access database with the following structure:

Field     Type         Description
ID        AutoNumber   primary key
OldID     Number       ID of the record in the old data
Title     Text         article title
Content   Memo         article content

The specific implementation code is as follows:

<%@ LANGUAGE="JScript" CODEPAGE="936" %>
<!-- METADATA NAME="Microsoft ActiveX Data Objects 2.5 Library"
     TYPE="TypeLib" UUID="{00000205-0000-0010-8000-00AA006D2EA4}" -->
<%
// Open the database
try
{
    var strConnectionString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Server.MapPath("#db.mdb");
    var objConnection = Server.CreateObject("ADODB.Connection");
    objConnection.Open(strConnectionString);
}
catch (e)
{
    Response.Write(e.description);
    Response.End();
}
%>

<script language="JScript" runat="server">
// Fetch the listing page from the remote site
function getData()
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/mppro2.php", false);
    xhttp.send();
    return (xhttp.responseText);
}

// Use regular expressions to extract the matching links
function getLinks(str)
{
    var re = new RegExp("<a[^<>]+?>((.|\\n)*?)</a>", "gi");
    var a = str.match(re);    // first pass: every anchor on the page
    for (var i = 0; i < a.length; i++)
    {
        var t1, t2;
        var temp;
        var r = /qy\.php\?id=(\d+)/ig;
        if (!r.test(a[i])) continue;
        temp = a[i].match(/qy\.php\?id=(\d+)/ig);
        t1 = RegExp.$1;                               // article ID
        temp = a[i].match(/<font[^<>]+?color="#000000">(.*?)<\/font>/ig);
        t2 = RegExp.$1;                               // article title
        if (t1 == t2) continue;
        saveArticle(t1, t2, getContent(t1));
    }
}

// Use the extracted ID to fetch the corresponding article body
function getContent(id)
{
    var xhttp = new ActiveXObject("Microsoft.XMLHTTP");
    xhttp.open("POST", "http://www.chinamp.org/qy.php?id=" + id, false);
    xhttp.send();
    var str = xhttp.responseText;
    var re = new RegExp("<span[^<>]+?style=\"font-size:10\\.8pt\">(.*?)</span>", "gi");
    var a = str.match(re);
    return (RegExp.$1);
}

// Save one article to the database
function saveArticle(oldID, title, content)
{
    var oRst = Server.CreateObject("ADODB.Recordset");
    var sQuery = "SELECT OldID, Title, Content FROM Articles";
    oRst.Open(sQuery, objConnection, adOpenStatic, adLockPessimistic);
    oRst.AddNew();
    oRst("OldID") = oldID;
    oRst("Title") = title;
    oRst("Content") = content;
    oRst.Update();
    oRst.Close();
    Response.Write(title + " captured successfully<br>");
}
</script>

<html>
<head>
<title>Capture articles</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
</head>
<body>
<% getLinks(getData()); %>
</body>
</html>
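The link-extraction step above can also be sketched in modern JavaScript. This is a hypothetical standalone version with no ASP objects: it operates on an HTML string so the network fetch is out of the picture, and the `qy.php?id=N` and `color="#000000"` patterns are carried over from the page structure assumed above.

```javascript
// Extract (id, title) pairs from listing-page HTML.
// The qy.php?id=N link pattern and the #000000 font color are
// assumptions carried over from the original page structure.
function extractLinks(html) {
  const results = [];
  const anchorRe = /<a[^<>]+?>[\s\S]*?<\/a>/gi;
  for (const anchor of html.match(anchorRe) || []) {
    const idMatch = /qy\.php\?id=(\d+)/i.exec(anchor);
    if (!idMatch) continue; // not an article link
    const titleMatch = /<font[^<>]+?color="#000000">(.*?)<\/font>/i.exec(anchor);
    if (!titleMatch) continue; // no title text in the expected markup
    results.push({ id: idMatch[1], title: titleMatch[1] });
  }
  return results;
}

// A made-up fragment in the same shape the ASP code expects:
const sample =
  '<a href="qy.php?id=42"><font size="2" color="#000000">Some article</font></a>' +
  '<a href="index.php">home</a>';

console.log(extractLinks(sample));
// → [ { id: '42', title: 'Some article' } ]
```

Keeping the fetch separate from the parsing, as here, also makes the extraction logic testable without hitting the remote site.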

The next step is to import the content of this Access database into the database on the server. One complication remains: the original articles are categorized, so they have to be classified manually during import. Parsing the categories with a regular expression during link analysis would be very troublesome, because each category name sits inside a <TD>, and that page contains many <TD> tags; locating the exact <TD> that holds the category text is hard. Even if such an expression could be written, the program would lose flexibility and become difficult to maintain, so for now only this step is done.
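The fragility described above is easy to demonstrate: a naive <TD> regex matches every cell on the page, so the category cell could only be picked out by position or by its text, and both approaches break as soon as the layout changes. The HTML fragment below is made up purely for illustration.

```javascript
// A made-up page fragment with several <td> cells; only one holds a category.
const page =
  '<td>logo</td><td>Category: Food</td><td>date</td><td>links</td>';

// A naive cell regex matches every <td> indiscriminately...
const cells = [...page.matchAll(/<td[^>]*>(.*?)<\/td>/gi)].map(m => m[1]);
console.log(cells);
// → [ 'logo', 'Category: Food', 'date', 'links' ]

// ...so the category can only be found by its index or by matching its
// text, both of which are fragile whenever the page layout changes.
```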


How do you use a regular expression to extract the data in a webpage table?

<table[^>]*>[\s\S]*?<a[\s\S]*?href=("(?<href>[^"]*)"|'(?<href>[^']*)'|(?<href>[^>\s]*))[^>]*?>(?<title>[\s\S]*?)</a>[\s\S]*?</table>

Test:

string content = @"<table border=""0"" width=""11%"" class=""headline"">
<tr>
<td width=""100%"">
<p align=""center"">This is the first table</td>
<td>
<a href=""http://www.163.com"" target=_blank>Netease</a></td>
</tr>
</table>";
Regex htmlRegex = new Regex(
    @"<table[^>]*>[\s\S]*?<a[\s\S]*?href=(""(?<href>[^""]*)""|'(?<href>[^']*)'"
    + @"|(?<href>[^>\s]*))[^>]*?>(?<title>[\s\S]*?)</a>[\s\S]*?</table>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);
//content = htmlRegex.Replace(content, "");

MatchCollection mc = htmlRegex.Matches(content);
string[] div = new string[mc.Count];
for (int i = 0; i < mc.Count; i++)
{
    Console.WriteLine(mc[i].Groups[0].Value);        // the whole table
    Console.WriteLine(mc[i].Groups["href"].Value);   // the URL
    Console.WriteLine(mc[i].Groups["title"].Value);  // the link text
    //div[i] = mc[i].Groups["content"].Value;
}

Output:

<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">This is the first table</td>
<td>
<a href="http://www.163.com" target=_blank>Netease</a></td>
</tr>
</table>
http://www.163.com
Netease
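Named capture groups like (?<href>…) are not .NET-specific; modern JavaScript engines support the same syntax, so the idea can be tried outside C# as well. This simplified sketch handles only the double-quoted href form of the pattern above:

```javascript
// Simplified JavaScript version of the table/anchor regex above,
// handling only the double-quoted href form for brevity.
const tableRe =
  /<table[^>]*>[\s\S]*?<a[\s\S]*?href="(?<href>[^"]*)"[^>]*>(?<title>[\s\S]*?)<\/a>[\s\S]*?<\/table>/gi;

const content = `<table border="0" width="11%" class="headline">
<tr>
<td width="100%">
<p align="center">This is the first table</td>
<td>
<a href="http://www.163.com" target=_blank>Netease</a></td>
</tr>
</table>`;

for (const m of content.matchAll(tableRe)) {
  console.log(m.groups.href);   // → http://www.163.com
  console.log(m.groups.title);  // → Netease
}
```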
