This is a simple approach. A threaded version would also work, but I have not yet worked out how to shut down the worker thread cleanly, so I will not post that part.
Idea: fetch the content of the specified URL with WebRequest and WebResponse, then use a regular expression to match the part of the HTML we need. This requires analyzing the structure of the requested page and processing it accordingly. The following uses http://bbs.csdn.net/recommend_tech_topics as an example.
The http://bbs.csdn.net/recommend_tech_topics page looks like this:
We only need the post information in the middle of the page. Viewing the page source shows its structure:
The content we need sits inside the DIV with class="tit_1". Once this rule is known, the rest is easy. First, the code:
To make paging easier, I added previous-page and next-page links.
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="testcollection.aspx.cs" Inherits="testcollection" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public partial class testcollection : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Request URL has the form /recommend_tech_topics?page=2
        WebRequest myReq = WebRequest.Create(HttpUrlDomain + "/recommend_tech_topics?page=" + PageIndex);
        // The using blocks close the response and streams even if an exception is thrown.
        using (WebResponse myRes = myReq.GetResponse())
        using (Stream resStream = myRes.GetResponseStream())
        using (StreamReader sr = new StreamReader(resStream, Encoding.UTF8))
        {
            string rl;
            StringBuilder sb = new StringBuilder();
            while ((rl = sr.ReadLine()) != null)
            {
                sb.AppendLine(rl);
            }
            // [\s\S]* matches any characters, including line breaks.
            Regex regex = new Regex("<div class=\"list_1\">([\\s\\S]*)</div>([\\s\\S]*)<div class=\"page_nav\">", RegexOptions.Compiled);
            Match match = regex.Match(sb.ToString());
            if (match.Success)
                this.pageurlinfo.InnerHtml = match.Groups[0].Value;
        }
    }

    /// <summary>
    /// Gets the page number; falls back to "1" when the value is missing, non-numeric, or not positive.
    /// </summary>
    public string PageIndex
    {
        get
        {
            int page;
            string raw = Request.QueryString["page"];
            return (int.TryParse(raw, out page) && page > 0) ? raw : "1";
        }
    }

    public string HttpUrlDomain
    {
        get { return "http://bbs.csdn.net"; }
    }
}
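The previous-page and next-page links mentioned above could be built along these lines. This is only a minimal sketch: the class name PagingLinks, its methods, and the pagePath parameter are hypothetical and not part of the original page.

```csharp
using System;

public static class PagingLinks
{
    // Parses the raw "page" query value; missing, non-numeric,
    // or non-positive input falls back to page 1.
    public static int ParsePage(string raw)
    {
        int page;
        return int.TryParse(raw, out page) && page > 0 ? page : 1;
    }

    // URL of the previous page; never goes below page 1.
    public static string PreviousUrl(string pagePath, int current)
    {
        return pagePath + "?page=" + Math.Max(1, current - 1);
    }

    // URL of the next page; the last page number is unknown here,
    // so no upper bound is applied.
    public static string NextUrl(string pagePath, int current)
    {
        return pagePath + "?page=" + (current + 1);
    }
}
```

For example, the code-behind could assign PagingLinks.NextUrl("testcollection.aspx", PagingLinks.ParsePage(PageIndex)) to the href of a hyperlink control.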
Okay, let's see how it works:
What if we want to save the scraped content to a database? That is easy too: adjust the regular expression, write the matched items one by one into a temporary DataTable, and then either import the DataTable into the database or build SQL statements. Building SQL statements is the better option, and they must of course be parameterized.
match.Groups[i].Value
is the value captured by group i of the match.
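The DataTable and parameterized SQL steps might look roughly like the sketch below. Everything in it is an assumption for illustration: the link pattern (it assumes each post is a plain anchor tag), the PostImporter class, the Posts table, and its Title and Url columns would all need adjusting to the real page structure and schema.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using System.Text.RegularExpressions;

public static class PostImporter
{
    // Hypothetical pattern: assumes each post is a simple <a href="...">title</a> link.
    private static readonly Regex LinkRegex = new Regex(
        "<a\\s[^>]*?href=\"(?<url>[^\"]+)\"[^>]*>(?<title>[^<]+)</a>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Copies every match into a temporary DataTable, one row per post.
    public static DataTable ExtractPosts(string html)
    {
        DataTable dt = new DataTable("Posts");
        dt.Columns.Add("Title", typeof(string));
        dt.Columns.Add("Url", typeof(string));

        foreach (Match m in LinkRegex.Matches(html))
        {
            dt.Rows.Add(m.Groups["title"].Value.Trim(), m.Groups["url"].Value);
        }
        return dt;
    }

    // Inserts the rows with a parameterized statement, so scraped text
    // is never concatenated into the SQL string.
    // The Posts table and its column sizes are assumptions.
    public static void SavePosts(DataTable dt, string connectionString)
    {
        const string sql = "INSERT INTO Posts (Title, Url) VALUES (@title, @url)";
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@title", SqlDbType.NVarChar, 200);
            cmd.Parameters.Add("@url", SqlDbType.NVarChar, 500);
            conn.Open();
            foreach (DataRow row in dt.Rows)
            {
                cmd.Parameters["@title"].Value = row["Title"];
                cmd.Parameters["@url"].Value = row["Url"];
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```

The named groups title and url are read the same way as numbered groups, through m.Groups[...].Value. For larger batches, importing the DataTable in one step (for example with SqlBulkCopy) may be worth considering instead of row-by-row inserts.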