Collect data from a webpage

Source: Internet
Author: User

This is a simple approach. A multithreaded version is also possible, but I have not yet worked out how to shut down the worker thread cleanly, so I will not post that part.

Idea: fetch the content of the specified URL with WebRequest and WebResponse, then use a regular expression to match the part of the HTML we need. This requires analyzing the structure of the requested page and handling it accordingly. The following uses http://bbs.csdn.net/recommend_tech_topics as an example.

The http://bbs.csdn.net/recommend_tech_topics page looks like this:

We only need the post information in the middle of the page. Viewing the page source shows its structure:

We find that the content we need is located in the div with class="tit_1". Once this pattern is known, the rest is easy. First, the code:

To make paging easier, I also added previous-page and next-page links.

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="testcollection.aspx.cs" Inherits="testcollection" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public partial class testcollection : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // The request URL has the form /recommend_tech_topics?page=2
        WebRequest myReq = WebRequest.Create(HttpUrlDomain + "/recommend_tech_topics?page=" + PageIndex);
        StringBuilder sb = new StringBuilder();

        // The using blocks release the response and its stream even if reading throws.
        using (WebResponse myRes = myReq.GetResponse())
        using (Stream resStream = myRes.GetResponseStream())
        using (StreamReader sr = new StreamReader(resStream, Encoding.UTF8))
        {
            string rl;
            while ((rl = sr.ReadLine()) != null)
            {
                sb.AppendLine(rl);
            }
        }

        // [\s\S]*? matches any characters, line breaks included, up to the
        // closing </div> that directly precedes the page_nav div.
        Regex regex = new Regex("<div class=\"list_1\">([\\s\\S]*?)</div>([\\s]*)<div class=\"page_nav\">", RegexOptions.Compiled);
        Match match = regex.Match(sb.ToString());
        if (match.Success)
            this.pageurlinfo.InnerHtml = match.Groups[0].Value;
    }

    /// <summary>
    /// Gets the page number from the query string, defaulting to "1".
    /// </summary>
    public string PageIndex
    {
        get
        {
            int page;
            // TryParse avoids an exception when ?page= is missing or not a number.
            return int.TryParse(Request.QueryString["page"], out page) && page > 0
                ? page.ToString()
                : "1";
        }
    }

    public string HttpUrlDomain
    {
        get { return "http://bbs.csdn.net"; }
    }
}
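
The block-matching step in the page code above can be tried without any network access. The sketch below runs a similar expression against a small hand-written HTML fragment. The fragment, the class name ListBlockDemo and the method ExtractListBlock are made up for illustration and are not the real CSDN markup, and the sketch assumes the capture is meant to span arbitrary content between the two divs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ListBlockDemo
{
    // Hypothetical fragment that only imitates the layout described in the article.
    public const string SampleHtml =
        "<div class=\"list_1\">\n" +
        "  <div class=\"tit_1\"><a href=\"/topics/1\">First post</a></div>\n" +
        "  <div class=\"tit_1\"><a href=\"/topics/2\">Second post</a></div>\n" +
        "</div>\n" +
        "<div class=\"page_nav\">1 2 3</div>";

    // Returns the inner HTML of the list_1 block, or null if it is not found.
    public static string ExtractListBlock(string html)
    {
        // [\s\S]*? is lazy and also crosses line breaks, so the capture ends at
        // the first </div> that is directly followed by the page_nav div.
        Match m = Regex.Match(html,
            "<div class=\"list_1\">([\\s\\S]*?)</div>\\s*<div class=\"page_nav\">");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractListBlock(SampleHtml));
    }
}
```

Taking Groups[1] rather than Groups[0] here keeps the opening page_nav tag out of the result; whether that is preferable depends on what the page should display.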

OK, let's see how the page works:

What if we want to store the collected content in a database? That is easy too: adjust the regular expression, write the matched items one by one into a temporary DataTable, and then either import the DataTable into the database or build SQL statements (building SQL statements is the better option, and they must of course be parameterized).
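
A rough sketch of that idea follows. The per-post pattern, the column names Title and Url, the table name Posts, and the class and method names are all assumptions made for illustration; the real CSDN markup may need a different expression. The insert goes through the provider-neutral IDbConnection interface, so the caller supplies an open connection from whichever ADO.NET provider is in use.

```csharp
using System;
using System.Data;
using System.Text.RegularExpressions;

public static class PostImporter
{
    // Assumed shape of one post: <div class="tit_1"><a href="URL">TITLE</a></div>
    private static readonly Regex PostPattern = new Regex(
        "<div class=\"tit_1\">\\s*<a href=\"([^\"]*)\"[^>]*>([^<]*)</a>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Copies every match into a temporary DataTable, one row per post.
    public static DataTable ToDataTable(string html)
    {
        DataTable dt = new DataTable("Posts");
        dt.Columns.Add("Title", typeof(string));
        dt.Columns.Add("Url", typeof(string));

        foreach (Match m in PostPattern.Matches(html))
        {
            // Groups[1] is the href value, Groups[2] is the link text.
            dt.Rows.Add(m.Groups[2].Value.Trim(), m.Groups[1].Value);
        }
        return dt;
    }

    // Inserts the rows with a parameterized statement. The table and column
    // names are hypothetical, and the @name marker style is the SQL Server
    // convention; other providers may expect a different marker.
    public static int Save(DataTable dt, IDbConnection connection)
    {
        int inserted = 0;
        foreach (DataRow row in dt.Rows)
        {
            using (IDbCommand cmd = connection.CreateCommand())
            {
                cmd.CommandText = "INSERT INTO Posts (Title, Url) VALUES (@Title, @Url)";
                AddParameter(cmd, "@Title", row["Title"]);
                AddParameter(cmd, "@Url", row["Url"]);
                inserted += cmd.ExecuteNonQuery();
            }
        }
        return inserted;
    }

    private static void AddParameter(IDbCommand cmd, string name, object value)
    {
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }

    public static void Main()
    {
        string html = "<div class=\"tit_1\"><a href=\"/topics/1\">First post</a></div>";
        DataTable dt = ToDataTable(html);
        Console.WriteLine(dt.Rows.Count + " row(s) collected");
    }
}
```

Because the values travel as parameters rather than being concatenated into the SQL text, a post title containing quotes cannot change the statement.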

Here match.Groups[i].Value is the value captured by group i of the match.
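
As a small illustration of group numbering: Groups[0] is always the whole match, and Groups[1], Groups[2] and so on are the parenthesized captures, numbered by the position of their opening parenthesis. The input string, the pattern and the helper GroupValues below are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Returns the values of all groups of the first match, index 0 first.
    // An empty array means the pattern did not match.
    public static string[] GroupValues(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        if (!match.Success) return new string[0];

        string[] values = new string[match.Groups.Count];
        for (int i = 0; i < match.Groups.Count; i++)
        {
            values[i] = match.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string[] values = GroupValues(
            "<a href=\"/topics/1\">First post</a>",
            "<a href=\"([^\"]*)\">([^<]*)</a>");

        for (int i = 0; i < values.Length; i++)
        {
            Console.WriteLine("Groups[" + i + "] = " + values[i]);
        }
    }
}
```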
