Building an Enhanced Multithreaded Web Crawler in C#

Source: Internet
Author: User

A while back I helped a colleague at the company put together a crawler. It was not very polished, and since a company project now needs it, I made some modifications: the added features are URL image collection, image downloading, and threaded handling of image downloads from the collected URLs.

The idea in brief: first fetch the content of the initial URL, collect the images on that page, and collect its links; push the collected links into a queue, then keep collecting images and links from each queued URL, looping indefinitely.
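The loop just described can be sketched as a small, self-contained program. CrawlLoopSketch, Crawl, and the collectLinks callback are illustrative names standing in for the real HttpHelper methods shown later; this is a sketch of the queue logic under those assumptions, not the project's actual code.

```csharp
using System;
using System.Collections.Generic;

static class CrawlLoopSketch
{
    // Breadth-first crawl loop: visit the seed, queue every link found on it,
    // then keep dequeuing until the queue is empty or a page cap is hit.
    public static List<string> Crawl(string seedUrl,
        Func<string, List<string>> collectLinks, int maxPages)
    {
        var queue = new Queue<string>();
        var visited = new HashSet<string>();
        var order = new List<string>();
        queue.Enqueue(seedUrl);
        while (queue.Count > 0 && order.Count < maxPages)
        {
            string url = queue.Dequeue();
            if (!visited.Add(url)) continue;  // skip pages already seen
            order.Add(url);
            // In the real crawler this is also where the page's images
            // would be collected and queued for download.
            foreach (string link in collectLinks(url))
                queue.Enqueue(link);          // found links go back on the queue
        }
        return order;
    }
}
```

The maxPages cap is only there to make the sketch terminate; the article's crawler loops indefinitely. The visited set is what keeps that infinite loop from re-fetching the same pages.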

First, a screenshot of the crawler in action.

The handling of page-content crawling and URL crawling has been improved. Below is the code; there are still shortcomings, so please point them out!

Page-content crawling: HtmlCodeRequest.

Link crawling: GetHttpLinks, which uses a regex to filter the links out of the HTML.

Image crawling: GetHtmlImageUrlList, which uses a regex to filter the img tags out of the HTML.

All of these are written into one wrapper class, HttpHelper.

```csharp
/// <summary>Downloads the HTML source of the given URL.</summary>
/// <param name="url">The page URL</param>
/// <returns>The HTML, or an empty string on failure</returns>
public static string HtmlCodeRequest(string url)
{
    if (string.IsNullOrEmpty(url)) { return ""; }
    try
    {
        // Create the request
        HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create(url);
        httpRequest.KeepAlive = true;
        // Set the request method
        httpRequest.Method = "GET";
        // Set the header values
        httpRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        httpRequest.Accept = "*/*";
        httpRequest.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        httpRequest.ServicePoint.Expect100Continue = false;
        httpRequest.Timeout = 5000;
        httpRequest.AllowAutoRedirect = true; // whether to follow 302 redirects
        ServicePointManager.DefaultConnectionLimit = 30;
        // Get the response and read the body as UTF-8 text
        HttpWebResponse webRes = (HttpWebResponse)httpRequest.GetResponse();
        string content = string.Empty;
        using (System.IO.Stream stream = webRes.GetResponseStream())
        using (System.IO.StreamReader reader = new System.IO.StreamReader(stream, System.Text.Encoding.GetEncoding("utf-8")))
        {
            content = reader.ReadToEnd();
        }
        // Cancel the request and return the page content
        httpRequest.Abort();
        return content;
    }
    catch (Exception) { return ""; }
}

/// <summary>Extracts the URLs of all images in a page.</summary>
/// <param name="url">The page URL</param>
/// <returns>List of image URLs</returns>
public static List<string> GetHtmlImageUrlList(string url)
{
    string html = HttpHelper.HtmlCodeRequest(url);
    if (string.IsNullOrEmpty(html)) { return new List<string>(); }
    // Regular expression that matches img tags and captures the src value
    Regex regImg = new Regex(
        @"<img[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>",
        RegexOptions.IgnoreCase);
    // Search for matches and collect the captured URLs
    MatchCollection matches = regImg.Matches(html);
    List<string> sUrlList = new List<string>();
    foreach (Match match in matches)
        sUrlList.Add(match.Groups["imgUrl"].Value);
    return sUrlList;
}

/// <summary>Extracts the links in a page.</summary>
/// <param name="url">The page URL</param>
/// <returns>List of links</returns>
public static List<string> GetHttpLinks(string url)
{
    // Get the page content
    string html = HttpHelper.HtmlCodeRequest(url);
    if (string.IsNullOrEmpty(html)) { return new List<string>(); }
    // Match absolute http(s) links
    const string pattern2 = @"http(s)?://([\w\-]+\.)+[\w\-]+(/[\w\-./?%&=]*)?";
    Regex r2 = new Regex(pattern2, RegexOptions.IgnoreCase);
    MatchCollection m2 = r2.Matches(html);
    List<string> links = new List<string>();
    foreach (Match url2 in m2)
    {
        if (StringHelper.CheckUrlIsLegal(url2.ToString()) || !StringHelper.IsPureUrl(url2.ToString()) || links.Contains(url2.ToString()))
            continue;
        links.Add(url2.ToString());
    }
    // Match the href inside <a> tags
    const string pattern = @"(?i)<a\s[^>]*?href=(['""]?)(?!javascript|__doPostBack)(?<url>[^'""\s*#<>]+)[^>]*>";
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
    MatchCollection m = r.Matches(html);
    foreach (Match url1 in m)
    {
        string href1 = url1.Groups["url"].Value;
        if (!href1.Contains("http")) { href1 = Global.WebUrl + href1; }
        if (!StringHelper.IsPureUrl(href1) || links.Contains(href1)) continue;
        links.Add(href1);
    }
    return links;
}
```
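As a quick sanity check of the img-matching approach, the same named-group idea can be exercised standalone. ImgRegexDemo and ExtractImageUrls are illustrative names, and the pattern here is a simplified variant of the one above, not the project's exact regex:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImgRegexDemo
{
    // Simplified img-src pattern: capture the src value into a named group.
    static readonly Regex RegImg = new Regex(
        @"<img[^<>]*?\bsrc\s*=\s*[""']?(?<imgUrl>[^\s""'<>]+)[^<>]*?>",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractImageUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in RegImg.Matches(html))
            urls.Add(m.Groups["imgUrl"].Value);  // read the named group, as above
        return urls;
    }
}
```

This handles double-quoted, single-quoted, and unquoted src values, which is why the quote character after the equals sign is optional in the pattern.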


Image downloading here is limited to 200 queued tasks, with a thread wait of up to 5 seconds; the download itself is made through an asynchronously invoked delegate.

```csharp
public string DownloadImg(string url)
{
    if (!string.IsNullOrEmpty(url))
    {
        try
        {
            if (!url.Contains("http"))
            {
                url = Global.WebUrl + url;
            }
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 2000;
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
            // Whether to follow 302 redirects
            request.AllowAutoRedirect = true;
            WebResponse response = request.GetResponse();
            Stream reader = response.GetResponseStream();
            // File name
            string aFirstName = Guid.NewGuid().ToString();
            // Extension
            string aLastName = url.Substring(url.LastIndexOf(".") + 1, url.Length - url.LastIndexOf(".") - 1);
            FileStream writer = new FileStream(Global.FloderUrl + aFirstName + "." + aLastName, FileMode.OpenOrCreate, FileAccess.Write);
            byte[] buff = new byte[512];
            // Actual number of bytes read
            int c = 0;
            while ((c = reader.Read(buff, 0, buff.Length)) > 0)
            {
                writer.Write(buff, 0, c);
            }
            writer.Close();
            writer.Dispose();
            reader.Close();
            reader.Dispose();
            response.Close();
            return aFirstName + "." + aLastName;
        }
        catch (Exception)
        {
            return "error: address " + url;
        }
    }
    return "error: address is empty";
}
```
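The article does not show the throttling code itself, so here is one way the 200-task cap could be sketched with a semaphore. DownloadThrottleSketch, QueueDownload, and the downloadImg callback are assumptions for illustration; the original uses an asynchronously invoked delegate (e.g. BeginInvoke on .NET Framework), for which Task.Run is the modern stand-in here:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class DownloadThrottleSketch
{
    // Cap on concurrent download tasks, matching the 200-task limit above.
    static readonly SemaphoreSlim Slots = new SemaphoreSlim(200);

    // Queues one download; downloadImg stands in for the DownloadImg method.
    public static Task<string> QueueDownload(string url, Func<string, string> downloadImg)
    {
        return Task.Run(() =>
        {
            Slots.Wait();              // block while 200 downloads are already running
            try { return downloadImg(url); }
            finally { Slots.Release(); }
        });
    }
}
```

Releasing the semaphore in a finally block ensures a failed download still frees its slot, so the crawler cannot deadlock after 200 errors.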

That's all for now; the rest is up to readers to improve on their own. You are welcome to discuss it with the author.
