Building an Enhanced Multithreaded Web Crawler in C#

Source: Internet
Author: User

A while back I helped a colleague at the company put together a crawler. It was not very polished, and since a company project now needs it, I made some modifications: the added features are URL image collection, image downloading, and threaded handling of image downloads from the collected URLs.

The idea in brief: first fetch the content of the initial URL, collect the images on that page, and collect its links; push the collected links into a queue, then keep collecting images and links from each queued URL, looping indefinitely.
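The loop just described can be sketched as a small, self-contained program. CrawlLoopSketch, Crawl, and the collectLinks callback are illustrative names standing in for the real HttpHelper methods shown later; this is a sketch of the queue logic under those assumptions, not the project's actual code.

```csharp
using System;
using System.Collections.Generic;

static class CrawlLoopSketch
{
    // Breadth-first crawl loop: visit the seed, queue every link found on it,
    // then keep dequeuing until the queue is empty or a page cap is hit.
    public static List<string> Crawl(string seedUrl,
        Func<string, List<string>> collectLinks, int maxPages)
    {
        var queue = new Queue<string>();
        var visited = new HashSet<string>();
        var order = new List<string>();
        queue.Enqueue(seedUrl);
        while (queue.Count > 0 && order.Count < maxPages)
        {
            string url = queue.Dequeue();
            if (!visited.Add(url)) continue;  // skip pages already seen
            order.Add(url);
            // In the real crawler this is also where the page's images
            // would be collected and queued for download.
            foreach (string link in collectLinks(url))
                queue.Enqueue(link);          // found links go back on the queue
        }
        return order;
    }
}
```

The maxPages cap is only there to make the sketch terminate; the article's crawler loops indefinitely. The visited set is what keeps that infinite loop from re-fetching the same pages.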

First, a screenshot of the crawler in action.

The handling of page-content crawling and URL crawling has been improved. Below is the code; there are still shortcomings, so please point them out!

Page-content crawling: HtmlCodeRequest.

Link crawling: GetHttpLinks, which uses a regex to filter the links out of the HTML.

Image crawling: GetHtmlImageUrlList, which uses a regex to filter the img tags out of the HTML.

All of these are written into one wrapper class, HttpHelper.

```csharp
/// <summary>Downloads the HTML source of the given URL.</summary>
/// <param name="url">The page URL</param>
/// <returns>The HTML, or an empty string on failure</returns>
public static string HtmlCodeRequest(string url)
{
    if (string.IsNullOrEmpty(url)) { return ""; }
    try
    {
        // Create the request
        HttpWebRequest httpRequest = (HttpWebRequest)WebRequest.Create(url);
        httpRequest.KeepAlive = true;
        // Set the request method
        httpRequest.Method = "GET";
        // Set the header values
        httpRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
        httpRequest.Accept = "*/*";
        httpRequest.Headers.Add("Accept-Language", "zh-cn,en-us;q=0.5");
        httpRequest.ServicePoint.Expect100Continue = false;
        httpRequest.Timeout = 5000;
        httpRequest.AllowAutoRedirect = true; // whether to follow 302 redirects
        ServicePointManager.DefaultConnectionLimit = 30;
        // Get the response and read the body as UTF-8 text
        HttpWebResponse webRes = (HttpWebResponse)httpRequest.GetResponse();
        string content = string.Empty;
        using (System.IO.Stream stream = webRes.GetResponseStream())
        using (System.IO.StreamReader reader = new System.IO.StreamReader(stream, System.Text.Encoding.GetEncoding("utf-8")))
        {
            content = reader.ReadToEnd();
        }
        // Cancel the request and return the page content
        httpRequest.Abort();
        return content;
    }
    catch (Exception) { return ""; }
}

/// <summary>Extracts the URLs of all images in a page.</summary>
/// <param name="url">The page URL</param>
/// <returns>List of image URLs</returns>
public static List<string> GetHtmlImageUrlList(string url)
{
    string html = HttpHelper.HtmlCodeRequest(url);
    if (string.IsNullOrEmpty(html)) { return new List<string>(); }
    // Regular expression that matches img tags and captures the src value
    Regex regImg = new Regex(
        @"<img[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>",
        RegexOptions.IgnoreCase);
    // Search for matches and collect the captured URLs
    MatchCollection matches = regImg.Matches(html);
    List<string> sUrlList = new List<string>();
    foreach (Match match in matches)
        sUrlList.Add(match.Groups["imgUrl"].Value);
    return sUrlList;
}

/// <summary>Extracts the links in a page.</summary>
/// <param name="url">The page URL</param>
/// <returns>List of links</returns>
public static List<string> GetHttpLinks(string url)
{
    // Get the page content
    string html = HttpHelper.HtmlCodeRequest(url);
    if (string.IsNullOrEmpty(html)) { return new List<string>(); }
    // Match absolute http(s) links
    const string pattern2 = @"http(s)?://([\w\-]+\.)+[\w\-]+(/[\w\-./?%&=]*)?";
    Regex r2 = new Regex(pattern2, RegexOptions.IgnoreCase);
    MatchCollection m2 = r2.Matches(html);
    List<string> links = new List<string>();
    foreach (Match url2 in m2)
    {
        if (StringHelper.CheckUrlIsLegal(url2.ToString()) || !StringHelper.IsPureUrl(url2.ToString()) || links.Contains(url2.ToString()))
            continue;
        links.Add(url2.ToString());
    }
    // Match the href inside <a> tags
    const string pattern = @"(?i)<a\s[^>]*?href=(['""]?)(?!javascript|__doPostBack)(?<url>[^'""\s*#<>]+)[^>]*>";
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
    MatchCollection m = r.Matches(html);
    foreach (Match url1 in m)
    {
        string href1 = url1.Groups["url"].Value;
        if (!href1.Contains("http")) { href1 = Global.WebUrl + href1; }
        if (!StringHelper.IsPureUrl(href1) || links.Contains(href1)) continue;
        links.Add(href1);
    }
    return links;
}
```
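As a quick sanity check of the img-matching approach, the same named-group idea can be exercised standalone. ImgRegexDemo and ExtractImageUrls are illustrative names, and the pattern here is a simplified variant of the one above, not the project's exact regex:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImgRegexDemo
{
    // Simplified img-src pattern: capture the src value into a named group.
    static readonly Regex RegImg = new Regex(
        @"<img[^<>]*?\bsrc\s*=\s*[""']?(?<imgUrl>[^\s""'<>]+)[^<>]*?>",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractImageUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in RegImg.Matches(html))
            urls.Add(m.Groups["imgUrl"].Value);  // read the named group, as above
        return urls;
    }
}
```

This handles double-quoted, single-quoted, and unquoted src values, which is why the quote character after the equals sign is optional in the pattern.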


Image downloading here is limited to 200 queued tasks, with a thread wait of up to 5 seconds; the download itself is made through an asynchronously invoked delegate.

```csharp
public string DownloadImg(string url)
{
    if (!string.IsNullOrEmpty(url))
    {
        try
        {
            if (!url.Contains("http"))
            {
                url = Global.WebUrl + url;
            }
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = 2000;
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705)";
            // Whether to follow 302 redirects
            request.AllowAutoRedirect = true;
            WebResponse response = request.GetResponse();
            Stream reader = response.GetResponseStream();
            // File name
            string aFirstName = Guid.NewGuid().ToString();
            // Extension
            string aLastName = url.Substring(url.LastIndexOf(".") + 1, url.Length - url.LastIndexOf(".") - 1);
            FileStream writer = new FileStream(Global.FloderUrl + aFirstName + "." + aLastName, FileMode.OpenOrCreate, FileAccess.Write);
            byte[] buff = new byte[512];
            // Actual number of bytes read
            int c = 0;
            while ((c = reader.Read(buff, 0, buff.Length)) > 0)
            {
                writer.Write(buff, 0, c);
            }
            writer.Close();
            writer.Dispose();
            reader.Close();
            reader.Dispose();
            response.Close();
            return aFirstName + "." + aLastName;
        }
        catch (Exception)
        {
            return "error: address " + url;
        }
    }
    return "error: address is empty";
}
```
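The article does not show the throttling code itself, so here is one way the 200-task cap could be sketched with a semaphore. DownloadThrottleSketch, QueueDownload, and the downloadImg callback are assumptions for illustration; the original uses an asynchronously invoked delegate (e.g. BeginInvoke on .NET Framework), for which Task.Run is the modern stand-in here:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class DownloadThrottleSketch
{
    // Cap on concurrent download tasks, matching the 200-task limit above.
    static readonly SemaphoreSlim Slots = new SemaphoreSlim(200);

    // Queues one download; downloadImg stands in for the DownloadImg method.
    public static Task<string> QueueDownload(string url, Func<string, string> downloadImg)
    {
        return Task.Run(() =>
        {
            Slots.Wait();              // block while 200 downloads are already running
            try { return downloadImg(url); }
            finally { Slots.Release(); }
        });
    }
}
```

Releasing the semaphore in a finally block ensures a failed download still frees its slot, so the crawler cannot deadlock after 200 errors.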

That's all for now; the rest is up to readers to improve on their own. You are welcome to discuss it with the author.
