In the previous installment of "Web crawler with C#" we implemented the network communication part; now we continue with the crawler implementation.

3. Saving the page file

This part can be as simple or as complex as you like. If you just want to save the entire HTML source, writing it straight to a file is enough.
private void SaveContents(string html, string url)
{
    if (string.IsNullOrEmpty(html))   // check that the HTML string is valid
    {
        return;
    }

    string path = string.Format("{0}\\{1}.txt", _path, _index++);   // generate the file name

    try
    {
        using (StreamWriter fs = new StreamWriter(path))
        {
            fs.Write(html);   // write the file
        }
    }
    catch (IOException ioe)
    {
        MessageBox.Show("SaveContents IO " + ioe.Message + " path=" + path);
    }

    if (ContentsSaved != null)
    {
        _ui.Dispatcher.Invoke(ContentsSaved, path, url);   // raise the file-saved event
    }
}
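The save routine above boils down to "skip empty documents, write the HTML under an incrementing index". A minimal sketch of that idea in Python, for illustration only (the `<index>.txt` naming scheme mirrors the C# code; `save_contents` is a hypothetical name):

```python
import os

def save_contents(html: str, out_dir: str, index: int):
    """Save raw HTML to '<out_dir>/<index>.txt'; return the path, or None if empty."""
    if not html:                      # skip invalid/empty documents
        return None
    path = os.path.join(out_dir, f"{index}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)                 # write the file
    return path
```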
Near the end of SaveContents, after the file has been saved, an event is raised; client code can subscribe to it beforehand.
public delegate void ContentsSavedHandler(string path, string url);

/// <summary>
/// Raised after a file has been saved locally
/// </summary>
public event ContentsSavedHandler ContentsSaved = null;
4. Extracting page links

Extracting links can be done with a regular expression; if you are unfamiliar with them, search online.

The following pattern will match the links in a page:

http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?
The code in detail:

private string[] GetLinks(string html)
{
    const string pattern = @"http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?";
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);   // build the regular expression
    MatchCollection m = r.Matches(html);                     // collect the matches
    string[] links = new string[m.Count];

    for (int i = 0; i < m.Count; i++)
    {
        links[i] = m[i].ToString();   // extract each result
    }
    return links;
}
5. Filtering links

We do not need to download every link we find, so we filter out the ones we do not want.

These are generally:

- Links that have already been downloaded
- Links that are too deep
- Other unwanted resources: images, CSS files, etc.
// check whether the link has been downloaded or is already in the not-downloaded collection
private bool UrlExists(string url)
{
    bool result = _urlsUnload.ContainsKey(url);
    result |= _urlsLoaded.ContainsKey(url);
    return result;
}

private bool UrlAvailable(string url)
{
    if (UrlExists(url))
    {
        return false;   // already exists
    }
    if (url.Contains(".jpg") || url.Contains(".gif")
        || url.Contains(".png") || url.Contains(".css")
        || url.Contains(".js"))
    {
        return false;   // filter out images and other resources
    }
    return true;
}

private void AddUrls(string[] urls, int depth)
{
    if (depth >= _maxDepth)
    {
        return;   // depth too large
    }

    foreach (string url in urls)
    {
        string cleanUrl = url.Trim();       // remove leading and trailing whitespace
        cleanUrl = cleanUrl.TrimEnd('/');   // uniformly drop any trailing '/'
        if (UrlAvailable(cleanUrl))
        {
            if (cleanUrl.Contains(_baseUrl))
            {
                _urlsUnload.Add(cleanUrl, depth);   // internal link: add to the not-downloaded collection
            }
            else
            {
                // handle external links here
            }
        }
    }
}
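The same filter-then-queue logic can be sketched in Python for illustration (the substring-based extension check and the function names mirror the C# code above; they are not a robust URL parser):

```python
def url_available(url: str, loaded: set, unload: dict) -> bool:
    """Mirror of UrlAvailable: reject seen URLs and static resources."""
    if url in loaded or url in unload:
        return False                      # already downloaded or already queued
    if any(ext in url for ext in (".jpg", ".gif", ".png", ".css", ".js")):
        return False                      # skip images and other static resources
    return True

def add_urls(urls, depth, max_depth, base_url, loaded, unload):
    """Mirror of AddUrls: normalize each URL and queue internal links."""
    if depth >= max_depth:
        return                            # depth too large
    for url in urls:
        clean = url.strip().rstrip("/")   # trim whitespace and any trailing '/'
        if url_available(clean, loaded, unload) and base_url in clean:
            unload[clean] = depth         # internal link: queue for download
```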
In AddUrls, _baseUrl is the base address of the crawl, e.g. http://news.sina.com.cn/, which is stored as news.sina.com.cn. When a URL contains this string, it is a link under that base site; otherwise it is an external link.

_baseUrl is derived as follows (_rootUrl is the first URL to download):
/// <summary>
/// The root URL to download
/// </summary>
public string RootUrl
{
    get
    {
        return _rootUrl;
    }
    set
    {
        if (!value.Contains("http://"))
        {
            _rootUrl = "http://" + value;
        }
        else
        {
            _rootUrl = value;
        }
        _baseUrl = _rootUrl.Replace("www.", "");     // drop "www." so the whole site matches
        _baseUrl = _baseUrl.Replace("http://", "");  // drop the protocol name
        _baseUrl = _baseUrl.TrimEnd('/');            // drop the trailing '/'
    }
}
At this point, the basic crawler functionality is complete.

Finally, the source code and a demo program are attached: the crawler source is in Spider.cs, the demo is a WPF program, and Test is a single-threaded console version.
Baidu Cloud Network Disk Link: Http://pan.baidu.com/s/1pKMfI8F Password: 3vzh
GJM: Reprinted from http://www.cnblogs.com/Jiajun/archive/2012/06/17/2552458.html on 2016-11-16. If the author's copyright is affected, please contact me at [email protected].
In the next installment, we'll look at ways to extract useful information from a web page, so stay tuned.