GJM: Using C# to implement a web crawler (II) [reprint]


On a "network crawler with C #" We have implemented the network communication part, then continue to discuss the crawler implementation

3. Saving page files

This part can be as simple or as complex as you like. If you only want to keep the raw HTML, writing it straight to a file is enough.

private void SaveContents(string html, string url)
{
    if (string.IsNullOrEmpty(html))    // check that the HTML string is valid
    {
        return;
    }
    string path = string.Format("{0}\\{1}.txt", _path, _index++);    // generate the file name

    try
    {
        using (StreamWriter fs = new StreamWriter(path))
        {
            fs.Write(html);    // write the file
        }
    }
    catch (IOException ioe)
    {
        MessageBox.Show("SaveContents IO " + ioe.Message + " path=" + path);
    }

    if (ContentsSaved != null)
    {
        _ui.Dispatcher.Invoke(ContentsSaved, path, url);    // raise the save-file event
    }
}
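SaveContents relies on a few fields of the crawler class (_path, _index, _ui, plus the ContentsSaved event declared below). The names come from the method above; the types and initial values in this sketch are assumptions for illustration only.

// Assumed supporting fields (illustrative types and values, not the original declarations):
private string _path = @"D:\spider";      // directory where downloaded pages are written
private int _index = 0;                   // running index used to name the saved files
private System.Windows.Threading.DispatcherObject _ui;   // WPF object used to marshal the event onto the UI thread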

At the end of the method an event is raised after the file has been saved; the client program can subscribe to it beforehand.

public delegate void ContentsSavedHandler(string path, string url);

/// <summary>
/// Raised after a page has been saved to a local file
/// </summary>
public event ContentsSavedHandler ContentsSaved = null;
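As a quick illustration, a client can subscribe to the event before starting the crawl. This is a hypothetical usage sketch: the Spider class name comes from the attached source (Spider.cs), but the exact construction and start-up call may differ.

Spider spider = new Spider();
spider.ContentsSaved += (path, url) =>
{
    // Invoked on the UI thread via the Dispatcher; update the UI or log progress here.
    Console.WriteLine("Saved " + url + " to " + path);
};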

4. Extracting page links

Links can be extracted with a regular expression; if you are not familiar with regular expressions, a quick web search will cover the basics.

The following pattern matches links in a page:

http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?

The code is shown below:

private string[] GetLinks(string html)
{
    const string pattern = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);    // build the regular expression
    MatchCollection m = r.Matches(html);                      // collect all matches
    string[] links = new string[m.Count];

    for (int i = 0; i < m.Count; i++)
    {
        links[i] = m[i].ToString();                           // extract each matched link
    }
    return links;
}
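To see what the pattern extracts, here is a small illustrative call with made-up HTML. Both URLs match the expression; the image link is only rejected later by the UrlAvailable filter described in the next section.

string sample = "<a href=\"http://news.sina.com.cn/china/\">news</a>" +
                "<img src=\"http://i1.sinaimg.cn/logo.png\" />";
string[] links = GetLinks(sample);
// links[0] == "http://news.sina.com.cn/china/"
// links[1] == "http://i1.sinaimg.cn/logo.png"   (filtered out later by UrlAvailable)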


5. Filtering links

We do not need to download every link, so we filter out the ones we do not want.

These unwanted links generally include:

    • Links that have already been downloaded
    • Links that exceed the maximum crawl depth
    • Other unneeded resources such as images, CSS files, and scripts
// Determine whether the link has already been downloaded or is already in the not-yet-downloaded collection
private bool UrlExists(string url)
{
    bool result = _urlsUnload.ContainsKey(url);
    result |= _urlsLoaded.ContainsKey(url);
    return result;
}

private bool UrlAvailable(string url)
{
    if (UrlExists(url))
    {
        return false;    // already known
    }
    if (url.Contains(".jpg") || url.Contains(".gif")
        || url.Contains(".png") || url.Contains(".css")
        || url.Contains(".js"))
    {
        return false;    // filter out images and other resources
    }
    return true;
}

private void AddUrls(string[] urls, int depth)
{
    if (depth >= _maxDepth)
    {
        return;    // depth too large
    }
    foreach (string url in urls)
    {
        string cleanUrl = url.Trim();         // remove leading and trailing whitespace
        cleanUrl = cleanUrl.TrimEnd('/');     // remove any trailing '/'
        if (UrlAvailable(cleanUrl))
        {
            if (cleanUrl.Contains(_baseUrl))
            {
                _urlsUnload.Add(cleanUrl, depth);    // internal link: add to the not-yet-downloaded collection
            }
            else
            {
                // external link handling
            }
        }
    }
}
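The following illustrative call shows how the filter behaves, assuming _baseUrl is "news.sina.com.cn" and _maxDepth is greater than 1; the URLs are hypothetical values chosen to hit each branch.

string[] found =
{
    "http://news.sina.com.cn/china/",    // internal page  -> added to _urlsUnload
    "http://i1.sinaimg.cn/logo.png",     // image resource -> rejected by UrlAvailable
    "http://www.weibo.com/"              // external link  -> falls into the external-link branch
};
AddUrls(found, 1);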

_baseUrl, used in AddUrls above, is the base address of the crawl. For example, http://news.sina.com.cn/ is stored as news.sina.com.cn; when a URL contains this string it is considered a link under the base site, otherwise it is an external link.

_baseUrl is derived as follows; _rootUrl is the first URL to be downloaded:

/// <summary>
/// The root URL to download
/// </summary>
public string RootUrl
{
    get
    {
        return _rootUrl;
    }
    set
    {
        if (!value.Contains("http://"))
        {
            _rootUrl = "http://" + value;
        }
        else
        {
            _rootUrl = value;
        }
        _baseUrl = _rootUrl.Replace("www.", "");       // drop "www." for the whole site
        _baseUrl = _baseUrl.Replace("http://", "");    // drop the protocol name
        _baseUrl = _baseUrl.TrimEnd('/');              // drop any trailing '/'
    }
}
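For example (illustrative values, hypothetical Spider instance), assigning the Sina news address mentioned above gives:

Spider spider = new Spider();
spider.RootUrl = "http://news.sina.com.cn/";
// _rootUrl -> "http://news.sina.com.cn/"
// _baseUrl -> "news.sina.com.cn"   (protocol, "www." and trailing '/' removed)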


At this point the basic crawler functionality is complete.

Finally, the source code and demo program are attached: the crawler source is in Spider.cs, the demo is a WPF program, and the test project is a single-threaded console version.

Baidu Cloud disk link: http://pan.baidu.com/s/1pKMfI8F (password: 3vzh)

GJM: Reprinted on 2016-11-16 from http://www.cnblogs.com/Jiajun/archive/2012/06/17/2552458.html. If this reprint infringes the author's copyright, please contact me at [email protected].

In the next installment we will look at ways to extract useful information from a web page. Stay tuned.
