In the previous installment of "Web crawler with C#" we implemented the network communication part; now we continue with the crawler implementation.

3. Saving the page file

This part can be as simple or as complex as you like. If you just want to save the entire HTML source, writing it straight to a file is enough.
private void SaveContents(string html, string url)
{
    if (string.IsNullOrEmpty(html))   // check that the HTML string is valid
    {
        return;
    }

    string path = string.Format("{0}\\{1}.txt", _path, _index++);   // generate the file name

    try
    {
        using (StreamWriter fs = new StreamWriter(path))
        {
            fs.Write(html);   // write the file
        }
    }
    catch (IOException ioe)
    {
        MessageBox.Show("SaveContents IO " + ioe.Message + " path=" + path);
    }

    if (ContentsSaved != null)
    {
        _ui.Dispatcher.Invoke(ContentsSaved, path, url);   // raise the file-saved event
    }
}
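The save routine above boils down to "skip empty documents, write the HTML under an incrementing index". A minimal sketch of that idea in Python, for illustration only (the `<index>.txt` naming scheme mirrors the C# code; `save_contents` is a hypothetical name):

```python
import os

def save_contents(html: str, out_dir: str, index: int):
    """Save raw HTML to '<out_dir>/<index>.txt'; return the path, or None if empty."""
    if not html:                      # skip invalid/empty documents
        return None
    path = os.path.join(out_dir, f"{index}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)                 # write the file
    return path
```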
Near the end of SaveContents, after the file has been saved, an event is raised; client code can subscribe to it beforehand.
public delegate void ContentsSavedHandler(string path, string url);

/// <summary>
/// Raised after a file has been saved locally
/// </summary>
public event ContentsSavedHandler ContentsSaved = null;
4. Extracting page links

Extracting links can be done with a regular expression; if you are unfamiliar with them, search online.

The following pattern will match the links in a page:

http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?
The code in detail:

private string[] GetLinks(string html)
{
    const string pattern = @"http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?";
    Regex r = new Regex(pattern, RegexOptions.IgnoreCase);   // build the regular expression
    MatchCollection m = r.Matches(html);                     // collect the matches
    string[] links = new string[m.Count];

    for (int i = 0; i < m.Count; i++)
    {
        links[i] = m[i].ToString();   // extract each result
    }
    return links;
}
5. Filtering links

We do not need to download every link we find, so we filter out the ones we do not want.

These are generally:

- Links that have already been downloaded
- Links that are too deep
- Other unwanted resources: images, CSS files, etc.
// check whether the link has been downloaded or is already in the not-downloaded collection
private bool UrlExists(string url)
{
    bool result = _urlsUnload.ContainsKey(url);
    result |= _urlsLoaded.ContainsKey(url);
    return result;
}

private bool UrlAvailable(string url)
{
    if (UrlExists(url))
    {
        return false;   // already exists
    }
    if (url.Contains(".jpg") || url.Contains(".gif")
        || url.Contains(".png") || url.Contains(".css")
        || url.Contains(".js"))
    {
        return false;   // filter out images and other resources
    }
    return true;
}

private void AddUrls(string[] urls, int depth)
{
    if (depth >= _maxDepth)
    {
        return;   // depth too large
    }

    foreach (string url in urls)
    {
        string cleanUrl = url.Trim();       // remove leading and trailing whitespace
        cleanUrl = cleanUrl.TrimEnd('/');   // uniformly drop any trailing '/'
        if (UrlAvailable(cleanUrl))
        {
            if (cleanUrl.Contains(_baseUrl))
            {
                _urlsUnload.Add(cleanUrl, depth);   // internal link: add to the not-downloaded collection
            }
            else
            {
                // handle external links here
            }
        }
    }
}
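The same filter-then-queue logic can be sketched in Python for illustration (the substring-based extension check and the function names mirror the C# code above; they are not a robust URL parser):

```python
def url_available(url: str, loaded: set, unload: dict) -> bool:
    """Mirror of UrlAvailable: reject seen URLs and static resources."""
    if url in loaded or url in unload:
        return False                      # already downloaded or already queued
    if any(ext in url for ext in (".jpg", ".gif", ".png", ".css", ".js")):
        return False                      # skip images and other static resources
    return True

def add_urls(urls, depth, max_depth, base_url, loaded, unload):
    """Mirror of AddUrls: normalize each URL and queue internal links."""
    if depth >= max_depth:
        return                            # depth too large
    for url in urls:
        clean = url.strip().rstrip("/")   # trim whitespace and any trailing '/'
        if url_available(clean, loaded, unload) and base_url in clean:
            unload[clean] = depth         # internal link: queue for download
```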
In AddUrls, _baseUrl is the base address of the crawl, e.g. http://news.sina.com.cn/, which is stored as news.sina.com.cn. When a URL contains this string, it is a link under that base site; otherwise it is an external link.

_baseUrl is derived as follows (_rootUrl is the first URL to download):
/// <summary>
/// The root URL to download
/// </summary>
public string RootUrl
{
    get
    {
        return _rootUrl;
    }
    set
    {
        if (!value.Contains("http://"))
        {
            _rootUrl = "http://" + value;
        }
        else
        {
            _rootUrl = value;
        }
        _baseUrl = _rootUrl.Replace("www.", "");     // drop "www." so the whole site matches
        _baseUrl = _baseUrl.Replace("http://", "");  // drop the protocol name
        _baseUrl = _baseUrl.TrimEnd('/');            // drop the trailing '/'
    }
}
At this point, the basic crawler functionality is complete.

Finally, the source code and a demo program are attached: the crawler source is in Spider.cs, the demo is a WPF program, and Test is a single-threaded console version.
Baidu Cloud Network Disk Link: Http://pan.baidu.com/s/1pKMfI8F Password: 3vzh
GJM: Reprinted from http://www.cnblogs.com/Jiajun/archive/2012/06/17/2552458.html on 2016-11-16. If the author's copyright is affected, please contact me at [email protected].
In the next installment, we'll look at ways to extract useful information from a web page, so stay tuned.