Crawlers, also known as web crawlers or spiders, are mainly used for data collection; many content websites use crawlers to capture data. This series (I don't yet know how many articles it will take) aims to implement a basic crawler program (framework). Development language: C#.
A crawler needs to filter the target data out of the pages it continuously fetches. To capture data continuously, we collect the set of URLs on each page, request each of those URLs in turn, and analyze the returned data; we can then extract the data we need based on the HTML DOM structure we have analyzed.
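As a hypothetical sketch of the "filter out the URLs" step (not part of the article's code), the snippet below pulls `href` values out of an HTML string with a regular expression. A real crawler would be better served by a proper HTML parser, since regex-based extraction is fragile against unusual markup; the class and method names here are my own invention for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Extracts href attribute values from anchor tags in an HTML string.
    // NOTE: a simplified illustration; production code should use an HTML parser.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        // Matches href="..." or href='...' and captures the URL inside the quotes.
        var matches = Regex.Matches(html, "href\\s*=\\s*[\"']([^\"']+)[\"']",
                                    RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```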
Generally, we start from a root node, that is, a root URL, and then traverse its children as in a tree structure. For example, the home page of a website contains the URLs of each navigation item: URL 1, URL 2, URL 3, and so on. In other words, once we fetch the data of the website's home page and filter out all the URLs in it, we have the direct children of this root URL. We can then fetch the data of URL 1 to obtain its direct children, and by repeating this process we eventually build up a huge tree structure.
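The root-then-children traversal described above can be sketched as a breadth-first walk over URLs. This is my own illustrative sketch, not the article's code: the `GetChildUrls` delegate stands in for "fetch the page and filter out its URLs", and the visited set prevents fetching the same page twice (real sites link back to each other, so the "tree" is actually a graph).

```csharp
using System;
using System.Collections.Generic;

public class UrlTreeWalker
{
    // Hypothetical hook: given a URL, return the URLs found on that page.
    public Func<string, IEnumerable<string>> GetChildUrls;

    // Breadth-first traversal starting from the root URL.
    // maxPages caps the crawl so it terminates on large sites.
    public List<string> Walk(string rootUrl, int maxPages)
    {
        var visited = new HashSet<string>();
        var queue = new Queue<string>();
        var order = new List<string>();
        queue.Enqueue(rootUrl);
        visited.Add(rootUrl);
        while (queue.Count > 0 && order.Count < maxPages)
        {
            string current = queue.Dequeue();
            order.Add(current);
            foreach (string child in GetChildUrls(current))
            {
                // HashSet.Add returns false if the URL was already seen.
                if (visited.Add(child))
                    queue.Enqueue(child);
            }
        }
        return order;
    }
}
```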
Each request for a URL returns data: the source code of an HTML document. To extract the target data from it, we need to understand the document structure, and different pages have different structures. Without further ado, code is the easiest way for programmers to communicate, so let's go straight to the code.
For HTTP access, WebClient is used here. Create a new Crawler class library and add a Crawler class. The code is as follows:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;

namespace Crawler
{
    /// <summary>
    /// Crawler
    /// </summary>
    public class Crawler
    {
        /// <summary>
        /// Base URI
        /// </summary>
        public string BaseUri { get; set; }

        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="baseUri">base URI</param>
        public Crawler(string baseUri)
        {
            this.BaseUri = baseUri;
        }

        /// <summary>
        /// Constructor
        /// </summary>
        public Crawler() { }

        /// <summary>
        /// Handler that processes the crawled data
        /// </summary>
        public Action<string> SaveHandler;

        /// <summary>
        /// Crawl data
        /// </summary>
        public void Crawl(string targetUri)
        {
            WebClient webClient = new WebClient();

            // Set the base URI
            webClient.BaseAddress = this.BaseUri;

            // Open a stream for the target URI
            StreamReader reader = new StreamReader(webClient.OpenRead(targetUri), Encoding.UTF8);

            // Read the data
            string html = reader.ReadToEnd();

            // Close the stream
            reader.Close();

            // Save the result
            this.Save(html);
        }

        /// <summary>
        /// Save data
        /// </summary>
        /// <param name="html">data</param>
        private void Save(string html)
        {
            SaveHandler(html);
        }
    }
}
```
This is a very simple crawling class. Let's try calling it: create a new console application. The code is as follows:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Crawler;

namespace CrawlerConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            Crawler.Crawler crawler = new Crawler.Crawler("http://www.baidu.com");

            crawler.SaveHandler = new Action<string>(html => Console.WriteLine(html));

            crawler.Crawl("/");
        }
    }
}
```
Then press Ctrl + F5 to run it, and we can see the source code of the crawled Baidu home page printed to the console.
A brief explanation of a few points in the code:
Base URI: the parameter of my Crawl method is an address relative to the base URI; "/" relative to the base URI "http://www.baidu.com" is the home page. Per the MSDN explanation: the BaseAddress property contains a base URI that is combined with relative addresses. When you call a method that uploads or downloads data, the WebClient object combines this base URI with the relative address specified in the method call. If you specify an absolute URI, WebClient does not use the BaseAddress value. To remove a previously set value, set BaseAddress to null (Nothing in Visual Basic) or an empty string ("").
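The combination behavior described above can be demonstrated with the `System.Uri` class, which follows the same URI-combining rules WebClient relies on. This is a small illustrative sketch of my own (the class and method names are invented for the example):

```csharp
using System;

public static class UriCombineDemo
{
    // Combines a base URI with a relative address, mirroring how WebClient
    // combines BaseAddress with the address passed to its methods.
    public static string Combine(string baseUri, string relative)
    {
        // When 'relative' is itself an absolute URI, it wins outright,
        // just as WebClient ignores BaseAddress for absolute URIs.
        return new Uri(new Uri(baseUri), relative).ToString();
    }
}
```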
WebClient.OpenRead():
The OpenRead method creates a Stream instance for reading the contents of the resource specified by the address parameter. This method blocks while opening the stream. To continue executing while waiting for the stream, use one of the OpenReadAsync methods.
If the BaseAddress property is not an empty string ("") and address does not contain an absolute URI, address must be a relative URI that is combined with BaseAddress to form the absolute URI of the requested data. If the QueryString property is not null (Nothing in Visual Basic), it is appended to address.
For more information about WebClient, see MSDN: http://msdn.microsoft.com/zh-cn/library/tt0f69eh(v=vs.100).aspx
That's all for this installment; I'll try my best to write the next one. I've been very busy lately: I'm leaving my job at the end of the month, and I still have six days of work ahead of me...