[Reproduced] Introduction to Abot, a .NET open source web crawler
There are quite a few open-source crawler tools for .NET, and Abot is one of them. Abot is an open source .NET crawler that is fast, easy to use, and extensible. The project address is https://code.google.com/p/abot/
For analyzing the crawled HTML, the tool used here is CsQuery. CsQuery can be thought of as a jQuery implemented in .NET: it lets you work with HTML pages in much the same way you would with jQuery. CsQuery's project address is https://github.com/afeiship/CsQuery
I. Configuration of the Abot crawler
1. Configuring through property settings
First create a CrawlConfiguration object, then set its properties:
CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");
2. Configuring via app.config
The configuration is read directly from the configuration file, and individual properties can still be modified afterwards:
CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
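For reference, here is a minimal sketch of what the corresponding app.config section might look like. The element and attribute names follow the abot sample configuration from memory, so treat them as assumptions to verify against the Abot version you are using:

<!-- minimal sketch of the abot configuration section; attribute names assumed from the abot sample app.config -->
<configuration>
  <configSections>
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot" />
  </configSections>
  <abot>
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      crawlTimeoutSeconds="100"
      userAgentString="abot v1.0 http://code.google.com/p/abot" />
    <politeness
      isRespectRobotsDotTextEnabled="false"
      minCrawlDelayPerDomainMilliSeconds="0" />
    <extensionValues>
      <add key="SomeCustomConfigValue1" value="1111" />
      <add key="SomeCustomConfigValue2" value="2222" />
    </extensionValues>
  </abot>
</configuration>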
3. Applying the configuration to the crawler object
// either create the crawler with the default configuration:
PoliteWebCrawler crawler = new PoliteWebCrawler();

// or pass in the configuration built above; the null arguments tell Abot to use its default implementations
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null, null);
II. Using the crawler and registering events
The crawler exposes four main events: page crawl starting, page crawl completed, page crawl disallowed, and page links crawl disallowed.
Here is the sample code:
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;        // a single page crawl is starting
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;      // a single page crawl has completed
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;           // the page is not allowed to be crawled
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed; // the links on the page are not allowed to be crawled

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which is found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}
III. Attaching additional objects to the crawler
Abot seems to have borrowed the ViewBag idea from ASP.NET MVC: the crawler object carries a crawl-level CrawlBag, and each page carries a page-level PageBag.
PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();    // object-level CrawlBag
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
...

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    // get the objects stored in the CrawlBag
    CrawlContext context = e.CrawlContext;
    context.CrawlBag.MyFoo1.Bar();    // use the CrawlBag
    context.CrawlBag.MyFoo2.Bar();

    // use the page-level PageBag
    e.PageToCrawl.PageBag.Bar = new Bar();
}
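Note that Foo and Bar in the snippet above are placeholder classes from the example, not Abot types; a minimal hypothetical stub like the following is enough for the snippet to compile:

// hypothetical placeholder types used only so the CrawlBag/PageBag example compiles
public class Foo
{
    public void Bar()
    {
        Console.WriteLine("Foo.Bar() called via the CrawlBag");
    }
}

public class Bar
{
}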
IV. Starting the crawler
Starting the crawler is very simple: call the Crawl method and pass in the start page URI.
CrawlResult result = crawler.Crawl(new Uri("http://localhost:1111/"));

if (result.ErrorOccurred)
    Console.WriteLine("Crawl of {0} completed with error: {1}", result.RootUri.AbsoluteUri, result.ErrorException.Message);
else
    Console.WriteLine("Crawl of {0} completed without error.", result.RootUri.AbsoluteUri);
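If a long crawl needs to be stopped early, the Crawl method also has an overload that takes a CancellationTokenSource (assumed here from the Abot 1.x API; check the version you are using). A minimal sketch:

// sketch: cancelling a long-running crawl
// assumes the Crawl(Uri, CancellationTokenSource) overload from Abot 1.x
CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
cancellationTokenSource.CancelAfter(TimeSpan.FromMinutes(10));   // request cancellation after 10 minutes

CrawlResult result = crawler.Crawl(new Uri("http://localhost:1111/"), cancellationTokenSource);

if (result.ErrorOccurred)
    Console.WriteLine("Crawl of {0} completed with error: {1}", result.RootUri.AbsoluteUri, result.ErrorException.Message);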
V. Introduction to CsQuery
In the PageCrawlCompletedAsync event, e.CrawledPage.CsQueryDocument is a CsQuery object.
Here is a quick look at the advantage CsQuery offers when analyzing HTML:
Cqdocument.select (". Bigtitle > H1")
The selector here works exactly like jQuery's: this one selects the h1 tags under elements with the class .bigtitle. If you are already comfortable with jQuery, getting started with CsQuery is quick and easy.
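As a slightly fuller illustration, the handler below is a sketch of using the CsQuery document inside the PageCrawlCompletedAsync handler; the .bigtitle > h1 selector and the link extraction are only assumptions about the pages being crawled:

// sketch: working with the CsQuery document in the PageCrawlCompletedAsync handler
// requires the CsQuery namespace for the CQ and IDomObject types
void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CQ cqDocument = e.CrawledPage.CsQueryDocument;

    // same selector syntax as jQuery
    string title = cqDocument.Select(".bigtitle > h1").Text();
    Console.WriteLine("Title: {0}", title);

    // enumerate every link on the page
    foreach (IDomObject link in cqDocument.Select("a[href]"))
    {
        Console.WriteLine("Link: {0}", link.GetAttribute("href"));
    }
}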