.NET Open-Source Web Crawler Abot: An Introduction

Source: Internet
Author: User

There are many open-source crawler tools for .NET, and Abot is one of them. Abot is an open-source .NET crawler that is fast, easy to use, and extensible. The project address is https://code.google.com/p/abot/

To analyze the crawled HTML, the tool used here is CsQuery. CsQuery can be thought of as a jQuery implemented in .NET: it lets you work with HTML pages in much the same way as jQuery. CsQuery's project address is https://github.com/afeiship/CsQuery

I. Configuration of the Abot crawler

1. Configure via property settings

First create the configuration object, then set its various properties:

CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");

2. Configure via app.config

The configuration is read directly from the configuration file, but individual properties can still be overridden afterwards:

CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
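For reference, the corresponding section in app.config looks roughly like the following. This is a trimmed sketch: the element and attribute names follow Abot 1.x's AbotConfigurationSectionHandler, and the values shown are examples, not requirements.

```xml
<configuration>
  <configSections>
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot" />
  </configSections>
  <abot>
    <crawlBehavior maxConcurrentThreads="10"
                   maxPagesToCrawl="1000"
                   crawlTimeoutSeconds="100"
                   userAgentString="abot v1.0 http://code.google.com/p/abot" />
    <politeness isRespectRobotsDotTextEnabled="true"
                minCrawlDelayPerDomainMilliSeconds="1000" />
  </abot>
</configuration>
```

LoadFromXml() parses this section and Convert() turns it into the same CrawlConfiguration object you would otherwise build by hand.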

3. Apply the configuration to the crawler object

PoliteWebCrawler crawler = new PoliteWebCrawler();
// or pass the configuration explicitly; each null tells Abot to use
// its default implementation for that dependency
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null, null);
II. Using the crawler and registering events

The crawler exposes four main events: page crawl starting, page crawl completed, page crawl disallowed, and page links crawl disallowed.

Here is the sample code:

crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;        // single page crawl starting
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted;      // single page crawl completed
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed;           // page not allowed to be crawled
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed; // links in the page not allowed to be crawled

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("About to crawl link {0} which is found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;

    if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
        Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
    else
        Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);

    if (string.IsNullOrEmpty(crawledPage.Content.Text))
        Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);
}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
    CrawledPage crawledPage = e.CrawledPage;
    Console.WriteLine("Did not crawl the links in page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
    PageToCrawl pageToCrawl = e.PageToCrawl;
    Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}

III. Attaching custom objects to the crawler

Abot appears to have borrowed the ViewBag idea from ASP.NET MVC: it gives the crawler object a crawler-level CrawlBag and a page-level PageBag.

PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo();   // crawler-level CrawlBag
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
...

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
    // get the objects stored in the CrawlBag
    CrawlContext context = e.CrawlContext;
    context.CrawlBag.MyFoo1.Bar();     // using the CrawlBag
    context.CrawlBag.MyFoo2.Bar();

    // using the page-level PageBag
    e.PageToCrawl.PageBag.Bar = new Bar();
}
IV. Starting the crawler
Starting the crawler is very simple: call the Crawl method and pass in the start page.
CrawlResult result = crawler.Crawl(new Uri("http://localhost:1111/"));

if (result.ErrorOccurred)
    Console.WriteLine("Crawl of {0} completed with error: {1}", result.RootUri.AbsoluteUri, result.ErrorException.Message);
else
    Console.WriteLine("Crawl of {0} completed without error.", result.RootUri.AbsoluteUri);
V. Introducing CsQuery

In the PageCrawlCompletedAsync event handler, e.CrawledPage.CsQueryDocument is a CsQuery object.

Here is a look at the advantages of CsQuery for analyzing HTML:

cqDocument.Select(".bigtitle > h1")

The selector here is exactly the same as in jQuery: it selects the h1 elements that are direct children of elements with class .bigtitle. If you can already use jQuery, getting started with CsQuery is quick and easy.
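As a fuller sketch of this, the following small program runs CsQuery against an inline HTML fragment. The markup and class name are made up for illustration; inside the crawler you would use e.CrawledPage.CsQueryDocument instead of CQ.Create.

```csharp
using System;
using CsQuery;

class CsQueryDemo
{
    static void Main()
    {
        // Parse an HTML fragment directly (stand-in for a crawled page)
        CQ doc = CQ.Create("<div class='bigtitle'><h1>Hello Abot</h1></div>");

        // jQuery-style selector: h1 elements directly under .bigtitle
        foreach (IDomObject h1 in doc.Select(".bigtitle > h1"))
            Console.WriteLine(h1.InnerText);

        // Text extraction also mirrors jQuery
        string title = doc[".bigtitle > h1"].Text();
        Console.WriteLine(title);
    }
}
```

The indexer form doc[".selector"] and the Select method are equivalent; both return a CQ object that can be iterated or queried further, just like a jQuery result set.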
