Choosing a .NET Crawler Framework

Source: Internet
Author: User

In my view, a crawler framework has two parts: a fetching framework and a parsing framework.

1) Fetching framework

.NET does not seem to have many good options on the market. There are two approaches: 1. lightweight and 2. heavyweight.

1. The lightweight approach lets you customize special features yourself or switch them on and off like plug-ins. Overall performance is high and it is fast.

WebClient, HttpWebRequest, and HttpClient. Or write it directly on top of sockets!
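As a rough illustration of the lightweight approach, here is a minimal sketch using HttpClient with a few switches turned on through HttpClientHandler. The class name `LightFetcher`, its method names, the 15-second timeout, and the User-Agent string are all made up for this example; only the HttpClient and HttpClientHandler members are real .NET APIs.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not taken from any framework named in this article.
public static class LightFetcher
{
    // Builds an HttpClient with the switches a crawler commonly wants.
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler
        {
            CookieContainer = cookies,   // cookies persist across requests
            UseCookies = true,
            AllowAutoRedirect = true,    // follow 301/302 responses
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        // Assumed timeout and User-Agent values; adjust for the target site.
        var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(15) };
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (compatible; ExampleCrawler/1.0)");
        return client;
    }

    // Downloads a page body as text; throws on non-success status codes.
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        using (var response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

A caller would create one client with `LightFetcher.CreateClient(new CookieContainer())` and reuse it for every request, since creating a new HttpClient per request wastes connections. This sketch does not execute JavaScript, which is where the heavyweight approach comes in.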

2. The heavyweight approach essentially runs in browser mode. It is closer to foolproof and gets around most anti-crawler mechanisms.

For example, the WebBrowser control, or other .NET frameworks that wrap the WebKit browser engine.


Special fetching features include: cookie support (by default), automatic following of 301 redirects, HTTPS support by default, gzip and other compression supported by default, automatic encoding detection using several methods, browser-like request headers by default, and simulated CSS and JS execution.
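Of the features listed, encoding detection is the one you usually have to write yourself when using the lightweight classes. The sketch below shows one possible order of checks: byte-order mark, then the charset from the Content-Type header, then a `<meta>` tag in the first bytes of the page, then a UTF-8 fallback. The class name `CharsetSniffer`, the 2048-byte probe window, and the check order are my assumptions, not something the frameworks above are documented to do.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class CharsetSniffer
{
    // Matches both <meta charset="x"> and <meta ... content="text/html; charset=x">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]+charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static Encoding Detect(string headerCharset, byte[] body)
    {
        // 1) A UTF-8 byte-order mark wins over everything else.
        if (body.Length >= 3 && body[0] == 0xEF && body[1] == 0xBB && body[2] == 0xBF)
            return Encoding.UTF8;

        // 2) Charset declared in the HTTP Content-Type header, if usable.
        Encoding fromHeader = TryGet(headerCharset);
        if (fromHeader != null) return fromHeader;

        // 3) Charset declared in a <meta> tag near the top of the document.
        //    ASCII decoding is enough here because the tag itself is ASCII.
        int probeLength = Math.Min(body.Length, 2048); // assumed probe window
        string head = Encoding.ASCII.GetString(body, 0, probeLength);
        Match m = MetaCharset.Match(head);
        if (m.Success)
        {
            Encoding fromMeta = TryGet(m.Groups[1].Value);
            if (fromMeta != null) return fromMeta;
        }

        // 4) Fallback.
        return Encoding.UTF8;
    }

    // Returns null for missing, unknown, or unsupported charset names.
    // Note: on .NET Core / .NET 5+, names such as "gb2312" resolve only after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
    private static Encoding TryGet(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim().Trim('"')); }
        catch (ArgumentException) { return null; }
    }
}
```

To use it, read the response as bytes rather than as a string, pass the header charset (or null) together with the bytes to `CharsetSniffer.Detect`, and decode with the returned encoding.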

Of course, the more powerful the feature set, the worse the performance, but the better it copes with different situations (such as anti-crawler measures). The lightweight and heavyweight approaches suit different crawling scenarios.


Technical options:

HttpHelper

ScrapingBrowser in ScrapySharp

A simple wrapper around .NET's HttpWebRequest

A simple wrapper around .NET's WebClient


2) Parsing framework

Old technique: regular expressions
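For comparison with the newer libraries, here is a minimal sketch of the regular-expression approach: pulling `href` values out of anchor tags. The class name `RegexLinks` and the pattern are written for this example, and the pattern only handles quoted attribute values.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class RegexLinks
{
    // Captures the quoted href value of each <a ...> tag.
    // Unquoted values and malformed markup are not handled.
    private static readonly Regex Href = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

Patterns like this break easily when the markup changes, which is the main reason the parser-based libraries below tend to be more convenient.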

Newer options: ScrapySharp, HtmlAgilityPack, CsQuery, and many more.

ScrapySharp: built as an extension of HtmlAgilityPack and very easy to use. (It supports CSS selectors, so it is quick to pick up.)

http://www.cnblogs.com/arxive/p/7075306.html
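A minimal sketch of the CSS-selector style, assuming the ScrapySharp and HtmlAgilityPack NuGet packages are installed. `CssSelect` is the extension method from `ScrapySharp.Extensions`; the class name `CssTitles` and the selector `div.post h2 a` are invented for this example and depend entirely on the page being scraped.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

// Hypothetical helper for illustration only.
public static class CssTitles
{
    // Returns the text of every link inside an <h2> within <div class="post">.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .CssSelect("div.post h2 a")   // assumed page structure
                  .Select(a => a.InnerText.Trim())
                  .ToList();
    }
}
```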

 

HtmlAgilityPack: easy to use, but you need to write some processing logic yourself when using it. (It supports XPath, so it is quick to pick up.)

Search Baidu; there is plenty of material.
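A minimal sketch of the XPath style, assuming the HtmlAgilityPack NuGet package is installed. It also shows one example of the extra handling involved (my reading; the article does not say which handling it means): `SelectNodes` returns null rather than an empty collection when nothing matches. The class name `XPathLinks` is made up for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class XPathLinks
{
    // Returns (href, link text) pairs for every anchor that has an href attribute.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches, so guard before iterating.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return result;

        foreach (HtmlNode a in nodes)
        {
            string href = a.GetAttributeValue("href", "");
            // InnerText keeps entities such as &amp; so decode them explicitly.
            string text = HtmlEntity.DeEntitize(a.InnerText).Trim();
            result.Add(new KeyValuePair<string, string>(href, text));
        }
        return result;
    }
}
```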

 

CsQuery: there seems to be a bug in its Chinese support. Chinese characters came out garbled when fetching HTML, and I don't know why. (It supports jQuery-style selectors, so it is quick to pick up.)

https://github.com/jamietre/CsQuery
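A minimal sketch of the jQuery-style syntax, assuming the CsQuery NuGet package is installed. It decodes the downloaded bytes into a string before handing them to CsQuery, so that the encoding is chosen explicitly. Whether this avoids the garbling described above is only a guess, since the cause is unknown. The class name `CsQueryTitle` is made up for this example.

```csharp
using System.Text;
using CsQuery;

// Hypothetical helper for illustration only.
public static class CsQueryTitle
{
    // Decodes the raw page bytes with the given encoding, then reads the <title> text.
    public static string Extract(byte[] body, Encoding encoding)
    {
        string html = encoding.GetString(body);   // decode explicitly, outside CsQuery
        CQ dom = CQ.Create(html);
        return dom["title"].Text().Trim();        // jQuery-style selector
    }
}
```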

 
