Technical Selection of .NET Crawler Frameworks
Personally, I think a crawler framework splits into two parts: the fetching framework and the parsing framework.
1) Fetching frameworks
There do not seem to be many good ready-made options for .NET on the market. Broadly there are two approaches: 1. lightweight and 2. heavyweight.
1. Lightweight: you can customize special features or toggle plug-ins yourself; overall performance is high and speed is fast.
For example WebClient, HttpWebRequest, and HttpClient, or write directly against sockets!
2. Heavyweight: an embedded browser serves as the base. This is more turnkey and defeats most anti-crawler mechanisms out of the box.
For example, the WebBrowser control, or .NET wrappers around the WebKit browser kernel.
Useful fetching features include: cookie support (on by default), automatic 301/302 redirect following, HTTPS support by default, gzip and other compression handled by default, automatic character-encoding detection via multiple strategies, browser headers simulated by default, and simulated CSS and JS execution.
Of course, the more powerful the features, the worse the performance; on the other hand, the better the framework adapts to difficult targets (anti-anti-crawler capability). Lightweight and heavyweight approaches therefore suit different crawling scenarios.
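Several of the "default" features listed above can be switched on explicitly with the lightweight classes. A minimal sketch using HttpClient and HttpClientHandler (the URL and User-Agent string are placeholders, not from the original post):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class LightFetcher
{
    static async Task Main()
    {
        // Configure the handler: cookie jar, automatic redirect following,
        // and transparent gzip/deflate decompression.
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            AllowAutoRedirect = true,   // follows 301/302 automatically
            AutomaticDecompression = DecompressionMethods.GZip
                                   | DecompressionMethods.Deflate
        };

        using var client = new HttpClient(handler);
        // Simulate a browser User-Agent header (example value).
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        string html = await client.GetStringAsync("https://example.com/");
        Console.WriteLine(html.Length);
    }
}
```

What you do not get from this setup is encoding auto-detection or CSS/JS execution; those are where the heavyweight, browser-based option earns its overhead.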
Technical options:
HttpHelper
ScrapingBrowser in ScrapySharp
A simple wrapper around .NET's HttpWebRequest
A simple wrapper around .NET's WebClient
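The last two options amount to a thin helper over the built-in classes. A sketch of what such a "simple wrapper" might look like around WebClient (the method name, User-Agent value, and default encoding are my assumptions):

```csharp
using System.Net;
using System.Text;

static class SimpleDownloader
{
    // Minimal WebClient wrapper: browser-like User-Agent plus an explicit
    // response encoding, since WebClient does not auto-detect charsets well.
    public static string GetHtml(string url, Encoding encoding = null)
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.UserAgent] =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
            client.Encoding = encoding ?? Encoding.UTF8;
            return client.DownloadString(url);
        }
    }
}
```

A wrapper around HttpWebRequest would look similar but gives finer control over timeouts, redirects, and the cookie container.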
2) Parsing frameworks
Old approach: regular expressions
Newer approaches: ScrapySharp, HtmlAgilityPack, CsQuery, and so on (there are many more)
ScrapySharp: an extension built on HtmlAgilityPack; very easy to use. (Supports CSS selectors, so it is quick to pick up.)
http://www.cnblogs.com/arxive/p/7075306.html
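A quick sketch of ScrapySharp in use, combining its ScrapingBrowser fetcher with CSS-selector extraction (the URL and the `a` selector are placeholders):

```csharp
using System;
using ScrapySharp.Extensions;   // CssSelect extension method
using ScrapySharp.Network;      // ScrapingBrowser, WebPage

class ScrapySharpDemo
{
    static void Main()
    {
        // ScrapingBrowser manages cookies and headers for you.
        var browser = new ScrapingBrowser();
        WebPage page = browser.NavigateToPage(new Uri("https://example.com/"));

        // jQuery-style CSS selection over the parsed HtmlAgilityPack nodes.
        foreach (var link in page.Html.CssSelect("a"))
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```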
HtmlAgilityPack: easy to use, though you have to write some of the traversal logic yourself. (Supports XPath, so it is quick to pick up.)
Search on Baidu; there is plenty of material about it.
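A minimal HtmlAgilityPack example using its built-in HtmlWeb downloader and an XPath query (URL and XPath expression are illustrative):

```csharp
using System;
using HtmlAgilityPack;

class HapDemo
{
    static void Main()
    {
        var web = new HtmlWeb();                       // built-in fetcher
        HtmlDocument doc = web.Load("https://example.com/");

        // XPath-based extraction. Note: SelectNodes returns null (not an
        // empty list) when nothing matches, so guard before iterating.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (HtmlNode a in links)
                Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}
```

The null-on-no-match behavior of SelectNodes is one of the small quirks you have to handle yourself, as mentioned above.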
CsQuery: it seems to have a bug with Chinese support; Chinese characters come back garbled when fetching HTML, and I don't know why. (Supports jQuery-style selectors, so it is quick to pick up.)
https://github.com/jamietre/CsQuery
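One possible workaround for the garbled-Chinese issue (untested against the bug described above): download the raw bytes yourself, decode with the page's actual charset, and hand CsQuery a ready-made string. The URL and the UTF-8 assumption are placeholders.

```csharp
using System;
using System.Net;
using System.Text;
using CsQuery;

class CsQueryDemo
{
    static void Main()
    {
        // Fetch raw bytes and decode explicitly, bypassing any charset
        // detection in the library. Match the Encoding to the real page.
        byte[] raw;
        using (var client = new WebClient())
            raw = client.DownloadData("https://example.com/");
        string html = Encoding.UTF8.GetString(raw);

        CQ dom = CQ.Create(html);
        foreach (var a in dom["a"])        // jQuery-style selector
            Console.WriteLine(a.GetAttribute("href"));
    }
}
```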