Crawlers have been a hot topic on the blog garden (cnblogs) lately: from PHP to Python, from Windows services to WinForm programs, the experts have all shared their approaches. Let me humbly add my own attempt, coming at it from a plain "mediocre technology flow" angle, and briefly describe a web crawler implemented with WebApi + AngularJS.
1. Technical framework
1.1 Front end:
AngularJS, used to build an SPA (single page application). A crawler has to wait quite a while for the server to return results, so the calls must be made via AJAX; jQuery could do this just as well.
1.2 Back end:
WebApi. AngularJS and WebApi work together very smoothly.
1.3 Frameworks/Libraries used in backend:
A. ABP, the application framework that has been all the rage on the blog garden recently. Its biggest advantage is modularization, including splitting the AppService (WebApi) layer by module; once adopted, the code structure becomes very clear. For details, see the article list for the series "A modern ASP.NET development framework based on DDD: ABP". A minimal sketch of an ABP application service follows this list.
B. The C# HttpHelper universal framework from the Crossbred Forum, used to fetch HTML pages; it can request pages directly through an "HTTP proxy", which is very important here! (A sketch of its use appears in section 2.2 below.)
One thing to note: the framework is not free; you have to be a paid annual member to download it.
C. Ivony's Jumony library. From the project description: "Jumony Core first provides a near-perfect HTML parsing engine whose results come infinitely close to what a browser would produce. CSS3 selectors are supported."
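To make item A more concrete, here is a minimal sketch of what a crawler application service could look like under ABP, which ABP's dynamic WebApi layer then exposes to the AngularJS front end. The names (CrawlerAppService, FetchInput and so on) are my own illustration, not the actual code behind this post.

```csharp
using System.Threading.Tasks;
using Abp.Application.Services;

// Any class implementing IApplicationService can be exposed by ABP's
// dynamic WebApi layer, so the AngularJS SPA can call it via AJAX.
public interface ICrawlerAppService : IApplicationService
{
    Task<FetchResult> FetchArticleListPage(FetchInput input);
}

public class CrawlerAppService : ApplicationService, ICrawlerAppService
{
    public async Task<FetchResult> FetchArticleListPage(FetchInput input)
    {
        // The real implementation would: pick a proxy from the pool (section 2.1),
        // fetch the page through it with HttpHelper, parse it with Jumony (section 2.2),
        // and save any new article links. Stubbed out here.
        return await Task.FromResult(new FetchResult { Success = true, NewArticles = 0 });
    }
}

public class FetchInput  { public int PageIndex { get; set; } }
public class FetchResult { public bool Success { get; set; } public int NewArticles { get; set; } }
```

Once registered with ABP's dynamic WebApi builder, a service like this becomes reachable at a URL along the lines of /api/services/app/crawler/FetchArticleListPage, which is what the AngularJS controllers call.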
2. Implementation
2.1. Crawl free HTTP proxy addresses
A quick Baidu search turns up plenty of websites that publish free HTTP proxies. The first step is to crawl these free proxies into your own proxy pool; steps two and three then use them. While using them, record each proxy's availability and apply a policy: after too many failures, the proxy is rejected.
Of course, if you have money to spend you can simply buy proxies, which are far more stable.
Here is a list of the proxies I crawled:
Banned proxies (my policy: once a proxy's failures exceed its successes by more than 3, it is discarded):
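To make the availability policy concrete, here is a minimal sketch of one proxy-pool entry, assuming the "discard when failures outnumber successes by more than 3" reading of the policy above. ProxyRecord and its members are hypothetical names for illustration, not the post's actual code.

```csharp
// One entry in the proxy pool. Every fetch reports back whether the proxy
// worked, and the ban policy drops proxies that keep failing.
public class ProxyRecord
{
    public string Ip { get; set; }          // e.g. "123.57.x.x:8080"
    public int SuccessCount { get; private set; }
    public int FailCount { get; private set; }

    // Ban policy: failures exceed successes by more than 3 => discard.
    public bool IsBanned { get { return FailCount - SuccessCount > 3; } }

    public void Report(bool success)
    {
        if (success) SuccessCount++;
        else FailCount++;
    }
}
```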
2.2. Read the list of articles (single thread)
With a large enough pool of HTTP proxies, you can start crawling pages.
Back-end implementation: use HttpHelper to fetch the web page, then use Jumony to parse the page content, and record the success/failure of each HTTP proxy along the way (a sketch follows below).
Front-end control flow: based on the result of the proxied fetch, decide whether the crawl succeeded. If it succeeded, move on to the next page; if it failed, retry the current page with a different proxy.
Because the article list is not very large, a single-threaded crawl is enough.
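Here is a rough sketch of the back-end step described above, assuming HttpHelper's HttpItem/GetHtml interface and Jumony's JumonyParser/Find API. The CSS selector, class and method names are placeholders for illustration (ProxyRecord is the sketch from section 2.1), not the post's actual code.

```csharp
using System.Linq;
using System.Net;
using Ivony.Html;
using Ivony.Html.Parser;
// plus the namespace of the HttpHelper build you downloaded

public class ArticleListCrawler
{
    // Fetch one page of the article list through the given proxy,
    // record the proxy's success/failure, and return the article links found.
    public string[] FetchArticleLinks(string pageUrl, ProxyRecord proxy)
    {
        var http = new HttpHelper();
        var item = new HttpItem
        {
            URL = pageUrl,
            ProxyIp = proxy.Ip,   // go out through the free HTTP proxy
            Timeout = 10000
        };

        HttpResult result = http.GetHtml(item);
        bool ok = result.StatusCode == HttpStatusCode.OK
                  && !string.IsNullOrEmpty(result.Html);
        proxy.Report(ok);                     // feeds the ban policy from 2.1
        if (!ok) return new string[0];

        // Parse the HTML with Jumony and pull article links with a CSS selector.
        IHtmlDocument document = new JumonyParser().Parse(result.Html);
        return document.Find("a.article-title")          // placeholder selector
                       .Select(a => a.Attribute("href").Value())
                       .ToArray();
    }
}
```

The front end then just loops: request page N, and on failure request the same page again with a different proxy, as described in the control flow above.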
2.3. Read the article content (multi-threaded)
After step two has collected a large number of unread article links, the next task is to crawl the article content. Because of the volume, this is done with multiple threads.
The so-called multi-threading here simply means firing several WebApi requests at the same time via AJAX and watching for the returned results.
After clicking "Start reading":
After clicking "Stop reading":
3. Postscript
There is nothing advanced here; the "advanced" parts are all handled by AngularJS, ABP, HttpHelper and Jumony, which is why I call it a mediocre-technology-flow implementation.
That's all.