The previous posts covered some basic crawler concepts and what crawlers actually need to do, and briefly analyzed the characteristics of vertical crawlers and breadth-first (depth-first) traversal. In this post I will give a simple introduction to the architecture design of a vertical crawler.
1. Basic Requirements for a Vertical Crawler
At present, what enterprises need is basically the vertical crawler. Public sentiment analysis, financial information, content recommendation, and so on are mostly built on vertical crawlers as the enterprise-grade solution; I have already discussed the characteristics of enterprise crawlers in an earlier post. So when you design a vertical crawler architecture, you mainly need to think about the features required to grab your target content. Simply put: the means or functionality needed to extract a given piece of information, for example common JavaScript rendering, Ajax calls, and so on.
To simply list some of the problems I encountered when grabbing data (the iframe case is sketched in code right after this list):
1. Body text injected by JavaScript
2. Body text fetched via Ajax
3. Body text embedded in an iframe
4. Verification codes (CAPTCHAs)
5. Next-page links generated by JavaScript
6. Next-page links fetched via JavaScript + POST
7. Other content fetched via Ajax
8. Login requirements
...and so on.
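To make one of these concrete: when the body is hidden in an iframe, the page you download first contains only the iframe tag, and you must follow its src attribute to reach the real text. A minimal sketch with Jsoup (the URL and the assumption of a single iframe are placeholders, not from any particular site):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class IframeBodyFetcher {
        public static void main(String[] args) throws Exception {
            // Download the outer page first (placeholder URL).
            Document outer = Jsoup.connect("http://www.example.com/article.html").get();

            // The body is not here; find the iframe and resolve its absolute src.
            String src = outer.select("iframe").first().attr("abs:src");

            // Download the iframe target, which holds the actual body text.
            Document inner = Jsoup.connect(src).get();
            System.out.println(inner.body().text());
        }
    }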
These are all problems that need to be considered at the start of crawler design. Although a crawler simply obtains the required data, acquiring that data is often not so simple. The overall design of the framework is therefore very important, even for later development of the crawler: if the framework is not designed reasonably, every newly discovered problem forces you either to rebuild the crawler or to write a one-off crawler aimed at that specific problem, and neither approach is desirable.
2. Vertical Crawler Framework
As mentioned earlier, a vertical crawler grabs its data through link discovery, link management, and link download. The crawler can therefore be provisionally divided into three small modules: 1. link discovery 2. link management 3. link download.
Link Discovery:
Link discovery simply means that once a page has been downloaded, the required links are extracted from it according to certain rules, such as XPath or DOM traversal. XPath is the most common; in the Java world there are parsing packages such as Jsoup to extract the content. Of course it is far from that simple, and Jsoup alone is not enough for every case, but what we ultimately want is to obtain, from a page, the links we wish to continue downloading. This module must therefore sit after the link download module: new links are parsed only once a download has completed. (And do not tell me there is then nothing to download; at the very beginning you have to write in an entry URL by hand. For example, http://www.baidu.com/search/url_submit.htm is Baidu's URL-submission page, which exists precisely to get a crawler to fetch the submitted page.)
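A minimal sketch of this step with Jsoup: download a page, collect all anchor tags, resolve them to absolute URLs, and keep only those matching a simple rule (the /news/ filter here is a placeholder for whatever rule your target site actually needs):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayList;
    import java.util.List;

    public class LinkDiscovery {
        // Extract candidate links from a downloaded page.
        public static List<String> discover(String pageUrl) throws Exception {
            Document doc = Jsoup.connect(pageUrl).get();
            List<String> links = new ArrayList<>();
            for (Element a : doc.select("a[href]")) {
                String url = a.attr("abs:href"); // resolve relative links to absolute URLs
                if (url.contains("/news/")) {    // placeholder rule: keep only article links
                    links.add(url);
                }
            }
            return links;
        }
    }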
Link Download:
Well, this is the main point of the crawler and a very important place. The first requirement is to download the correct content, in the face of cases such as: the body hidden in an iframe, the body injected by JavaScript, the body requiring transcoding, and so on. Of course, normal sites are still the majority, and for the simple case we just need to open the page the normal way, e.g. with HttpClient or HtmlParser, or even fetch the content directly with Jsoup:
    Document doc = Jsoup.connect("http://www.example.com").get(); // placeholder URL
Of course, this only gets the source file, with no JavaScript execution and so on, but at the very least we have successfully acquired the content of the web page. Simple, yet it works, so the download module can be implemented. Considering the link download and link discovery steps together: we feed in a URL, the link download module downloads it, the parsing module then extracts the links we want to keep downloading, and rules such as XPath or DOM selectors pick out what we need while culling spam links, advertising links, and other useless links. The surviving links then join the link management module.
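For reference, the same simple download can also be done with Apache HttpClient (4.x), which gives finer control over headers, timeouts, and redirects than a bare Jsoup call. A rough sketch (the User-Agent value is just an illustration):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class LinkDownloader {
        // Fetch the raw HTML of one URL; no JavaScript is executed, only the source is returned.
        public static String download(String url) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpGet get = new HttpGet(url);
                get.setHeader("User-Agent", "Mozilla/5.0"); // some sites reject the default agent
                try (CloseableHttpResponse resp = client.execute(get)) {
                    return EntityUtils.toString(resp.getEntity(), "UTF-8");
                }
            }
        }
    }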
Link Management:
The link management module is very important in the whole vertical crawler project; its efficiency and accuracy are crucial factors. Once the crawler starts crawling, all links downloaded during the crawler's life cycle need to be managed by the link management module; there can be a simple life-cycle (in-memory) version and a persistent version. Link management needs to implement two functions: 1. deduplication 2. judgment.
Deduplication has been discussed before: if links are not deduplicated within the life cycle, the crawler may well die on a single site, fetching the same pages over and over without ever exiting. Deduplication is therefore the first question to consider; it lets the crawler determine whether a newly acquired URL has already been grabbed, and hence whether the next crawl of it is required.
Judgment is business logic: deciding whether a URL needs to be crawled again. For example, for a key site you may want its links re-crawled even after a first pass, while for a non-key site with a low update frequency you may give up re-crawling and devote the resources to frequently updated sites. So link management must deduplicate and judge a link the moment it receives it; that is how new links found by link discovery are added to the link management module.
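A minimal life-cycle (in-memory) version of such a link manager, with a set for deduplication and a placeholder rule standing in for the business "judgment" (a persistent version would back this with a database or a Bloom filter instead):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class LinkManager {
        // Dedup store: lives only for the crawler's life cycle; a persistent
        // version would keep this in a database or a Bloom filter instead.
        private final Set<String> seen = ConcurrentHashMap.newKeySet();

        // Deduplication: returns true only the first time a URL is offered.
        public boolean offer(String url) {
            return seen.add(url);
        }

        // Judgment: business rule deciding whether a known URL should be
        // crawled again. Placeholder rule: re-crawl list/index pages only.
        public boolean shouldRecrawl(String url) {
            return url.contains("/list/");
        }
    }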
In this way, the basic structure of our crawler comes out:
Link discovery --- link management --- link download --- link discovery, looping round and round.
(I will not draw a diagram; I hope you can picture it!)
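In place of a diagram, here is a rough end-to-end sketch of the loop (a minimal in-memory version using Jsoup for both download and parsing; the entry URL and the same-site rule are placeholders):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    public class VerticalCrawler {
        public static void main(String[] args) throws Exception {
            Deque<String> queue = new ArrayDeque<>(); // link management: URLs waiting to be downloaded
            Set<String> seen = new HashSet<>();       // link management: deduplication
            String entry = "http://www.example.com/"; // placeholder entry URL
            queue.add(entry);
            seen.add(entry);

            while (!queue.isEmpty()) {
                String url = queue.poll();
                Document doc = Jsoup.connect(url).get();            // link download
                // ... extract the body / data you actually need from doc here ...
                for (Element a : doc.select("a[href]")) {           // link discovery
                    String next = a.attr("abs:href");
                    if (next.startsWith(entry) && seen.add(next)) { // same-site rule + dedup
                        queue.add(next);                            // hand back to link management
                    }
                }
            }
        }
    }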