The story starts with a technical manager at Ctrip, the travel site, who boasted that with his ultra-high IQ he could perfectly crush crawler developers. As an amateur crawler development enthusiast, I certainly couldn't let a statement like that slide. And so this advanced crawler development tutorial, the follow-up to my basic crawler, was born.
Some people commented on my previous simple crawler that the code was pathetically weak. I have to wonder whether you're from Ctrip: I hadn't even finished writing the series, so how do you already know it's weak? Apparently you won't be satisfied until I bring out something meatier!
Today we'll move on to advanced crawler development, and we'll reuse the earlier simple crawler as the link-master part of a distributed crawler, in order to improve the efficiency of distributed crawling.
Here is what we'll cover. It involves quite a few open source tools, but don't be nervous: the more advanced a component is, the better it is usually packaged, so all we need is the mindset of using them together. I'll assume you already have a rough idea of the following tools:
RabbitMQ: used for distributed messaging.
Shadowsocks: used for encrypted proxying.
PhantomJS: used for web page rendering.
Selenium: used for web automation control.
First, what is an advanced crawler?
The advanced crawlers we're talking about here are crawlers that have the running characteristics of a browser and rely on the support of third-party libraries or tools, such as the following common ones:
WebKit
WebBrowser
PhantomJS + Selenium
Many people think a distributed crawler counts as an advanced crawler. That is definitely a misunderstanding: distribution is just a means we use in the crawler's architecture, not what defines it as advanced.
We call these advanced crawler components mainly because they can not only fetch a page's source code, but also render the page's HTML, CSS and JavaScript.
What does such a feature buy us when developing crawlers? Putting it modestly, with not the slightest exaggeration: this thing can be called "unbeatable at crawling"!
You may be skeptical, but thanks to this powerful rendering mechanism we can work directly on the site's live pages: execute JavaScript code, trigger all kinds of mouse and keyboard events, manipulate the page's DOM structure, and extract data with XPath syntax. Almost anything a browser can do, we can do.
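To give a taste of what that looks like in code, here is a small illustrative C# snippet using Selenium with PhantomJS. It is only a sketch: the URL and element locators are placeholders, not taken from any real target site.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Interactions;
using OpenQA.Selenium.PhantomJS;

class BrowserLikeDemo
{
    static void Main()
    {
        // PhantomJS renders the page headlessly, so the full DOM is available to us.
        using (IWebDriver driver = new PhantomJSDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/");

            // Execute JavaScript inside the rendered page.
            var title = ((IJavaScriptExecutor)driver).ExecuteScript("return document.title;");
            Console.WriteLine(title);

            // Simulate a mouse action on a DOM element (placeholder locator).
            var link = driver.FindElement(By.XPath("//a[1]"));
            new Actions(driver).MoveToElement(link).Click().Perform();

            // Read data straight out of the rendered DOM with XPath.
            Console.WriteLine(driver.FindElement(By.XPath("//h1")).Text);
        }
    }
}
```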
Many websites use Ajax to load and page their data dynamically, for example Ctrip's review data. With the earlier simple crawler it is hard to fetch all the review data directly: we would have to wade through piles of JavaScript to find the API data interface, and constantly watch out for the other side adding data traps or changing the API.
With an advanced crawler, you can ignore these problems entirely. No matter how they obfuscate their JavaScript to hide the API, the final data has to end up in the DOM of the page, otherwise ordinary users wouldn't be able to see it either. So we can simply skip parsing the API, extract the data straight from the DOM, and not even write those convoluted regular expressions.
Second, how do you develop an advanced crawler?
Now let's get started on this advanced crawler. We'll use two components to build its basic functionality. First, download the open source components:
PhantomJS: a browser without a UI, mainly used for automated page testing. We'll use its page parsing capability to fetch website content. After downloading and extracting it, copy phantomjs.exe from the bin folder into any folder in your crawler project; that file is all we need.
Download: http://phantomjs.org/download.html
Selenium: an automated testing tool that wraps a set of WebDrivers for communicating with the browser kernel; we call it from our development language to automate PhantomJS. The download page has a lot on it, but we only need the Selenium Client, which supports many languages (C#, Java, Ruby, Python, Node.js); download the one for the language you use.
Download: http://docs.seleniumhq.org/download/
I downloaded the C# client and added all four DLL files to my project references; developers using other languages will have to find their own way. Now let's start our coding journey.
As usual, open Visual Studio 2015 and create a new console application, then add a simple StrongCrawler class. Since the two crawlers share some common parts, and in keeping with the DRY principle, we need to refactor a little; so let's first extract an ICrawler interface:
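The original listing isn't reproduced here, so the following is a minimal sketch of what such an interface might look like. The member names (OnCompleted, StartAsync) are my own illustration, not necessarily the original code.

```csharp
using System;
using System.Threading.Tasks;

namespace CrawlerDemo
{
    // Illustrative contract shared by the simple and the advanced crawler.
    public interface ICrawler
    {
        // Raised with the scraped result once a page has been processed.
        event EventHandler<string> OnCompleted;

        // Asynchronously crawl the given address.
        Task StartAsync(Uri uri);
    }
}
```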
Then we implement this interface in the StrongCrawler class:
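Again, a hedged sketch rather than the original source: the class is declared partial so the crawl method can follow in the next snippet, and the PhantomJS settings shown are just reasonable defaults.

```csharp
using System;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

namespace CrawlerDemo
{
    // Illustrative advanced crawler backed by PhantomJS through Selenium.
    public partial class StrongCrawler : ICrawler
    {
        public event EventHandler<string> OnCompleted;

        public async Task StartAsync(Uri uri)
        {
            await CrawlAsync(uri);   // implemented in the second part of this class
        }

        // Create a headless PhantomJS session; phantomjs.exe must sit in the output folder.
        private IWebDriver CreateDriver()
        {
            var service = PhantomJSDriverService.CreateDefaultService();
            service.HideCommandPromptWindow = true;   // no console window
            service.LoadImages = false;               // skip images for speed
            return new PhantomJSDriver(service, new PhantomJSOptions());
        }
    }
}
```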
Next we write its asynchronous crawler method:
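The sketch below continues the partial class from the previous snippet and follows the flow described later in this article: load the page, click the review tab, wait for the Ajax-driven DOM change, then scrape. The XPath expressions are placeholders, not Ctrip's real page structure.

```csharp
using System;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

namespace CrawlerDemo
{
    public partial class StrongCrawler
    {
        private async Task CrawlAsync(Uri uri)
        {
            await Task.Run(() =>
            {
                using (var driver = CreateDriver())
                {
                    driver.Navigate().GoToUrl(uri);

                    // Hotel details are present in the initial HTML (placeholder XPath).
                    string hotelName = driver.FindElement(By.XPath("//h1[@class='hotel-name']")).Text;

                    // Click the "hotel reviews" tab to fire the Ajax request.
                    driver.FindElement(By.XPath("//a[@id='commentTab']")).Click();

                    // Wait until the review list appears in the DOM, i.e. the Ajax call finished.
                    var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
                    wait.Until(d => d.FindElements(By.XPath("//div[@class='comment_block']")).Count > 0);

                    var comments = driver.FindElements(By.XPath("//div[@class='comment_block']//p"));
                    string summary = hotelName + " - first-page comments: " + comments.Count;

                    OnCompleted?.Invoke(this, summary);
                }
            });
        }
    }
}
```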
With that, the basic functionality of the advanced crawler is defined. Again taking Ctrip's hotel data as the crawling example, we test fetching the details (hotel name, address, rating, price, number of reviews, reviews on the current page, reviews on the next page, total review pages, reviews per page). Now we invoke it from the console program:
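As a rough idea of how the console invocation might look (the hotel URL below is just a placeholder, not a real Ctrip address):

```csharp
using System;

namespace CrawlerDemo
{
    // Minimal console host for the sketches above.
    internal class Program
    {
        private static void Main()
        {
            ICrawler crawler = new StrongCrawler();
            crawler.OnCompleted += (sender, result) => Console.WriteLine(result);

            // Kick off the crawl and block until it finishes.
            crawler.StartAsync(new Uri("http://hotels.ctrip.com/hotel/0000000.html")).Wait();

            Console.WriteLine("Done.");
            Console.ReadKey();
        }
    }
}
```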
As you can see, after waiting for the hotel page to load we locate page elements with XPath syntax: first click the "hotel reviews" button, then wait for the page's DOM structure to change, that is, for the Ajax content to finish loading, and then scrape the required data. Here is the result of running the code:
We easily fetched the hotel's information as well as all the review data on the first page. Because Ctrip's review data is paged via Ajax, to fetch all the reviews we also need to grab the number of review pages. Now look at the execution time:
Pretty good: 484 milliseconds. It is fair to say that among all the advanced crawler components, PhantomJS is probably the most efficient; hardly any other component can compete with it head-on. With the page count in hand, we can fetch the reviews page by page, and at this speed, crawling a few hundred pages of review data doesn't need distribution at all.
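Page-by-page fetching could look something like the sketch below, which continues the illustrative StrongCrawler class: click "next page", wait for the pager to show the new page number (meaning the Ajax call finished), then scrape again. The XPath expressions are again placeholders.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

namespace CrawlerDemo
{
    public partial class StrongCrawler
    {
        // Walk through the remaining Ajax-loaded review pages on an already open driver.
        private void CrawlAllCommentPages(IWebDriver driver, int totalPages)
        {
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

            for (int page = 2; page <= totalPages; page++)
            {
                // Trigger the Ajax request for the next page of reviews.
                driver.FindElement(By.XPath("//a[@class='next_page']")).Click();

                // Wait until the pager reflects the new page number.
                wait.Until(d => d.FindElement(By.XPath("//span[@class='current_page']")).Text.Trim() == page.ToString());

                foreach (var comment in driver.FindElements(By.XPath("//div[@class='comment_block']//p")))
                {
                    Console.WriteLine(comment.Text);
                }
            }
        }
    }
}
```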
Third, how do we make it distributed?
Distributed crawlers are usually built around a message queue. There are plenty of open-source message queues on the internet; today I'll introduce a particularly popular one:
RabbitMQ is an open-source implementation of AMQP, written in Erlang, with clients for a wide range of languages and protocols such as .NET, Python, Ruby, Java, JMS, C, PHP, ActionScript, XMPP and STOMP, and it even supports Ajax. It is used to store and forward messages in distributed systems, and it does very well in terms of ease of use, scalability and high availability.
Download: http://www.rabbitmq.com/download.html
A distributed crawler typically consists of two sides: a control side and a crawler side.
The control side is mainly responsible for controlling the crawlers, monitoring their status and configuring how they crawl. The crawler side's job is to crawl the data and submit it to the data-cleansing service.
The crawler side itself is further split into a master crawler and worker crawlers. The master crawler mainly runs in the simple-crawler mode to harvest hyperlinks at high speed, while the worker crawlers use the advanced crawler's capabilities to capture the refined data, such as Ajax-loaded content. In short, give each job to the crawler best suited to it.
The smart reader will have guessed that they communicate through the message queue: the master crawler simply throws the links it finds into the data-crawl queue; the worker crawlers pull links from that queue, fetch the data, and, once done, push the results into the data-cleansing queue.
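As a minimal illustration of that flow with the official RabbitMQ .NET client (the queue names, connection settings and placeholder URL are my own assumptions for the example), the master publishes a link and a worker consumes it and forwards its result to the cleansing queue:

```csharp
using System;
using System.Linq;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class QueueDemo
{
    const string CrawlQueue = "crawl_links";   // placeholder queue names
    const string CleanQueue = "clean_data";

    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            channel.QueueDeclare(CrawlQueue, durable: true, exclusive: false, autoDelete: false, arguments: null);
            channel.QueueDeclare(CleanQueue, durable: true, exclusive: false, autoDelete: false, arguments: null);

            // Master side: throw a crawled link into the crawl queue.
            var link = Encoding.UTF8.GetBytes("http://hotels.ctrip.com/hotel/0000000.html");
            channel.BasicPublish(exchange: "", routingKey: CrawlQueue, basicProperties: null, body: link);

            // Worker side: pull links, crawl them, push results to the cleansing queue.
            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, ea) =>
            {
                string url = Encoding.UTF8.GetString(ea.Body.ToArray());
                string scraped = "data scraped from " + url;   // the StrongCrawler would run here
                channel.BasicPublish("", CleanQueue, null, Encoding.UTF8.GetBytes(scraped));
                channel.BasicAck(ea.DeliveryTag, multiple: false);
            };
            channel.BasicConsume(CrawlQueue, false, consumer);

            Console.ReadLine();   // keep the worker alive
        }
    }
}
```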
The principle should be clear by now, so go ahead and implement it yourself; the RabbitMQ website has complete official sample code, so I won't belabor it here.
Fourth, how do you set up a stable encrypted proxy?
In this internet age, free things are nearly extinct, and what is free is mostly rubbish. So today I want to talk about Shadowsocks, something that also requires a small payment. Its strong point is that its traffic signature is not obvious, so it can provide a very stable internet proxy.
Download: https://github.com/shadowsocks
The Shadowsocks client opens a SOCKS5 proxy locally. Requests made through this proxy are sent by the client to the server; the server performs the request, receives the response, and sends it back to the client. The data in transit is encrypted with AES-256, which makes it much more secure than an ordinary proxy server. Let's look at how it works:
As the diagram shows, you run the client program locally, and it communicates over an encrypted channel with the server program on the remote proxy server. A proxy port is then exposed locally, so local traffic is encrypted by the local client, transferred to the remote server, and forwarded from there, completing the proxy service.
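To tie this back to the crawler: PhantomJS accepts --proxy and --proxy-type switches, and the Selenium PhantomJSDriverService exposes matching Proxy/ProxyType properties in the versions I've used (if yours differs, the same switches can be passed on the PhantomJS command line). The sketch below assumes the Shadowsocks client is listening on its common default of 127.0.0.1:1080; adjust to your own configuration.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

class ProxyDemo
{
    static void Main()
    {
        // Point PhantomJS at the local Shadowsocks SOCKS5 proxy (assumed port 1080).
        var service = PhantomJSDriverService.CreateDefaultService();
        service.ProxyType = "socks5";
        service.Proxy = "127.0.0.1:1080";
        service.HideCommandPromptWindow = true;

        using (IWebDriver driver = new PhantomJSDriver(service, new PhantomJSOptions()))
        {
            driver.Navigate().GoToUrl("http://example.com/");
            Console.WriteLine(driver.Title);   // traffic went out through the encrypted proxy
        }
    }
}
```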
So we only need to rent a Linux VPS, at a cost of roughly 15 yuan a month, install the server side on it, and we have a very stable encrypted proxy service. There are plenty of tutorials for this online, so I won't go into the details here either.
Fifth, concluding remarks
For various reasons I won't publish the full crawler source code here, but with the examples above you can certainly put together an even more powerful advanced crawler.
Based on C #. NET high-end Intelligent Network Crawler (ii) (Breach Ctrip)