This article walks through a simple web scraping setup implemented in Node.js, using libraries such as PhantomJS and node-phantomjs. Web scraping is a well-known technique, but it still hides a lot of complexity: a simple crawler cannot cope with modern sites built on Ajax, XMLHttpRequest, WebSockets, Flash sockets, and other sophisticated technologies.
Take the basic requirements of our Hubdoc project as an example. We scrape bill amounts, due dates, and account numbers from the websites of banks, utilities, and credit card companies, and, most importantly, the PDFs of recent bills. For this project I initially went with a very simple approach (rather than the expensive commercial products we were evaluating at the time): a simple Perl crawler like the one I had used at MessageLabs/Symantec. It did not go smoothly; spammers build much simpler websites than banks and utilities do.
How do you solve this? I started with Mikeal's excellent request library. Make a request in the browser, inspect the request headers in the Network panel, and copy those headers into code. The process is simple: trace the requests from login through to downloading the PDF, then simulate every request along the way. To make this kind of work easier, and to let web developers write scrapers in a more natural way, I expose the fetched HTML through a jQuery-like interface (using the lightweight cheerio library), which makes it easy to use CSS selectors to pick elements out of a page. The whole thing is wrapped in a framework that also handles extra work, such as fetching credentials from the database, loading individual robots, and communicating with the UI over socket.io.
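To make the pattern concrete, here is a minimal sketch of the request-plus-cheerio combination. This is not the Hubdoc framework itself; the URL and the CSS selector are made up for illustration:

var request = require('request');
var cheerio = require('cheerio');

request.get('https://example-bank.com/statements', function (err, res, body) {
  if (err) throw err;
  var $ = cheerio.load(body);              // expose the fetched HTML through a jQuery-like API
  var amount = $('.bill-amount').text();   // CSS selectors pick elements out of the page
  console.log('Current bill amount:', amount);
});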
This worked for some sites, but the problem is the JavaScript these companies put on their sites, not my Node.js code: the remaining sites layer on so much complexity that it becomes very hard to figure out what you have to do to get at the login information. For some sites I fought with the request() library for days, to no avail.
Just when I was close to giving up, I found node-phantomjs. This library lets me control the PhantomJS headless WebKit browser from Node (headless means the page is rendered in the background, with no display required). This seemed like a simple solution, but there were still some problems PhantomJS could not avoid:
1. PhantomJS can only tell you whether the page has loaded; it cannot tell you whether a redirect is in progress via JavaScript or meta tags, especially when JavaScript delays it with setTimeout(). A sketch of what you can observe is shown below.
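As a rough illustration, a raw PhantomJS script (run with the phantomjs binary, not node) can at least observe navigations as they happen through the page callbacks, although that still does not tell you whether another redirect is about to fire. The URL here is hypothetical:

var page = require('webpage').create();

page.onNavigationRequested = function (url, type, willNavigate, main) {
  console.log('Navigation requested:', url, '(' + type + ')');
};

page.onUrlChanged = function (targetUrl) {
  console.log('URL changed to:', targetUrl);
};

page.open('https://example.com/login', function (status) {
  console.log('Initial load finished with status:', status);
  // A JavaScript or meta-refresh redirect may still fire after this point.
});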
2. PhantomJS gives you a page-load-started (pageLoadStarted) hook, which lets you deal with the problem above, but only if you keep a count of the number of page loads you expect, decrement it each time a page finishes loading, and handle a timeout for loads that never happen, so that when the count reaches 0 you invoke your callback. This works, but it always feels like a bit of a hack; see the sketch below.
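A rough sketch of that counting hack, again as a standalone PhantomJS script. The number of expected loads, the timeout, and the URL are assumptions for illustration:

var page = require('webpage').create();

var pendingLoads = 2;     // assumed number of loads to expect (e.g. login page + redirect)
var finished = false;

function done() {
  if (finished) { return; }
  finished = true;
  console.log('All expected loads finished (or timed out); safe to scrape now.');
}

// Fallback timeout in case the extra loads never happen.
var timer = setTimeout(done, 10000);

page.onLoadFinished = function (status) {
  pendingLoads -= 1;
  if (pendingLoads <= 0) {
    clearTimeout(timer);
    done();
  }
};

page.open('https://example.com/login');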
3. Each page PhantomJS scrapes needs its own completely separate process, because otherwise cookies cannot be kept apart between pages. If you reuse the same PhantomJS process, the logged-in session of one page leaks into the others. A sketch of the one-process-per-site approach follows.
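Here is a sketch of spinning up a fresh PhantomJS process per site, assuming a node-phantom-style API where every call takes a Node-style callback. The site URLs and the scrapeSite() helper are hypothetical, and error handling is omitted for brevity:

var phantom = require('node-phantom');

function scrapeSite(url, onDone) {
  phantom.create(function (err, ph) {          // a fresh PhantomJS process per site
    ph.createPage(function (err, page) {
      page.open(url, function (err, status) {
        // ... log in and scrape here; cookies live only inside this process ...
        ph.exit();                             // kill the process, and its cookies with it
        onDone();
      });
    });
  });
}

scrapeSite('https://bank-a.example.com', function () {});
scrapeSite('https://bank-b.example.com', function () {});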
4. You cannot download resources with PhantomJS; you can only save the page as a PNG or PDF. That is useful, but it means we have to fall back on request() to download the PDFs.
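Rendering the current page to disk inside a PhantomJS script is a single call; this small sketch uses a made-up URL and filename:

var page = require('webpage').create();
page.paperSize = { format: 'A4', orientation: 'portrait' };  // only affects PDF output
page.open('https://example.com/statement', function (status) {
  page.render('statement.pdf');   // or 'statement.png'
  phantom.exit();
});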
5. Because of the above, I had to find a way to pass the cookies from the PhantomJS session over to the request() library. All it takes is passing the document.cookie string across, parsing it, and injecting it into request's cookie jar, as sketched below.
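A minimal sketch of that hand-off, assuming the cookie string has already been read out of the PhantomJS page (for example via page.evaluate). The cookie values and URLs are hypothetical:

var request = require('request');

// cookieString is what document.cookie returned inside PhantomJS,
// e.g. "SESSIONID=abc123; lang=en"
function makeJarFromCookieString(cookieString, url) {
  var jar = request.jar();
  cookieString.split(';').forEach(function (pair) {
    jar.setCookie(request.cookie(pair.trim()), url);
  });
  return jar;
}

var jar = makeJarFromCookieString('SESSIONID=abc123; lang=en', 'https://example.com/');
request({ url: 'https://example.com/account', jar: jar }, function (err, res, body) {
  // now authenticated with the cookies PhantomJS collected
});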
6. Injecting variables into the browser session is not easy. To do it, I have to build up a string that constructs a JavaScript function.
The Code is as follows:
Robot.prototype.add_page_data = function (page, name, data) {
  page.evaluate(
    "function () { var " + name + " = window." + name + " = " + JSON.stringify(data) + "; }"
  );
};
7. Some sites are littered with calls like console.log(), and console.log needs to be defined so that the output ends up somewhere useful. To accomplish that, I do this:
The Code is as follows:
if (!console.log) {
  var iframe = document.createElement("iframe");
  document.body.appendChild(iframe);
  console = window.frames[0].console;
}
8. It is not easy to tell the browser that I clicked an a tag. To do that, I added the following code:
The Code is as follows:
var clickElement = window.clickElement = function (id) {
  var a = document.getElementById(id);
  var e = document.createEvent("MouseEvents");
  e.initMouseEvent("click", true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
  a.dispatchEvent(e);
};
9. I also needed to cap the maximum concurrency of browser sessions, to make sure the server would not fall over. Even so, that cap is far higher than what the expensive commercial solutions offer.
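One straightforward way to enforce such a cap is a worker queue. This is a sketch, not the actual Hubdoc code, using the async library's queue with an assumed concurrency of 4 and reusing the hypothetical scrapeSite() helper from point 3:

var async = require('async');

// Each task is one scraping job; run at most 4 PhantomJS sessions at a time.
var queue = async.queue(function (task, done) {
  scrapeSite(task.url, done);   // scrapeSite() as sketched under point 3
}, 4);

queue.push({ url: 'https://bank-a.example.com' });
queue.push({ url: 'https://bank-b.example.com' });

queue.drain = function () {
  console.log('All scraping jobs finished.');
};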
After all that work, I had a decent PhantomJS + request scraping solution: log in with PhantomJS first, then hand off to request(), which uses the cookies set in PhantomJS to keep the authenticated session alive. This was a huge win, because we could then use request()'s streams to download the PDF files.
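Downloading a bill as a stream then looks roughly like this. The jar comes from the cookie hand-off sketched in point 5, and the URL and filename are hypothetical:

var fs = require('fs');
var request = require('request');

// jar: the cookie jar built from the PhantomJS session (see point 5)
request
  .get({ url: 'https://example.com/bills/latest.pdf', jar: jar })
  .on('error', function (err) { console.error('Download failed:', err); })
  .pipe(fs.createWriteStream('latest-bill.pdf'));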
The whole point of the design is to make it easy for web developers who know only jQuery and CSS selectors to build scrapers for different web sites. I have not yet proven that idea works, but I believe it will not be long before I do.