Python Crawler Mastery: Handling Dynamic Web Pages

Source: Internet
Author: User

This article is reposted from the ichunqiu (i春秋) community.

0x01 Preface
When developing crawlers we run into plenty of hard problems. Common ones, such as rotating the User-Agent (UA), are outside the scope of this article: since this is meant to be an advanced discussion, there is no point in covering such trivial issues.
0x02 Selenium + PhantomJS
This is a well-worn topic. Nearly everyone I ask can name this solution:
Selenium + PhantomJS (or Firefox, Chrome, and the like)
But people who have actually tried it do not put it into production. The first and biggest problem is that Selenium + PhantomJS is very slow. It is slow because it loads everything on the page: images and the CSS and JS referenced by links are all downloaded, and the entire page is rendered before you can manipulate its elements. The reader may object: Selenium is a module for writing automated test scripts and comes with its own hook functions, and the Selenium API documents how to wait for a particular element to load before returning the page data.
Yes, indeed. We can use Selenium's built-in API to drive the browser through all sorts of operations, such as simulating clicks, filling in forms, and even executing JS. But the biggest problem remains unsolved: at bottom we are still operating a browser. We have to launch the browser (which takes time), visit the page, render it, download the resources, and execute the JS. Each of those steps involves some waiting. It is as if we were doing the job by hand in a browser, only with very precise mouse positioning.
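To make the "wait for an element" workflow concrete, here is a minimal sketch. It is not taken from the original article: the function name `fetch_rendered_html` and its parameters are hypothetical, and because recent Selenium releases no longer ship a PhantomJS driver, the sketch assumes headless Chrome instead. Note that every call still pays the browser start-up and rendering cost described above.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load `url` in a headless browser and return the HTML once
    an element matching `css_selector` is present in the DOM."""
    # Imported lazily so this module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # assumption: Chrome stands in for PhantomJS
    driver = webdriver.Chrome(options=options)  # starting the browser is the slow part
    try:
        driver.get(url)  # downloads the page, CSS, JS and renders everything
        # Explicit wait: block until the element exists or `timeout` seconds pass.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call such as `fetch_rendered_html("http://example.com", "h1")` would return the rendered HTML, but only after a full browser launch and page render.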
Having said all that: although Selenium is not suited to production use, it is not as though there are no other options.
0x03 Execjs
Execjs is a module for executing JS from Python. Hearing that, you may feel a flash of inspiration: "So I can scrape the JS code, control its execution by hand, manipulate whichever elements I want, and get the results I need, all without sacrificing efficiency?"
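As a rough illustration of that idea, the sketch below compiles a JS snippet and calls one of its functions from Python. It assumes the PyExecJS package (imported as `execjs`) and a JS runtime such as Node.js are installed; the helper `run_js_function` and the JS function `makeToken` are hypothetical stand-ins for code scraped from a page.

```python
def run_js_function(source, func_name, *args):
    """Compile a JS snippet and call one of its functions from Python."""
    # Imported lazily; requires PyExecJS plus a JS runtime such as Node.js.
    import execjs

    ctx = execjs.compile(source)        # build an execution context from the source
    return ctx.call(func_name, *args)   # invoke the named JS function


# Hypothetical snippet standing in for JS scraped from a page.
JS_SOURCE = """
function makeToken(user, ts) {
    return user + ":" + (ts * 2);
}
"""

if __name__ == "__main__":
    print(run_js_function(JS_SOURCE, "makeToken", "alice", 21))
```

This works because `makeToken` is pure computation. The snippet runs in a bare JS runtime, with no page attached to it.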
But I have to say that this idea is rather naive. We have a JS engine, but we would still need to build a lot of wheels ourselves. Why? Let me explain step by step:
1. The power of JS lies not in its loose syntax and fault tolerance but in its ability to operate on BOM and DOM objects. For example, suppose a form on a web page is submitted by JS. Can you submit that form simply by running the JS through Execjs? Obviously not. Why? Because Execjs is a standalone module, and we have no way of connecting it to the static HTML document we scraped.
2. To establish that connection, you need to bind the JS to the HTML DOM objects yourself, in the same way the browser binds JS to its DOM tree. And before you can bind anything, you first have to build a DOM tree yourself. That wheel is a very big one. If you really do have plenty of time, how would you go about it? Either hook WebKit, or build your own HTML parser. A brief aside on the latter: I recently wrote a lexer with PLY, intending to build an HTML parser that produces a DOM tree. From that first step I came away feeling that the approach is entirely feasible in theory, but whether you can finish it depends on your perseverance and programming ability.
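To give a sense of the first step in point 2, here is a minimal sketch of building a DOM-like tree from static HTML. It does not use PLY as the author did; it swaps in the standard library's `html.parser` to keep the example short. The names `Node`, `DomBuilder`, and `parse_html` are hypothetical, and the much harder part, binding this tree to a JS engine, is not attempted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in a simplified, hand-built DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant element with the given tag name."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class DomBuilder(HTMLParser):
    """Builds a Node tree from static HTML; no JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into elements that can have children

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(html):
    """Parse an HTML string and return the root of the Node tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = '<form action="/login"><input name="user" value="alice"></form>'
    form = parse_html(page).find_all("form")[0]
    fields = {i.attrs["name"]: i.attrs.get("value", "") for i in form.find_all("input")}
    print(form.attrs["action"], fields)
```

Even this toy version shows the size of the wheel: it ignores malformed markup, scripts, and everything a real browser does to keep the tree and the JS objects in sync.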
0x04 Ghost
I personally think quite highly of Ghost, though it is not perfect. To me it feels like Selenium and PhantomJS rolled into one. Ghost is built on Qt WebKit, so installation forces you to install PySide or PyQt4. Frankly, I find it hard to understand why something with no graphical interface uses Qt and PySide as its engine. Is it really so hard to build a standalone browser engine? Still, the installation burden does not matter much; after all, I find it easier to use than Selenium with PhantomJS.
Now let's talk about some of Ghost's issues.
First, one benefit of Ghost is that we do not need a browser binary on the PATH, so no time is spent launching a browser: Ghost is a full-featured, lightweight browser driven from Python (on top of Qt WebKit) with no graphical interface.
Moreover, Ghost has an initialization option to skip downloading images, but there is no way to stop it from downloading JS and CSS. That is forgivable: those resources are actually used, and the JS must be downloaded before it can be filtered locally.
Ghost also provides an API whose features do not differ much from Selenium's: it can handle forms and execute AJAX to load dynamic pages. So is Ghost the perfect solution?
In fact it has its own shortcoming: we still cannot fully control every stage of the process. For example, suppose we only want it to parse the DOM tree without dynamically executing the JS, so that we can take the DOM tree and operate on it by hand. There is no way to do that. It is not entirely hopeless, though: one large Chinese company hooked a browser to detect XSS, and we may cover that idea in a future article. How to implement it depends on each person's programming skill.
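For reference, the basic Ghost workflow described in this section (skip images, open a page, evaluate JS in it) might look like the sketch below. It assumes the Ghost.py 0.2.x interface, where `Ghost().start()` returns a session; older releases exposed these methods on the `Ghost` object itself, so treat the exact calls and the `download_images` keyword as assumptions to check against your installed version. The function name `fetch_title_with_ghost` is hypothetical.

```python
def fetch_title_with_ghost(url):
    """Open `url` in a headless Qt WebKit session and return
    the HTTP status together with the page title."""
    # Imported lazily; Ghost.py needs PySide or PyQt4 to be installed.
    from ghost import Ghost

    ghost = Ghost()
    # Assumption: download_images=False skips images; JS and CSS are still fetched.
    with ghost.start(download_images=False) as session:
        page, extra_resources = session.open(url)
        # Run JS inside the loaded page and read the result back into Python.
        title, _ = session.evaluate("document.title")
        return page.http_status, title
```

No browser binary is launched here, but the page's JS still runs as part of loading, which is exactly the lack of fine-grained control noted above.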


0x05 Principle Summary
As the reader will have gathered, approaches to collecting information from dynamic web pages (those loaded via JS) fall into three categories:
1. Solutions that drive a real browser (fine for test environments, unsuitable for large-scale collection).
2. Solutions based on deep control of JS execution (fastest, hardest to write).
3. WebKit-based solutions (a middle ground).

This article was provided by ichunqiu (i春秋): http://bbs.ichunqiu.com/thread-11098-1-1.html
