Capturing JS-rendered web pages with PhantomJS (Python code)

Source: Internet
Author: User

I recently needed to crawl a website whose pages are generated by JavaScript rendering. Ordinary crawler frameworks cannot handle that, so I thought of putting PhantomJS behind a proxy to do the rendering.

There does not seem to be a ready-made third-party library for calling PhantomJS from Python (if there is one, please let me know). After looking around, I found that only pyspider offers a ready-made solution.

After a short trial, pyspider struck me as a crawler tool aimed at beginners. It is like a fussy mother-in-law: sometimes meticulous, sometimes nagging. A lightweight gadget should be more welcome. I also have a selfish motive: I want to use it with my favorite BeautifulSoup instead of learning PyQuery (which pyspider uses to parse HTML), and I would rather not endure the poor experience of writing Python in a browser.

So I spent an afternoon splitting the part of pyspider that implements the PhantomJS proxy out into a small crawler module. I hope you like it (thanks to binux!).

Preparations

You need PhantomJS, of course! (On Linux, it is best to keep it running under supervisord; PhantomJS must stay up for as long as you are crawling.)
Start it with phantomjs_fetcher.js from the project directory: phantomjs phantomjs_fetcher.js [port]
Install the tornado dependency (the module uses tornado's httpclient).
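
Since PhantomJS has to be up before any fetch can work, a quick pre-flight check can save some confusion. The following is only a minimal sketch using the standard library: the helper name `proxy_is_listening` and the port 12306 are made up for illustration, so substitute whatever port you passed to phantomjs_fetcher.js. It only tells you that something accepts TCP connections on that port, not that it is actually PhantomJS.

```python
import socket


def proxy_is_listening(host="localhost", port=12306, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    The default port is an arbitrary example, not a value taken from the module.
    """
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if proxy_is_listening():
        print("proxy port is open")
    else:
        print("nothing is listening; start phantomjs phantomjs_fetcher.js [port] first")
```

A check like this fits well at the start of a crawl script, before the fetcher is created.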

Calling it is super simple

    from tornado_fetcher import Fetcher

    # create a crawler
    >>> fetcher = Fetcher(
    ...     user_agent='phantomj',  # simulate the browser's User-Agent
    ...     phantomjs_proxy='http://localhost:100',  # PhantomJS address
    ...     poolsize=10,  # maximum number of httpclient instances
    ...     async=False,  # synchronous or asynchronous (`async` is a reserved word from Python 3.7 on, so this needs an older interpreter)
    ... )
    # connect to PhantomJS and render the JS!
    >>> fetcher.fetch(url)
    # run an extra JS script after rendering succeeds (wrap it in a function!)
    >>> fetcher.fetch(url, js_script='function() { setTimeout("window.scrollTo(1000)"); }')
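
To make the moving parts a little more concrete, here is a rough sketch of what a request to the PhantomJS proxy might look like without going through the Fetcher class. It assumes the proxy accepts an HTTP POST whose body is a JSON task description, which is how pyspider's PhantomJS fetcher works as far as I know. The field names (`url`, `headers`, `js_script`), the port 12306, and the helpers `wrap_js` and `build_render_request` are illustrative assumptions, not the module's confirmed API; check phantomjs_fetcher.js for what it actually reads.

```python
import json
import urllib.request


def wrap_js(body):
    """Wrap a JS snippet in a function, since the script must be passed as one."""
    return "function() { %s }" % body


def build_render_request(proxy_url, page_url, js_body=None, user_agent="phantomjs"):
    """Build (but do not send) a POST request describing a page to render.

    The JSON field names are assumptions for this sketch.
    """
    task = {"url": page_url, "headers": {"User-Agent": user_agent}}
    if js_body is not None:
        task["js_script"] = wrap_js(js_body)
    data = json.dumps(task).encode("utf-8")
    return urllib.request.Request(
        proxy_url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical proxy address and page URL, for illustration only.
req = build_render_request(
    "http://localhost:12306",
    "http://example.com/",
    js_body="window.scrollTo(0, 1000);",
)
print(req.get_method())
print(json.loads(req.data.decode("utf-8"))["js_script"])
```

Nothing is sent here; passing `req` to `urllib.request.urlopen` would do that, and only makes sense once PhantomJS is running. The rendered HTML in the response can then go straight into BeautifulSoup.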

Code: https://github.com/2shou/PhantomjsFetcher
