How do Python crawlers get URLs and web content generated by JS?

Source: Internet
Author: User
I want to try crawling the BUPT People's Forum, but the page source is almost all JS, with almost none of the information I want.

Reply content:

Today I stumbled upon PyV8, and it feels like what you want.
It embeds a JS runtime directly, which means you can execute the JS code on the page from within Python to get the content you need.
Reference:
http://www.silverna.org/blog/?p=252
https://code.google.com/p/pyv8/

I read the page's JS source directly, analyze it, and then crawl accordingly.
For example, if the page uses AJAX to request a JSON file, I crawl the page, extract the parameters the AJAX call needs, request the JSON endpoint directly, then decode and process the data and store it.
If instead you run all the JS on the page (as a browser does) and then read the final HTML DOM tree, performance is very poor and I don't recommend it: you will burn a lot of CPU and end up with very low crawl throughput. Running JS requires a JS engine; over plain HTTP, Python only receives the raw HTML, CSS, and JS source.
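A rough sketch of that flow using the requests library (the URL, regex, parameter name, and JSON field names below are invented placeholders, not a real forum API):

import re
import requests

# Step 1: fetch the page that embeds the AJAX call.
page = requests.get('http://example.com/board')

# Step 2: extract the parameter the page's JS would have sent.
# (The regex and parameter name are made up for illustration.)
board_id = re.search(r"boardId\s*=\s*'(\w+)'", page.text).group(1)

# Step 3: hit the JSON endpoint directly, skipping the JS entirely.
data = requests.get('http://example.com/api/posts',
                    params={'board': board_id}).json()
for post in data['posts']:
    print(post['title'], post['url'])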
I don't know of any JS engine written in Python; presumably the demand is too small.
I usually use PhantomJS or CasperJS; these browser engines do the crawling.
I write JS code directly inside them to do the DOM manipulation and analysis, and write the results to a file.
Then I have Python invoke the program and read the output to get the content.
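A minimal sketch of that division of labor, assuming a PhantomJS script (dump_dom.js is a hypothetical name) that loads a URL, runs its JS, and prints the rendered HTML to stdout:

import subprocess

# Run the (hypothetical) PhantomJS script and capture the rendered HTML.
result = subprocess.run(
    ['phantomjs', 'dump_dom.js', 'http://example.com/board'],
    capture_output=True, text=True, timeout=60)
rendered_html = result.stdout
# From here, parse rendered_html with your normal Python code.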

Last year I actually crawled this kind of data; I was in a hurry, so my method was rather ugly. PyQt has a module that embeds a browser engine (WebKit, if I remember correctly; check the docs, a few lines of code are enough). Within one run of the program, the first (and only the first) returned result is the page after its JS has run. So I wrote a py script to fetch and parse one page, then a Windows batch script that looped over that py script with command-line parameters, and finally got all the data.
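A minimal sketch of that idea using PyQt4's QtWebKit bindings (the class below is my own outline under that assumption, not the answerer's exact code):

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Load a URL, let its JS run, then capture the final HTML.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._on_load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until quit() is called below

    def _on_load_finished(self, ok):
        self.html = self.mainFrame().toHtml()  # HTML after JS has run
        self.app.quit()

html = Render('http://example.com/board').html  # placeholder URL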

The method is a bit dirty, but the data came out fine. For a single site, you can watch the network requests, find the ones that actually carry the content, and send those directly. If you need something general-purpose, you have to use a headless browser such as PhantomJS. I was also crawling the BUPT People's Forum, by the way.

The fancy method: run a browser engine such as PhantomJS, use it to export the rendered HTML, and then parse that HTML with Python. Don't do the parsing inside PhantomJS itself. I know that would be easy, so why not? Because then it isn't a Python crawler. Keeping all the parsing in Python is more uniform, on the assumption that you are also still crawling non-JS pages.

The ordinary method: parse the AJAX requests. Even if the page is JS-rendered, the data is still transmitted over HTTP. What, you can't simulate it? Go read the HTTP protocol again. In practice it takes some experience; for example, requesting post text from the BUPT People's Forum requires the header
X-Requested-With: XMLHttpRequest
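With the requests library, adding that header takes one line (the URL is a placeholder):

import requests

headers = {'X-Requested-With': 'XMLHttpRequest'}  # marks the request as AJAX
resp = requests.get('http://example.com/ajax-endpoint', headers=headers)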
Here's an example:
the job list on Lagou.com.

After clicking on Android, the browser sent a few parameters up to Lagou's server:
one is first=true, one is kd=Android (the keyword), and one is pn=1 (the page number).

So we can imitate this step and construct the same data to simulate the user's click:

post_data = {'first':'true','kd':'Android','pn':'1'}
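Posting that data with the requests library might look like this (the endpoint URL is a placeholder; take the real one from the browser's network panel):

import requests

post_data = {'first': 'true', 'kd': 'Android', 'pn': '1'}
resp = requests.post('http://example.com/jobs/positionAjax',  # hypothetical URL
                     data=post_data,
                     headers={'X-Requested-With': 'XMLHttpRequest'})
print(resp.json())  # decode the JSON job list from the response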
Although this is an old question and the OP seems to have solved it already, many of the answers take an approach that is a bit too heavyweight, so let me share a more efficient, less resource-hungry one. Since the question doesn't specify what data is needed, I'll use the links and titles of all posts on the forum's home page as the example.

First of all, remember that a browser environment is very expensive in memory and CPU, so crawler code should avoid simulating a browser whenever possible. For some front-end-rendered pages, even though the HTML source doesn't show the data we need, it is very likely that a second request fetches the pure data (quite possibly in JSON format); then we don't need to simulate a browser at all, and we also save the cost of parsing HTML.

Next, open the BUPT People's Forum home page. Its HTML source doesn't contain the displayed articles, so the content is probably loaded into the page asynchronously via JS. Watching the requests made while the home page loads in the browser's developer tools (Command+Option+I in Chrome on OS X, F12 on Windows/Linux), the relevant request is easy to find.
The selected request's response is the list of article links on the first page, and in the Preview tab you can see the rendered preview.
At this point, we are sure that this request returns the home page's articles and their links.

In the Headers tab you can see this request's headers and parameters; by reproducing the request in Python we get the same response. Together with BeautifulSoup or a similar library to parse the HTML, you can extract the content you want.
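A minimal sketch of the whole flow (the URL, headers, and tag selection are placeholders; copy the real values from your own network panel):

import requests
from bs4 import BeautifulSoup

url = 'http://example.com/ajax-article-list'   # copy from the network panel
headers = {
    'X-Requested-With': 'XMLHttpRequest',      # many AJAX endpoints expect this
    'User-Agent': 'Mozilla/5.0',               # some sites reject bare clients
}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
for link in soup.find_all('a'):                # adjust the selection to the page
    print(link.get('href'), link.get_text(strip=True))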

For how to simulate the request and how to parse the HTML, see my column, where there is a detailed introduction; I won't repeat it here.

In this way the data can be crawled without simulating a browser environment, which greatly reduces memory and CPU consumption and greatly improves crawl speed. When writing a crawler, remember: don't emulate a browser environment unless you have to.

If you are on Windows, you can try invoking the system's WebBrowser control; IE itself also exposes an automation interface. Both of these render the whole page, though, which wastes some performance; to speed IE up you can turn off image download and display, then simulate real behavior through clicks and similar actions. Or just Google PhantomJS.
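For the IE route, a minimal sketch with the pywin32 COM bindings (the URL is a placeholder, and the DOM property used to read the page back is one common choice, not the only one):

import time
import win32com.client

ie = win32com.client.Dispatch('InternetExplorer.Application')
ie.Visible = False
ie.Navigate('http://example.com/board')        # placeholder URL
while ie.Busy:                                 # wait for the page and its JS
    time.sleep(0.5)
html = ie.Document.documentElement.outerHTML   # HTML after rendering
ie.Quit()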