Is there a good way for crawlers to handle web pages rendered with JavaScript?

Source: Internet
Author: User
Tags: network, function
The language is Python. I am currently trying to crawl the 同花顺 (10jqka) stock list at http://q.10jqka.com.cn/stock/fl/#refCountId=db_5093800d_645,db_509381c1_860, and I am stuck on its JavaScript again. Only 52 records are displayed per page, so you have to click the page numbers at the bottom to see all the stock data. That pagination is written in JavaScript and cannot be handled directly with libraries such as urllib2. I tried WebKit (Ghost.py) to simulate the clicks. The code is as follows:
page, resources = ghost.open('http://q.10jqka.com.cn/stock/fl/#refCountId=db_5093800d_645,db_509381c1_860')
page, resources = ghost.evaluate("document.getElementById('hd').nextSibling.getElementsByTagName('p')[13].getElementsByTagName('a')[7].click();", expect_loading=True)

The system reports "Unable to load requested page", or the returned page is None. What is wrong with the code, and what should I do? (I have searched Baidu and Google for a long time, but there is not much documentation on Ghost.py, so I could not solve it.)
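A likely cause, for what it's worth: if that click only fires an AJAX update rather than a real page navigation, evaluate(..., expect_loading=True) waits for a page load that never happens and times out with exactly that "Unable to load requested page" message. Below is a minimal sketch of an alternative, assuming the classic Ghost.py API (Ghost(), open(), click(), sleep(), content); the '.page_nav a' selector is a made-up placeholder for the real pagination link:

# Minimal sketch, assuming the classic Ghost.py API; '.page_nav a' is a
# hypothetical placeholder for the real pagination link's CSS selector.
from ghost import Ghost

ghost = Ghost(wait_timeout=30)       # JS-heavy pages need a generous timeout
page, resources = ghost.open('http://q.10jqka.com.cn/stock/fl/')
ghost.click('.page_nav a')           # the click triggers AJAX, not navigation...
ghost.sleep(3)                       # ...so just give the table time to re-render
print ghost.content[:200]            # HTML of the main frame after the update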

Also, is there any better solution to the problem of crawling dynamic web pages? Simulating a browser with WebKit seems to slow the crawl down, which is probably not the best strategy.

Reply: Use a headless WebKit, e.g. the open-source PhantomJS.

Reply: The ability to parse pages and run their scripts so that dynamic content can be indexed is one of the important capabilities of a modern crawler. See:

Google's Crawler Now Understands JavaScript: What Does This Mean For You?

Reply: Your crawler hardly needs to deal with the JS at all. Open Chrome's developer tools, switch to the Network tab, and watch the requests that are sent; analyze each URL, work out the pattern, and then have your program issue the same requests directly. The first step is getting comfortable with Chrome's Network panel. Clicking through a few pages, the Network tab shows the following:

URL of the first page of data:

http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/1/hsa/quote
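Once the pattern is clear, the data pages can be fetched without any JS engine at all. Here is a minimal sketch with urllib2, matching the question's Python 2 setting; it assumes (not confirmed in the thread) that the 1 in the URL above is the page index, and the header values are merely illustrative:

# Sketch only: assumes the '1' in the sample URL is the page index and
# that the endpoint serves each page of the stock table directly.
import urllib2

BASE = 'http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/%d/hsa/quote'

def fetch_page(page_no):
    req = urllib2.Request(BASE % page_no, headers={
        'User-Agent': 'Mozilla/5.0',          # some endpoints reject bare urllib2
        'Referer': 'http://q.10jqka.com.cn/', # look like the in-page AJAX call
    })
    return urllib2.urlopen(req).read()

for page_no in range(1, 4):          # raise the bound to the real page count
    print fetch_page(page_no)[:200]  # first bytes of each page's payload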
Reply: I have a good example of this.
Requirement: crawl the comics on 爱漫画 ("love comics", a comics site).
Problem: the image names are irregular. Each image's file name and URL are generated by complicated JS code and loaded dynamically, and the JS differs from page to page with no uniform pattern.
Solution: the PyV8 library. Run the page's JS code, add a global variable that tracks the image file name and URL, and then read that variable from Python to obtain both.
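A hedged sketch of that PyV8 pattern; the embedded script and the img_name / img_url variable names are invented for illustration, since the real script has to be lifted from the page:

import PyV8

# Stand-in for the obfuscated naming script extracted from a comic page.
page_js = """
var img_name = 'ch01_' + (3 * 7) + '.jpg';
var img_url  = 'http://img.example.com/' + img_name;
"""

with PyV8.JSContext() as ctxt:    # enter()/leave() handled by the context manager
    ctxt.eval(page_js)            # let V8 run the page's naming logic
    print ctxt.locals.img_name    # -> ch01_21.jpg
    print ctxt.locals.img_url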

Reply: The full write-up is here: [Original] A rather hacky crawler I wrote recently.

Reply: You could give berserkJS a try...... though this kind of tool only goes so far.
Reply: If working out the requests is too troublesome, use Selenium to simulate the operations you would perform in the browser. It can crawl the data too, just slowly. Otherwise, open Chrome's developer console or Firefox's Firebug, go to the Network tab, and analyze the URLs the page requests via AJAX.
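To illustrate the Selenium route, a minimal hedged sketch; the locators (generic table rows, a Chinese "下一页" next-page link) are assumptions about the page and need to be checked against the real DOM:

# Sketch only: locator details are guesses; inspect the live page first.
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://q.10jqka.com.cn/stock/fl/')
for _ in range(3):                                     # walk the first few pages
    for row in driver.find_elements_by_css_selector('table tr'):
        print row.text                                 # one stock record per row
    driver.find_element_by_link_text(u'下一页').click()  # hypothetical next-page link
    time.sleep(2)                                      # let the JS re-render the table
driver.quit()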

Reply: For a crawler aimed at one specific website, simulating the JavaScript is too laborious and inefficient. Just work out the site's URL structure and construct the requests directly.

Reply: Selenium with Python.

Reply: Just as an aside, it seems more convenient to use UDFs and write scripts to process data that already exists. Do you really have to crawl all of it yourself?

Reply: PhantomJS's raw API is fairly painful; I recommend CasperJS, a package built on top of it.

Reply: A better crawler should solve two problems:
1. It can parse dynamic web pages, such as waterfall-flow (infinite-scroll) sites.
2. It can cope with being blocked by the target website.
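On the second point, a small hedged sketch of basic anti-blocking hygiene (randomized pacing plus a rotating User-Agent); the header strings and timings are arbitrary examples, not values from the thread:

# Sketch: naive anti-blocking hygiene with randomized pacing and UA rotation.
import random
import time
import urllib2

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36',
]

def polite_fetch(url):
    time.sleep(random.uniform(1.0, 3.0))           # spread requests out
    req = urllib2.Request(url, headers={
        'User-Agent': random.choice(USER_AGENTS),  # vary the fingerprint a little
    })
    return urllib2.urlopen(req).read()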
