Is there a good way for crawlers to handle web pages rendered with JavaScript?

Source: Internet
Author: User
Tags: network, function
The language is Python. I am currently trying to crawl the 同花顺 (10jqka) stock list at http://q.10jqka.com.cn/stock/fl/#refCountId=db_5093800d_645,db_509381c1_860, and I am stuck on its JavaScript again. Only 52 records are displayed per page, so you have to click the page numbers at the bottom to see all the stock data. That pagination is written in JavaScript and cannot be handled directly with libraries such as urllib2. I tried WebKit (Ghost.py) to simulate the clicks. The code is as follows:
page, resources = ghost.open('http://q.10jqka.com.cn/stock/fl/#refCountId=db_5093800d_645,db_509381c1_860')
page, resources = ghost.evaluate("document.getElementById('hd').nextSibling.getElementsByTagName('p')[13].getElementsByTagName('a')[7].click();", expect_loading=True)

The system reports "Unable to load requested page", or the returned page is None. What is wrong with the code, and what should I do? (I have searched Baidu and Google for a long time, but there is not much documentation on Ghost.py, so I could not solve it.)
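A likely cause, for what it's worth: if that click only fires an AJAX update rather than a real page navigation, evaluate(..., expect_loading=True) waits for a page load that never happens and times out with exactly that "Unable to load requested page" message. Below is a minimal sketch of an alternative, assuming the classic Ghost.py API (Ghost(), open(), click(), sleep(), content); the '.page_nav a' selector is a made-up placeholder for the real pagination link:

# Minimal sketch, assuming the classic Ghost.py API; '.page_nav a' is a
# hypothetical placeholder for the real pagination link's CSS selector.
from ghost import Ghost

ghost = Ghost(wait_timeout=30)       # JS-heavy pages need a generous timeout
page, resources = ghost.open('http://q.10jqka.com.cn/stock/fl/')
ghost.click('.page_nav a')           # the click triggers AJAX, not navigation...
ghost.sleep(3)                       # ...so just give the table time to re-render
print ghost.content[:200]            # HTML of the main frame after the update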

Also, is there any better solution to the problem of crawling dynamic web pages? Simulating a browser with WebKit seems to slow the crawl down, which is probably not the best strategy.

Reply: Use a headless WebKit, e.g. the open-source PhantomJS.

Reply: The ability to parse pages and run their scripts so that dynamic content can be indexed is one of the important capabilities of a modern crawler. See:

Google's Crawler Now Understands JavaScript: What Does This Mean For You?

Reply: Your crawler hardly needs to deal with the JS at all. Open Chrome's developer tools, switch to the Network tab, and watch the requests that are sent; analyze each URL, work out the pattern, and then have your program issue the same requests directly. The first step is getting comfortable with Chrome's Network panel. Clicking through a few pages, the Network tab shows the following:

URL of the first page of data:

http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/1/hsa/quote
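Once the pattern is clear, the data pages can be fetched without any JS engine at all. Here is a minimal sketch with urllib2, matching the question's Python 2 setting; it assumes (not confirmed in the thread) that the 1 in the URL above is the page index, and the header values are merely illustrative:

# Sketch only: assumes the '1' in the sample URL is the page index and
# that the endpoint serves each page of the stock table directly.
import urllib2

BASE = 'http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/%d/hsa/quote'

def fetch_page(page_no):
    req = urllib2.Request(BASE % page_no, headers={
        'User-Agent': 'Mozilla/5.0',          # some endpoints reject bare urllib2
        'Referer': 'http://q.10jqka.com.cn/', # look like the in-page AJAX call
    })
    return urllib2.urlopen(req).read()

for page_no in range(1, 4):          # raise the bound to the real page count
    print fetch_page(page_no)[:200]  # first bytes of each page's payload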
Reply: I have a good example of this.
Requirement: crawl the comics on 爱漫画 ("love comics", a comics site).
Problem: the image names are irregular. Each image's file name and URL are generated by complicated JS code and loaded dynamically, and the JS differs from page to page with no uniform pattern.
Solution: the PyV8 library. Run the page's JS code, add a global variable that tracks the image file name and URL, and then read that variable from Python to obtain both.
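A hedged sketch of that PyV8 pattern; the embedded script and the img_name / img_url variable names are invented for illustration, since the real script has to be lifted from the page:

import PyV8

# Stand-in for the obfuscated naming script extracted from a comic page.
page_js = """
var img_name = 'ch01_' + (3 * 7) + '.jpg';
var img_url  = 'http://img.example.com/' + img_name;
"""

with PyV8.JSContext() as ctxt:    # enter()/leave() handled by the context manager
    ctxt.eval(page_js)            # let V8 run the page's naming logic
    print ctxt.locals.img_name    # -> ch01_21.jpg
    print ctxt.locals.img_url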

Reply: The full write-up is here: [Original] A rather hacky crawler I wrote recently.

Reply: You could give berserkJS a try...... though this kind of tool only goes so far.
Reply: If working out the requests is too troublesome, use Selenium to simulate the operations you would perform in the browser. It can crawl the data too, just slowly. Otherwise, open Chrome's developer console or Firefox's Firebug, go to the Network tab, and analyze the URLs the page requests via AJAX.
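To illustrate the Selenium route, a minimal hedged sketch; the locators (generic table rows, a Chinese "下一页" next-page link) are assumptions about the page and need to be checked against the real DOM:

# Sketch only: locator details are guesses; inspect the live page first.
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://q.10jqka.com.cn/stock/fl/')
for _ in range(3):                                     # walk the first few pages
    for row in driver.find_elements_by_css_selector('table tr'):
        print row.text                                 # one stock record per row
    driver.find_element_by_link_text(u'下一页').click()  # hypothetical next-page link
    time.sleep(2)                                      # let the JS re-render the table
driver.quit()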

Reply: For a crawler aimed at one specific website, simulating the JavaScript is too laborious and inefficient. Just work out the site's URL structure and construct the requests directly.

Reply: Selenium with Python.

Reply: Just as an aside, it seems more convenient to use UDFs and write scripts to process data that already exists. Do you really have to crawl all of it yourself?

Reply: PhantomJS's raw API is fairly painful; I recommend CasperJS, a package built on top of it.

Reply: A better crawler should solve two problems:
1. It can parse dynamic web pages, such as waterfall-flow (infinite-scroll) sites.
2. It can cope with being blocked by the target website.
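On the second point, a small hedged sketch of basic anti-blocking hygiene (randomized pacing plus a rotating User-Agent); the header strings and timings are arbitrary examples, not values from the thread:

# Sketch: naive anti-blocking hygiene with randomized pacing and UA rotation.
import random
import time
import urllib2

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36',
]

def polite_fetch(url):
    time.sleep(random.uniform(1.0, 3.0))           # spread requests out
    req = urllib2.Request(url, headers={
        'User-Agent': random.choice(USER_AGENTS),  # vary the fingerprint a little
    })
    return urllib2.urlopen(req).read()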
