The language is Python. The stock market page I want to crawl (http://q.10jqka.com.cn/stock/fl/#refCountId=db_5093800d_645,db_509381c1_860) is again blocked by JavaScript: only 52 entries are displayed per page, so you must click the page numbers at the bottom to see all the stock data. The pagination is implemented in JavaScript and cannot be handled directly with libraries such as urllib2. I tried WebKit (Ghost.py) to simulate the clicks. The code is as follows:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://q.10jqka.com.cn/stock/fl/')
page, resources = ghost.evaluate(
    "document.getElementById('hd').nextSibling.getElementsByTagName('p')[13].getElementsByTagName('a')[7].click();",
    expect_loading=True)
It either prompts "Unable to load requested page", or the returned page is None. What is wrong with the code, and what should I do? (I have searched Baidu and Google for a long time, but there is very little documentation on Ghost.py, so I could not solve it.)
Also, is there a better approach to crawling dynamic web pages? Simulating a browser with WebKit seems to slow crawling down considerably, so it is probably not the best strategy.

Reply content:

Headless WebKit engines, such as the open-source PhantomJS.
Being able to parse and run scripts on pages to index dynamic content is one of the important functions of modern crawlers.
Google's Crawler Now Understands JavaScript: What Does This Mean For You?
Your crawl actually has little to do with JS. Just look at the Network requests the page sends, analyze each URL to find the pattern, and then have your program simulate those requests. First, get good at using Chrome's Network tool: click through a few pages and watch the Network panel. It shows the following:
URL of the first page of data:
http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/1/hsa/quote
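Once the pattern is clear, you can fetch every page without touching JavaScript at all. Below is a minimal sketch, assuming (not confirmed from the site) that the "1" after /desc/ in the URL above is the page index and that later pages simply increment it (.../desc/2/hsa/quote, and so on):

import requests

# Hedged assumption: the path segment after /desc/ is the page number.
BASE = 'http://q.10jqka.com.cn/interface/stock/fl/zdf/desc/{page}/hsa/quote'

for page in range(1, 6):  # first five pages, as a demo
    resp = requests.get(BASE.format(page=page),
                        headers={'User-Agent': 'Mozilla/5.0'})
    resp.raise_for_status()
    print(resp.text[:200])  # inspect the raw payload to work out its format

Each response can then be parsed with whatever HTML or JSON parser fits the payload.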
I have a good example.
Requirement: crawl the comics on the "Love Comics" site.
Problem: the image names are irregular. The file name and URL of each image are generated by complicated js code, and the images are loaded dynamically. The js code comes in all sorts of patterns, with no uniform structure.
Solution: the PyV8 library. Read in the js code, add a global variable to track the image file name and URL, and then let Python read that variable back to obtain the file name and URL.
The full write-up is here:
[Original] A rather hacky crawler I wrote recently
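For illustration, here is a minimal sketch of that PyV8 technique. The embedded js is a hypothetical stand-in for the site's obfuscated generator, and the globals picname and picurl are the tracking hook described above, not names from the original post:

import PyV8

js_from_page = """
var picname = '';  // globals injected as the tracking hook
var picurl  = '';
function gen() {   // stand-in for the site's obfuscated generator code
    picname = '001.jpg';
    picurl  = 'http://example.com/imgs/' + picname;
}
gen();
"""

with PyV8.JSContext() as ctx:
    ctx.eval(js_from_page)     # run the page's js inside embedded V8
    print(ctx.locals.picname)  # Python reads the tracked globals back
    print(ctx.locals.picurl)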
Can I mention berserkJS?......
A site can hardly block this kind of tool anyway.
If analyzing the requests is too troublesome, use Selenium to simulate the operations you would perform in the browser. It can crawl the data too, just slowly. Otherwise, open Chrome's developer console or Firefox's Firebug, go to the Network tab, and analyze the URLs hit by the ajax requests.
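A minimal Selenium sketch of that click-through approach, assuming chromedriver is on the PATH; the selector and the '下一页' ("next page") link text are illustrative guesses, not taken from the actual page:

# -*- coding: utf-8 -*-
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://q.10jqka.com.cn/stock/fl/')
rows = driver.find_elements_by_css_selector('table tr')  # scrape current page
print(len(rows))
driver.find_element_by_link_text(u'下一页').click()       # click "next page"
# ... re-scrape, repeat for each page ...
driver.quit()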
For a crawler aimed at one specific website, simulating the JavaScript operations is laborious and inefficient. Just work out the site's URL structure and construct the requests directly.

Selenium with Python
Just as an aside: do you really have to crawl all of the data yourself? Writing small scripts (UDFs) over an existing data source to compute what you need seems more convenient. Also, the PhantomJS API is rather painful to use; CasperJS, a wrapper built on top of it, is recommended instead. A better crawler should solve two problems (see the sketch after this list for the second):
1. Parsing dynamic web pages, such as waterfall-layout sites.
2. Not getting blocked by the website.
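As a sketch of the second point only (these measures are my assumptions, not from the original answer): two common courtesy measures are throttling the request rate and rotating the User-Agent header:

import random
import time
import requests

# A couple of illustrative User-Agent strings to rotate between.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36',
]

def polite_get(url):
    time.sleep(random.uniform(1.0, 3.0))  # throttle: wait 1-3 s per request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)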