I want to crawl content from a website, but much of it is generated by JavaScript. Is there a library that can parse or execute JS so that such pages are easy to crawl? Examples are product information on shopping sites and QQ Zone content. Any language is fine as long as it allows quick development. Thank you.
Reply content:
This takes more than parsing JS; you need a browser engine!
Some recommendations:
- QtWebKit: has Python and C++ support, as far as I know
- PhantomJS: supports JavaScript and CoffeeScript, with Python support as well; also built on the WebKit engine
- SlimerJS: supports JavaScript; uses the Gecko engine, the same as Firefox, and can also run on top of Firefox
- CasperJS: supports JavaScript; a further wrapper on top of the previous two
I suspect your problem may not need anything that heavy.
You want to grab page content that you know comes from JS, so where does that JS get it? The data is either embedded in the page itself or returned as JSON by an Ajax request.
Find the JS or the response that contains what you need. If it is JSON, use a JSON parser; if it is simple JS, a regular expression is enough to extract it.
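The lightweight approach above can be sketched in Python as follows. This is only an illustration under assumptions: the page text, the variable name `productInfo`, and the helper `extract_js_object` are all made up for the example, and it only works when the embedded object literal is strict JSON (double-quoted keys, no trailing commas).

```python
import json
import re

# Hypothetical page source: the product data is assigned to a JS variable
# inside an inline <script> tag instead of appearing in the HTML body.
PAGE = """
<html><body>
<div id="product"></div>
<script>
var productInfo = {"id": 1001, "name": "Example phone", "price": 1999.0};
render(productInfo);
</script>
</body></html>
"""


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var_name` in the page, parsed."""
    # Use a regular expression to locate the start of the assignment.
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError("variable %r not found in page" % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the ';' and the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


product = extract_js_object(PAGE, "productInfo")
print(product["name"], product["price"])
```

If the data comes from an Ajax endpoint that returns JSON instead, the response body can usually be passed straight to `json.loads` with no regular expression at all.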
PhantomJS may be the best solution for you. CasperJS is built on PhantomJS, so it can also be a useful tool for grabbing page content created by JavaScript or Ajax.
Try Node.js.
From your description, it sounds like you are fetching pages whose content is produced by JS, so an ordinary page fetch brings back an empty shell with nothing in it. Right?
In that case I suggest a "headless browser". My first pick is PhantomJS, mentioned in the replies above. It is essentially a browser without a user interface that you drive programmatically; it can interact with your own code and hand back the final generated HTML, among other things.
Use Node.js directly and execute the returned content.
In this situation I usually read the JS code myself, find the relevant part, and then implement it on my own. Java also seems to have a library that can execute JS code. For example, when I simulated a Sina Weibo login, I extracted the site's JS encryption function, executed it in my own code, and used the result to complete the simulated request.
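As a rough sketch of the "read the JS and reimplement it yourself" idea, here is a Python port of a made-up signing function. The JS shown in the comment is hypothetical and is not the code of Sina Weibo or any real site; the names `sign_password`, `server_time`, and `nonce` are assumptions chosen for illustration.

```python
import hashlib

# Hypothetical JS found on a login page (invented for this example):
#   function signPassword(password, serverTime, nonce) {
#       return sha1(sha1(password) + serverTime + nonce);
#   }


def sha1_hex(text):
    """Hex SHA-1 digest of a string, matching a typical JS sha1() helper."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def sign_password(password, server_time, nonce):
    """Python port of the hypothetical signPassword() above."""
    # JS string concatenation converts the number to a string, so do the same.
    return sha1_hex(sha1_hex(password) + str(server_time) + nonce)


# The signed value would then be sent in the simulated login request.
signed = sign_password("secret", 1700000000, "ABC123")
print(signed)
```

The port is only worth doing when the function is short and uses standard primitives; for large or obfuscated scripts, running the original JS in an engine or a headless browser is usually less work.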