I want to scrape content from websites, but much of it is generated by JavaScript. Is there a library that can execute a page's JavaScript so that I can easily capture the resulting HTML? Examples include product information on shopping sites and QQ Space content. Any language is fine, as long as it is quick to develop with. Thank you.
Reply content:
This is not just a matter of parsing JS; you need a browser engine as well!
Recommended:
- QtWebKit: known to support Python and C++.
- PhantomJS: known to support JavaScript, CoffeeScript, and Python; it is also built on the WebKit engine.
- SlimerJS: known to support JavaScript; it uses the Gecko engine, the same as Firefox, and can also run on top of Firefox.
- CasperJS: known to support JavaScript; a higher-level wrapper around the previous two.
I don't think your problem necessarily calls for anything that heavy.
You know the content you want to capture is produced by JS, so where does that JS get it from? It is probably either embedded in the page itself or returned as JSON by an AJAX request.
Find the script or response that carries the content you need. If it is JSON, use a JSON parser; if it is JS, a regular expression is usually enough to extract it.
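A minimal Python sketch of both cases described above. The variable name productInfo and the sample strings are made up for illustration; a real page will use its own names, and the regular expression only works when the embedded object literal happens to be valid JSON.

```python
import json
import re


def parse_ajax_json(body: str) -> dict:
    """Case 1: the AJAX response body is already JSON, so just parse it."""
    return json.loads(body)


def extract_js_object(js_source: str, var_name: str) -> dict:
    """Case 2: the data sits inside JS as `var <name> = {...};`.

    Pull out the object literal with a regular expression and parse it.
    This assumes the literal is valid JSON (double-quoted keys, no
    trailing commas, no function calls).
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, js_source, re.DOTALL)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    return json.loads(match.group(1))


# Hypothetical sample inputs standing in for what a real site returns.
sample_ajax_body = '{"items": [{"id": 1, "title": "First post"}]}'
sample_page = '<script>var productInfo = {"name": "Widget", "price": 9.5};</script>'

posts = parse_ajax_json(sample_ajax_body)
product = extract_js_object(sample_page, "productInfo")
print(posts["items"][0]["title"])
print(product["name"], product["price"])
```

If the literal is not valid JSON (single quotes, unquoted keys), this approach breaks down and running the script in a real JS engine is the safer route.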
PhantomJS may be the best solution for you. CasperJS, which is built on PhantomJS, is also a useful tool for grabbing web page content created by JavaScript or AJAX.
Zookeeper node.js
From your description, it sounds like you want to scrape a page whose content is produced by JS, so fetching the page directly only gives you an empty shell. Is that right?
In that case, I recommend a "headless browser", of which PhantomJS is the best-known example. It is essentially a browser without a user interface. You drive it programmatically, it can interact with your external code, and it hands the final rendered HTML back to you.
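As a rough sketch of how that can look from Python: the code below writes a small PhantomJS script to a temporary file and runs it as a subprocess. It assumes a phantomjs binary is installed and on the PATH; the helper names build_phantomjs_command and fetch_rendered_html are made up for this example.

```python
import os
import subprocess
import tempfile

# PhantomJS script: open the URL given as the first argument and print
# the DOM as it stands once the page has loaded.
RENDER_SCRIPT = """
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        console.log('FAILED to load ' + system.args[1]);
        phantom.exit(1);
    } else {
        // Content filled in by later AJAX calls may need a delay
        // (for example setTimeout) before page.content is read.
        console.log(page.content);
        phantom.exit(0);
    }
});
"""


def build_phantomjs_command(script_path, url, binary="phantomjs"):
    """Return the argument list used to launch PhantomJS."""
    return [binary, script_path, url]


def fetch_rendered_html(url, binary="phantomjs", timeout=60):
    """Run PhantomJS on `url` and return the rendered HTML as a string.

    Assumes the `phantomjs` executable is available; raises
    subprocess.CalledProcessError if the page fails to load.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(RENDER_SCRIPT)
        script_path = f.name
    try:
        result = subprocess.run(
            build_phantomjs_command(script_path, url, binary),
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return result.stdout
    finally:
        os.remove(script_path)
```

Calling fetch_rendered_html("http://example.com/") would then return the HTML after the page's scripts have run, which you can pass to any ordinary HTML parser.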
Use Node.js directly and execute the returned script content in it.
In this situation I usually read the JS code myself, find the part I need, and reimplement it myself. In Java there seem to be libraries that can execute JS code. For example, when I simulated a login to Sina Weibo, I extracted the encryption function directly from the site's JS, executed it to obtain the result, and then sent the simulated request.
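To illustrate the "reimplement it myself" route, here is a small Python sketch. It assumes a hypothetical site whose login script encodes a form field as btoa(encodeURIComponent(value)); this is not taken from any real site's code, and an actual site's function has to be read from its own JS first.

```python
import base64
from urllib.parse import quote


def encode_login_field(value: str) -> str:
    """Python port of the hypothetical JS `btoa(encodeURIComponent(value))`.

    encodeURIComponent leaves the characters !'()* unescaped, so they
    are passed as `safe` to keep the output identical to the JS version.
    """
    escaped = quote(value, safe="!'()*")
    return base64.b64encode(escaped.encode("ascii")).decode("ascii")


# The encoded value would then go into the simulated login request.
encoded = encode_login_field("user@example.com")
print(encoded)
```

For simple transformations like this a port is quick to write; for heavier logic such as real encryption routines, executing the original JS in an engine, as described above, is usually less error-prone.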