This article explains how to use Python 3 to crawl content that is loaded by JavaScript. It has some reference value; readers who need it can refer to it.
First, when writing a crawler, the content you need may be inserted into the page by JavaScript, so an ordinary request comes back empty. For example, the comment count on a Sina News article cannot be obtained with the ordinary approach.
Example of the ordinary GET approach:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/c/nd/2017-06-12/doc-ifyfzhac1650783.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Get the comment count
commentCount = soup.select_one('#commentCount1')
print(commentCount.text)
The result obtained here is empty, because the content is not in the static HTML; it is stored in a JS file.
So we need to find the JS request that holds the comment content. Searching the network requests in the browser's developer tools, we find that it is stored in a particular JS file.
Putting that content into a JSON data viewer, we find that the total comment count and the comments themselves are stored in the JS file in JSON format.
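If you prefer to inspect the data in code rather than in an online JSON viewer, the following minimal sketch pretty-prints the response. It uses the comment API URL found in the developer tools (the same URL as in the code example further below); the removal of the "var data=" prefix is explained afterwards.

import json
import requests

# Fetch the comment data (URL taken from the request seen in the developer tools)
url = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
       '&channel=gn&newsid=comos-fyfzhac1650783')
resp = requests.get(url)
resp.encoding = 'utf-8'

# The body looks like "var data={...}"; drop the JS prefix, then pretty-print
data = json.loads(resp.text.strip('var data='))
print(json.dumps(data, indent=2, ensure_ascii=False))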
In the request headers panel we can see the JS file's access path (URL) and the request method.
Code example:
import requests
import json

comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyfzhac1650783')
comments.encoding = 'utf-8'
print(comments.text)  # inspect the raw response
# Remove the leading "var data=" so the rest becomes valid JSON
jd = json.loads(comments.text.strip('var data='))
print(jd['result']['count']['total'])
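As a side note, the newsid in the comment URL (comos-fyfzhac1650783) appears to be derived from the article URL (doc-ifyfzhac1650783). The following is only a sketch based on that observation; the assumption that every article follows this doc-i.../comos-... pattern is not verified here.

import re

def comment_api_url(article_url):
    # Assumed pattern: an article URL ".../doc-i<ID>.shtml" maps to newsid "comos-<ID>"
    match = re.search(r'doc-i(\w+)\.shtml', article_url)
    if match is None:
        raise ValueError('article URL does not match the expected pattern')
    return ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
            '&channel=gn&newsid=comos-' + match.group(1))

print(comment_api_url('http://news.sina.com.cn/c/nd/2017-06-12/doc-ifyfzhac1650783.shtml'))
# -> ...newsid=comos-fyfzhac1650783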
Note: this explains why we need to remove "var data=". The response string starts with the prefix "var data=", so it is not valid JSON; that prefix has to be removed from the response content before converting it.
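Be aware that str.strip() removes any leading and trailing characters that appear in the given character set, not the literal prefix; it happens to work here but can surprise you elsewhere. A slightly more defensive sketch, continuing from the code example above (where comments is the response object):

raw = comments.text
# Remove the leading "var data=" only if it is actually present,
# instead of relying on strip()'s character-set behaviour
prefix = 'var data='
if raw.startswith(prefix):
    raw = raw[len(prefix):]
jd = json.loads(raw)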
Why do we use jd['result']['count']['total'] to get the comment count? Because the total is nested inside the 'result' object, then inside its 'count' object, under the key 'total', as the sketch below illustrates.
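This is a sketch of the assumed layout of the parsed data; only the result -> count -> total path is confirmed by the code above, the other fields are purely illustrative.

# Assumed shape of the parsed response; only ['result']['count']['total']
# is confirmed by the code above, the rest is illustrative
jd = {
    'result': {
        'count': {
            'total': 1234,   # total number of comments
        },
        # ... other fields, such as the comment list ...
    }
}

# Walk down the nesting level by level
print(jd['result']['count']['total'])        # -> 1234
# A safer variant that does not raise KeyError if a level is missing
print(jd.get('result', {}).get('count', {}).get('total'))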