Crawling NetEase news comments with Python
Shortly after I started learning Python, every page I had crawled so far had its content right in the page source. Then I looked at NetEase news comments and found that the comments are loaded separately, in JSON format...
The page crawled is the comments page for Xi Jinping's 2015 visit to the UK: http://comment.news.163.com/news_guonei8_bbs/SPEC0001B60046CG.html
The procedure is as follows:
1. Use Google Chrome to analyze what data the first comment page loads
Open the page, press F12, and click the Network tab; at first the panel is blank.
After refreshing, it looks like this (I had already loaded some pages before, so not all of the JSON entries are shown):
Click one of the JSON-format files, find its URL, and open that URL in the browser to check whether it holds the data you want:
On my first visit to the page there were three JSON entries; only one of them was the right one, and the first page's URL was:
http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html?_=14455959217790
The data is:
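Going by the parsing code in section 4, the response is a JavaScript assignment wrapping JSON, roughly shaped like the sketch below (the field names are the ones the code reads; this is a sketch of the structure, not the exact payload):

var replyData={
    "hotPosts": [
        {"1": {"f": "user id", "b": "comment text", "v": "up-vote count"}},
        ...
    ]
};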
2. Other comment pages
When you click through to another comment page, first press the Clear button in the Network panel, then look for the JSON file again. From the second page on, the requests look much the same.
Find that file's URL and open it in the browser.
Although the data shows up garbled in the browser, it displays fine once read in Python.
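To check this quickly, you can fetch the page-2 URL in Python (a minimal sketch; the URL and User-Agent are the ones used by the crawler in section 4):

import urllib2

url = 'http://comment.news.163.com/cache/newlist/news_guonei8_bbs/SPEC0001B60046CG_2.html'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, headers=headers)
data = urllib2.urlopen(request).read()
print data[:200]  # starts with "var newPostList=..." and reads fine as text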
3. URL pattern
At first I thought the number at the end of the URL (the ?_=14455959217790 part) followed some rule, but later I found that removing it doesn't affect the request at all. So you only need to substitute in the page number of the comment page you want (for some reason I could only open 34 pages??).
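Building all the comment-page URLs is then just string substitution (a small sketch; page 1 is served from the data/ path found earlier, later pages from the cache/newlist/ path used in the code below):

def comment_url(page):
    # page 1 lives at a different path than pages 2 and up
    if page == 1:
        return 'http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html'
    return 'http://comment.news.163.com/cache/newlist/news_guonei8_bbs/SPEC0001B60046CG_%d.html' % page

urls = [comment_url(page) for page in range(1, 35)]  # the 34 pages that open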
4. Code
Note: because the returned data starts with a variable name and ends with a semicolon, calling json.loads(data) directly raises an error, so the data has to be cleaned up first.
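For example, with a stub string shaped like the real response (the actual cleanup is done in strDeal in the code below):

import json

raw = 'var newPostList={"newPosts": []};'      # stub shaped like the real response
cleaned = raw.replace('var newPostList=', '')   # drop the leading variable name
cleaned = cleaned[:-1]                          # drop the trailing semicolon
value = json.loads(cleaned)                     # now parses into a normal dict
print value['newPosts']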
# encoding=utf-8

import urllib2
import json
import re
import time

class JSON():
    def __init__(self):
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        # first comment page (the ?_=... suffix found earlier can be dropped)
        self.url1 = 'http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html'

    def getUrls(self, pageIndex):
        url2 = 'http://comment.news.163.com/cache/newlist/news_guonei8_bbs/SPEC0001B60046CG_' + str(pageIndex) + '.html'
        return url2

    def getHtml(self, url):
        try:
            request = urllib2.Request(url, headers=self.headers)
            respone = urllib2.urlopen(request)
            html = respone.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u"connection failed", e.reason
            return None

    # clean the string; alternatively, open the saved file and process it there
    def strDeal(self, data, pageIndex):
        if pageIndex == 1:
            data = data.replace('var replyData=', '')
        else:
            data = data.replace('var newPostList=', '')
        reg = re.compile("&nbsp;\[<a href=''>")
        data = reg.sub('--', data)
        reg2 = re.compile('<\\\/a>\]')  # matches <\/a>]
        data = reg2.sub('', data)
        reg3 = re.compile('<br>')
        data = reg3.sub('', data)
        return data

    # parse the json data and save it to a file
    def parserJson(self):
        with open('wangyi2.txt', 'a') as f:
            f.write('user id' + '|' + 'comments' + '|' + 'thumb ups' + '\n')
        for i in range(1, 35):
            if i == 1:
                url = self.url1
                data = self.getHtml(url)
                data = self.strDeal(data, i)[:-1]
                value = json.loads(data)
                f = open('wangyi2.txt', 'a')
                for item in value['hotPosts']:
                    f.write(item['1']['f'].encode('utf-8') + '|')
                    f.write(item['1']['b'].encode('utf-8') + '|')
                    f.write(item['1']['v'].encode('utf-8') + '\n')
                f.close()
                print 'sleeping pageload %d/34' % i
                time.sleep(6)
            else:
                url = self.getUrls(i)
                data = self.getHtml(url)
                data = self.strDeal(data, i)[:-2]
                # data is a str; json.loads() converts it to a dict that can be accessed normally
                value = json.loads(data)
                f = open('wangyi2.txt', 'a')
                for item in value['newPosts']:
                    f.write(item['1']['f'].encode('utf-8') + '|')
                    f.write(item['1']['b'].encode('utf-8') + '|')
                    f.write(item['1']['v'].encode('utf-8') + '\n')
                f.close()
                print 'sleeping pageload %d/34' % i
                time.sleep(6)

js = JSON()
js.parserJson()
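Two design notes: the output file is opened in append mode ('a'), so rerunning the script adds to wangyi2.txt instead of overwriting it, and the time.sleep(6) between pages keeps the requests slow enough not to hammer the server.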