Python crawls NetEase news comments


I started learning Python not long ago, and the pages I had crawled so far all had their content directly in the HTML source. When I looked at NetEase news comments, though, I found that the comments are loaded separately as JSON.

The page crawled here is the comment page for Xi Jinping's 2015 visit to the UK: http://comment.news.163.com/news_guonei8_bbs/SPEC0001B60046CG.html

The procedure is as follows:

1. Use Chrome's developer tools to analyze the data loaded by the page (first page of comments)

Open the page, press F12, and click the Network tab. It is blank at first.

After refreshing, the requests appear (I had loaded some pages before this, so not all of the JSON requests are shown).

Click one of the JSON requests, find its URL, and open it in the browser to check whether it returns the data you want.

On my first visit, three JSON requests showed up, and only one of them contained the comment data. The URL for the first page was:

http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html?_=14455959217790

The response contains the comment data.
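As a quick check, here is a minimal sketch (Python 2, the same libraries as the full script in section 4) that fetches this URL and peeks at the raw response. The trailing ?_=... timestamp parameter is dropped, since, as section 3 notes, removing it makes no difference:

# encoding=utf-8
# Sketch: fetch the first comment page and peek at the raw response.
import urllib2

url = 'http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html'
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
data = urllib2.urlopen(request).read()
print data[:40]  # begins with a JavaScript assignment like "var replyData=...", not bare JSON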

2. Other comment pages

To find the requests for the other comment pages, first click the Clear button in the Network panel, then click through to another comment page and look for the new JSON request. From the second page onward the pattern is almost the same.

Find that request's URL and open it in the browser.

The data looks garbled in the browser, but when read in Python it displays normally.
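A quick way to confirm this (a sketch; it assumes the payload is UTF-8, which matches the fact that the full script in section 4 passes the raw bytes straight to json.loads(), whose default for byte strings is UTF-8):

import urllib2

# Sketch: decode the second comment page explicitly instead of letting
# the browser guess the charset. UTF-8 is an assumption, as noted above.
url = 'http://comment.news.163.com/cache/newlist/news_guonei8_bbs/SPEC0001B60046CG_2.html'
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
print urllib2.urlopen(req).read().decode('utf-8')[:200]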

 

3. URL rules

At first I thought the string tacked onto the end of the URL (the ?_=14455959217790 part) followed some rule; later I found that removing it does not affect the request at all,

so you only need to swap the page number into the URL to get the corresponding comment page (for some reason I could only open 34 pages).
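In other words, building the URL for page n is plain string substitution. A small sketch of the rule, mirroring the getUrls method in the code below (SPEC0001B60046CG is this article's thread id; page 1 uses the data/.../df/..._1.html URL from section 1 instead):

def page_url(page_index):
    # pages 2 and up follow the cache/newlist pattern found in section 2
    return ('http://comment.news.163.com/cache/newlist/'
            'news_guonei8_bbs/SPEC0001B60046CG_%d.html' % page_index)

print page_url(2)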

4. Code

Note: because the returned data starts with a variable name and ends with a semicolon, calling json.loads(data) on it directly raises an error, so the data is cleaned up first.
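The cleanup is essentially stripping that prefix and the trailing semicolon. A minimal sketch of the idea (the strDeal method in the full script also scrubs some embedded HTML with regexes; the sample string here is a made-up stand-in with the real shape):

import json

raw = 'var replyData={"hotPosts": []};'  # hypothetical sample in the real shape
cleaned = raw.replace('var replyData=', '').rstrip(';')
value = json.loads(cleaned)  # now a plain dict
print value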

# encoding=utf-8
import urllib2
import json
import re
import time

class JSON():
    def __init__(self):
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        # first page: the data/.../df/..._1.html URL found in section 1
        self.url1 = 'http://comment.news.163.com/data/news_guonei8_bbs/df/SPEC0001B60046CG_1.html'

    # pages 2 and up follow the cache/newlist pattern from section 2
    def getUrls(self, pageIndex):
        url2 = ('http://comment.news.163.com/cache/newlist/'
                'news_guonei8_bbs/SPEC0001B60046CG_' + str(pageIndex) + '.html')
        return url2

    def getHtml(self, url):
        try:
            request = urllib2.Request(url, headers=self.headers)
            respone = urllib2.urlopen(request)
            html = respone.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u"connection failed", e.reason
            return None

    # clean the string (if not done here, save to file and process it there):
    # strip the leading variable assignment, then scrub embedded HTML
    def strDeal(self, data, pageIndex):
        if pageIndex == 1:
            data = data.replace('var replyData=', '')
        else:
            data = data.replace('var newPostList=', '')
        reg = re.compile("&nbsp;\[<a href=''>")
        data = reg.sub('--', data)
        reg2 = re.compile('<\\\/a>\]')  # matches the literal <\/a>]
        data = reg2.sub('', data)
        reg3 = re.compile('<br>')
        data = reg3.sub('', data)
        return data

    # parse the JSON data and save it to a file
    def parserJson(self):
        with open('wangyi2.txt', 'a') as f:
            f.write('user id' + '|' + 'comment' + '|' + 'upvotes' + '\n')
        for i in range(1, 35):
            if i == 1:
                url = self.url1
                data = self.getHtml(url)
                data = self.strDeal(data, i)[:-1]  # drop the trailing semicolon
                # data is a str; json.loads() converts it to a dict,
                # which can then be accessed normally
                value = json.loads(data)
                f = open('wangyi2.txt', 'a')
                for item in value['hotPosts']:
                    f.write(item['1']['f'].encode('utf-8') + '|')
                    f.write(item['1']['b'].encode('utf-8') + '|')
                    f.write(item['1']['v'].encode('utf-8') + '\n')
                f.close()
                print 'sleeping pageload %d/34' % i
                time.sleep(6)
            else:
                url = self.getUrls(i)
                data = self.getHtml(url)
                data = self.strDeal(data, i)[:-2]
                value = json.loads(data)
                f = open('wangyi2.txt', 'a')
                for item in value['newPosts']:
                    f.write(item['1']['f'].encode('utf-8') + '|')
                    f.write(item['1']['b'].encode('utf-8') + '|')
                    f.write(item['1']['v'].encode('utf-8') + '\n')
                f.close()
                print 'sleeping pageload %d/34' % i
                time.sleep(6)

js = JSON()
js.parserJson()
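The script above targets Python 2 (urllib2, print statements). On Python 3, the fetching part would look roughly like this instead; urllib2 was split into urllib.request, and the explicit decode assumes UTF-8 as discussed earlier:

from urllib.request import Request, urlopen

def get_html(url):
    # Python 3 counterpart of getHtml above
    req = Request(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    return urlopen(req).read().decode('utf-8')  # UTF-8 is an assumption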

 
