Python web crawler (Sina news capture)

Source: Internet
Author: User
Tags: jupyter notebook


Preparations before crawling:

  • Install BeautifulSoup: pip install beautifulsoup4
  • Install requests: pip install requests
  • Install Jupyter Notebook: pip install jupyter
  • Install Python and configure the environment (Anaconda can be used, which ships with many Python modules)

 

JSON
  • Definition: a format for data exchange.
JavaScript Object
  • Definition: a JavaScript reference type.
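
A minimal sketch of the difference in Python (the string below is made-up sample data that mirrors the structure of the comment response used later in this article):

import json

# a JSON string: plain text in the data-exchange format
raw = '{"result": {"count": {"total": 42}}}'

# json.loads() converts it into a Python dict, which can then be indexed normally
jd = json.loads(raw)
print(jd['result']['count']['total'])  # 42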

 

In addition to 'utf-8', common web page encodings include 'gbk', 'gb2312', 'ISO-8859-1', etc.; the response encoding must match the page, or the text will be garbled.

 

Use requests to obtain webpage information
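
A minimal sketch of the fetch step, assuming the sample Sina article used later in this article (any news page would do):

import requests

# fetch the page and set the encoding explicitly so the Chinese text decodes correctly
res = requests.get('http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml')
res.encoding = 'utf-8'
print(res.text[:200])  # first 200 characters of the raw HTML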

You can then use BeautifulSoup to parse the web page into an object that can be queried.

soup = BeautifulSoup(res.text, 'html.parser')
# convert the webpage information obtained by requests into a soup object, and specify 'html.parser' as its parser; otherwise a warning is reported

 

Use the select method in BeautifulSoup to obtain the corresponding elements. The retrieved elements are returned as a list, so you can use a for loop to process them one by one.

alink = soup.select('h1')

for link in alink:
    print(link.text)

 

After obtaining an HTML tag, you can use ['href'] to obtain the value of its 'href' attribute, as shown below:

for link in soup.select('a'):
    print(link['href'])

 

Obtain the news number:

* strip() removes leading and trailing whitespace; passing a string removes the specified characters instead. rstrip() only strips from the right and lstrip() only from the left.

* split('/') splits a string on the specified character; combining the two extracts the news number, as shown in the sketch below.
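
A minimal sketch of extracting the news number with split() and strip(), assuming the sample article URL used elsewhere in this article (note that strip()/lstrip()/rstrip() treat their argument as a set of characters, so the exact arguments depend on the URL pattern):

newsurl = 'http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml'

# take the last path segment, then strip the fixed suffix and prefix characters
newsid = newsurl.split('/')[-1].rstrip('.shtml').lstrip('doc-i')
print(newsid)  # fyihrmf3191202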

 

Use of the re regular expression:

import re

m = re.search('doc-i(.*).shtml', newsurl)  # search returns a match object
print(m.group(1))  # group(0) returns the whole matched string; group(1) returns only the part inside the parentheses

 

Use the for loop to obtain multi-page links of news

url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2=gdxw1|=gatxw|=zs-pl|=mtjj&level=1|=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'

for i in range(1, 10):
    print(url.format(i))
# format() fills the braces in the url (the page number we want to change is written as {}) with the value we pass in (such as i above)

 

Obtain the news release time:

  The retrieved element may contain parts we don't need, such as the news source (publisher). You can use .contents to split the element's children into a list and take contents[0] to get the part we want.

# obtain the publishing time
from datetime import datetime

res = requests.get('http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
timesource = soup.select('.time-source')
print(timesource[0].contents[0])

  Time String Conversion

# string to time: strptime
dt = datetime.strptime(timesource[0].contents[0].strip(), '%Y年%m月%d日%H:%M')

# time to string: strftime
dt.strftime('%Y-%m-%d')

 

Obtain News Text:

  Inspect its class and obtain the news text with a select statement as above. The result is a list, so you can use a for loop to strip the tags from each paragraph and append its text to a list you created (such as article = []).

* '\n'.join(article) joins each item in the article list with the newline character '\n'.

# obtain the content of a single news article
article = []
for p in soup.select('.article p'):
    article.append(p.text.strip())
print('\n'.join(article))

The code for getting a single news article can be completed in one line:

# one-line version
print('\n'.join([p.text.strip() for p in soup.select('.article p')]))

 

Get the comment count (when obtaining the number of comments, you will find that they are returned to the browser as JavaScript; therefore, you must first parse the obtained content as JSON so it can be read as a Python dictionary).

# obtain the number of comments
import requests
import json

comment = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyihrmf3218511&group=&compress=0&ie=UTF-8&oe=UTF-8&page=1&page_size=20')  # fetch the comment API for this article
comment.encoding = 'utf-8'
jd = json.loads(comment.text.strip('var data='))  # remove the 'var data=' wrapper before parsing the JSON
jd['result']['count']['total']

 

Complete code (taking Sina news as an example):

# obtain the news title, content, time and comment count
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import json
import pandas

def getNewsdetial(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsTitle = soup.select('.page-header h1')[0].text.strip()
    nt = datetime.strptime(soup.select('.time-source')[0].contents[0].strip(), '%Y年%m月%d日%H:%M')
    newsTime = datetime.strftime(nt, '%Y-%m-%d %H:%M')
    newsArticle = getnewsArticle(soup.select('.article p'))
    newsAuthor = newsArticle[-1]
    return newsTitle, newsTime, newsArticle, newsAuthor

def getnewsArticle(news):
    newsArticle = []
    for p in news:
        newsArticle.append(p.text.strip())
    return newsArticle

# get the comment count
def getCommentCount(newsurl):
    m = re.search('doc-i(.*).shtml', newsurl)
    newsid = m.group(1)
    commenturl = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=UTF-8&oe=UTF-8&page=1&page_size=20'
    comment = requests.get(commenturl.format(newsid))  # the {} in the url is filled with newsid via format()
    jd = json.loads(comment.text.lstrip('var data='))
    return jd['result']['count']['total']

def getNewsLinkUrl():
    # obtain the asynchronously loaded news addresses (i.e. the news addresses of every page)
    urlFormat = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2=gdxw1|=gatxw|=zs-pl|=mtjj&level=1|=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'
    url = []
    for i in range(1, 10):
        res = requests.get(urlFormat.format(i))
        jd = json.loads(res.text.lstrip('newsloadercallback(').rstrip(');'))
        url.extend(getUrl(jd))  # note the difference between extend and append
    return url

def getUrl(jd):
    # obtain the news addresses on each page
    url = []
    for i in jd['result']['data']:
        url.append(i['url'])
    return url

# obtain the news time, editor, content, title and comment count, and integrate them in total_2
def getNewsDetial():
    title_all = []
    author_all = []
    commentCount_all = []
    article_all = []
    time_all = []
    url_all = getNewsLinkUrl()
    for url in url_all:
        title_all.append(getNewsdetial(url)[0])
        time_all.append(getNewsdetial(url)[1])
        article_all.append(getNewsdetial(url)[2])
        author_all.append(getNewsdetial(url)[3])
        commentCount_all.append(getCommentCount(url))
    total_2 = {'a_title': title_all, 'b_article': article_all, 'c_commentCount': commentCount_all, 'd_time': time_all, 'e_editor': author_all}
    return total_2

# (starting point) use the pandas module to process the data and save it as an excel document
df = pandas.DataFrame(getNewsDetial())
df.to_excel('news2.xlsx')

 

The stored Excel file (news2.xlsx) contains a column each for title, article, comment count, time, and editor.

 

 

TIPS:

Problem: an import error may occur when importing pandas in Jupyter Notebook.

Solution: do not open Jupyter Notebook from the command line; launch the application directly or open it from Anaconda Navigator.

