Python crawler: capturing Sina news data
Case 1
Capture object:
Sina domestic news (http://news.sina.com.cn/china/): the title, time, and link of each item in the news list.
Complete code:
1 from bs4 import BeautifulSoup
2 import requests
3 
4 url = 'http://news.sina.com.cn/china/'
5 web_data = requests.get(url)
6 web_data.encoding = 'utf-8'
7 soup = BeautifulSoup(web_data.text,'lxml')
8 
9 for news in soup.select('.news-item'):
10     if(len(news.select('h2')) > 0):
11         h2 = news.select('h2')[0].text
12         time = news.select('.time')[0].text
13         a = news.select('a')[0]['href']
14         print(h2,time,a)
Running result (only part shown):
Detailed explanation:
1. First, import the required libraries, BeautifulSoup and requests, then request and parse the webpage. After parsing, print the result to confirm that it parsed correctly.
1 from bs4 import BeautifulSoup
2 import requests
3 
4 url = 'http://news.sina.com.cn/china/'
5 web_data = requests.get(url)
6 soup = BeautifulSoup(web_data.text,'lxml')
7 print(soup)
At this point we can see that the parsed webpage contains a lot of garbled characters, i.e. it was not decoded correctly. Looking at the result, near the beginning there is this line:
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
charset=utf-8 indicates that the page content is encoded in UTF-8, so we need to set the response encoding accordingly before parsing in order to get readable content.
1 from bs4 import BeautifulSoup
2 import requests
3 
4 url = 'http://news.sina.com.cn/china/'
5 web_data = requests.get(url)
6 web_data.encoding = 'utf-8'
7 soup = BeautifulSoup(web_data.text,'lxml')
8 print(soup)
2. After parsing the webpage, we start to capture the content we need. First, a little background knowledge.
Take a look at the first line of the code below, soup.select('.news-item'), which retrieves elements matching a specific CSS selector:
- to find all elements whose class is news-item, put a period (.) before the class name;
- to find all elements whose id is artibodyTitle, put a pound sign (#) before the id name.
In addition, to select HTML elements by tag name, write the tag name directly inside select, for example news.select('h2') in row 3 of the for loop below. A short sketch of the three selector forms follows this explanation.
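For reference, here is a minimal sketch of the three selector forms, assuming the soup object parsed above (note that #artibodyTitle only exists on the article detail page used in Case 2, so on this list page it simply returns an empty list):

items = soup.select('.news-item')      # by class: period before the class name
title = soup.select('#artibodyTitle')  # by id: pound sign before the id name (Case 2 page)
headings = soup.select('h2')           # by tag: write the tag name directly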
1 for news in soup.select('.news-item'):
2     # print(news)
3     if(len(news.select('h2')) > 0):
4         # print(news.select('h2')[0].text)
5         h2 = news.select('h2')[0].text
6         time = news.select('.time')[0].text
7         a = news.select('a')[0]['href']
8         print(h2,time,a)
Now let's look more closely at what each row of this code means.
Row 1: soup.select('.news-item') retrieves the elements whose class is news-item;
Row 2: print news to check that the parsing is correct; comment it out when it is not needed;
Row 3: observing the page code, the title is stored in the h2 tag; checking that the length of news.select('h2') is greater than 0 skips items with an empty title;
Row 4: print news.select('h2')[0].text to check it; [0] takes the first element of the list and .text extracts its text; comment it out when it is not needed;
Row 5: store news.select('h2')[0].text in the variable h2;
Row 6: time is a class, so it is written with a leading dot; as above, its text is stored in the variable time;
Row 7: the link we want is stored in the a tag; a link is an attribute rather than text, so ['href'] is used to get it and store it in the variable a;
Row 8: print the data we want to capture: the title, time, and link.
Case 2
Capture object:
Capture the title, time (with format conversion), news source, news body, responsible editor, number of comments, and news ID of the Ministry of Land and Resources article: a daily report system for geological disasters during the flood season, in effect from May to September.
Running result (only part shown):
17:21:00
CCTV News
Original title: Ministry of Land and Resources: daily report system for geological disasters during the flood season
According to the Ministry of Land and Resources, the country will gradually enter the period of high incidence of geological disasters this year, and the situation for disaster prevention and mitigation will become more severe. According to the forecast of the China Meteorological Administration, in May precipitation in southern China, East China, and Northwest China will be heavier than in the same period of normal years, so prevention of geological disasters such as landslides and debris flows induced by extreme weather events must be strengthened. The emergency office of the Ministry of Land and Resources has implemented a daily report system for geological disasters during the flood season from May to September: every locality must report the daily disaster situation and the main prevention work to the emergency office of the department of land and resources before 3:00 PM each day.
Li Weishan
4
Fyeycfp9425908
Detailed explanation:
1. First, import the required libraries: BeautifulSoup, requests, datetime (time handling), json (decoding: turning JSON strings into Python objects), and re (regular expressions), then parse the webpage.
Capture the title first:
1 from bs4 import BeautifulSoup
2 import requests
3 from datetime import datetime
4 import json
5 import re
6 
7 url = 'http://news.sina.com.cn/c/nd/2017-05-08/doc-ifyeycfp9368908.shtml'
8 web_data = requests.get(url)
9 web_data.encoding = 'utf-8'
10 soup = BeautifulSoup(web_data.text,'lxml')
11 title = soup.select('#artibodyTitle')[0].text
12 print(title)
datetime, json, and re are used later for time conversion and for capturing data from JS.
Now look at the second-to-last line of code: title = soup.select('#artibodyTitle')[0].text. The usage is the same as in Case 1: the pound sign (#) before the id locates the element, [0] takes the single matched element, and .text extracts its text.
2. Capture the time and convert the original date format to the standard format:
1 # time = soup.select('.time-source')[0]
2 # print(time)
3 time = soup.select('.time-source')[0].contents[0].strip()
4 dt = datetime.strptime(time,'%Y年%m月%d日%H:%M')
5 print(dt)
Row 1: capture the time element (commented out here);
Row 2: print time; the result contains both the time and the news source, as shown below:
<span class="time-source" id="navtimeSource">2017年05月08日17:21<span><span data-sudaclick="media_name"><a href="http://m.news.cctv.com/2017/05/08/ARTIPEcvpHjWzuGDPWQhn77z170508.shtml" rel="nofollow" target="_blank">CCTV News</a></span></span></span>
So we need a way to separate the time from the source, and for that we use contents;
Row 3: we append .contents; after running, the content above is split into the list elements shown below. We take the first element by adding [0] after contents, and the final .strip() removes the \t characters at the end of the time string;
['2017年05月08日17:21\t\t', <span><span data-sudaclick="media_name"><a href="http://m.news.cctv.com/2017/05/08/ARTIPEcvpHjWzuGDPWQhn77z170508.shtml" rel="nofollow" target="_blank">CCTV News</a></span></span>, '\n']
Row 4: datetime.strptime() parses the time. The original string is written as year, month, day, hour and minute (2017年05月08日17:21), so the format string mirrors it: %Y is the four-digit year, %m the month, %d the day of the month, %H the hour in 24-hour format, and %M the minutes; the result is stored in the variable dt;
Row 5: print dt, which outputs the parsed time, for example 2017-05-08 17:21:00.
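If you want the parsed time back as a string in a specific format, a minimal sketch (assuming the dt parsed above) uses strftime:

# dt is the datetime obtained from strptime above
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # e.g. 2017-05-08 17:21:00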
3. Capture the news source:
As mentioned in the earlier article "Python crawler: crawling data from Everyone Is a Product Manager", you can use Copy selector in the browser to copy the position of the news source, as in the first line below; or you can use the class notation used throughout this article to describe its location, as in the second line.
1 # source = soup.select('#navtimeSource > span > span > a')[0].text
2 source = soup.select('.time-source span span a')[0].text
3 print(source)
4. Capture news details:
1 article = []
2 for p in soup.select('#artibody p')[:-1]:
3     article.append(p.text.strip())
4 # print(article)
5 print('\n'.join(article))
Row 1: create article as an empty list;
Row 2: observing the page code, the news body is stored in p tags under #artibody. If everything is output directly, the last paragraph is the "responsible editor" line; since we do not want it here, [:-1] drops that last element;
Row 3: append each captured paragraph to the article list; strip() removes surrounding whitespace;
Row 4: print the result to check that it is correct. Printing the list puts the original title and the body paragraphs on one line, which does not read well, so we comment it out and use another method;
Row 5: join() concatenates the elements of a list with the specified separator into a new string; here the \n newline joins the original title and the body paragraphs in the list.
That is one way to assemble the article, but there is also a more compact way to write it. Take the for loop above together with p.text.strip(), wrap the whole thing in brackets to form a list comprehension, then join that list: the multi-line code above collapses into a single line.
1 print('\n'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]]))
5. Capture the responsible editor:
Here lstrip is used to remove content on the left. The '责任编辑:' ('responsible editor:') inside the parentheses is the part to be removed, so only the editor's name is kept. strip() trims both sides, lstrip() trims the left side, and rstrip() trims the right side.
1 editor = soup.select('.article-editor')[0].text.lstrip('责任编辑:')
2 print(editor)
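As a quick aside, here is a small sketch of the three methods on made-up strings. Note that the argument is treated as a set of characters to strip, not as an exact prefix or suffix, which is why it works for fixed labels like the one above:

# hypothetical sample strings for illustration only
print('  spaces on both sides  '.strip())             # trims whitespace from both ends
print('责任编辑:王小明'.lstrip('责任编辑:'))           # trims those characters from the left -> 王小明
print('doc-ifyeycfp9368908.shtml'.rstrip('.shtml'))   # trims those characters from the right -> doc-ifyeycfp9368908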
6. Capture the number of comments:
1 # comments = soup.select('#commentCount1')
2 # print(comments)
3 comments = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyeycfp9368908&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')
4 # print(comments.text)
5 comments_total = json.loads(comments.text.strip('var data='))
6 # print(comments_total)
7 print(comments_total['result']['count']['total'])
First we try select('#commentCount1') to pick out the number of comments, but the printed result is [<span id="commentCount1"></span>]: the span is empty, and the count we want is not there.
So we inspect the page again and find that the comment count is most likely filled in by JavaScript, which means we need to find where that JavaScript (JS) request is made.
Place the cursor over the comment count (4), right-click in Chrome and choose "Inspect", switch to the Network tab at the top, and look through the large number of requests for the one that returns the count of 4.
Copy that link. The part in the middle is the newsid, and at the end there is a timestamp parameter "&jsvar=loader_1494295822737_91802706"; we can remove this part, as shown in the sketch after the link, and printing shows that the result is unaffected.
http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyeycfp9368908&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1494295822737_91802706
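A small sketch of trimming the timestamp parameter off the copied link (raw_link is a hypothetical variable name holding the link copied from the Network tab):

# hypothetical variable holding the link copied from the Network tab
raw_link = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyeycfp9368908&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1494295822737_91802706'
comment_url = raw_link.split('&jsvar=')[0]  # drop the '&jsvar=...' timestamp part
print(comment_url)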
After tidying up the link, how do we read the data returned by this JS request? This is where json comes in: we already imported it at the beginning together with the other libraries, so it can be used directly here.
json.loads decodes JSON data. Row 5: we strip the 'var data=' prefix and store the decoded data in the variable comments_total. Row 6: printing this variable shows that besides the number of comments it also contains other information.
So we output it once more. From the printed result we can see that the comment count sits at ['result']['count']['total'], which is what the last row prints.
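Here is a minimal, self-contained sketch of that decoding step; the response body below is a made-up, heavily shortened stand-in for what the real interface returns:

import json

# hypothetical, shortened response body for illustration only
js_text = 'var data={"result": {"count": {"total": 4}}}'
data = json.loads(js_text.strip('var data='))  # drop the JS prefix, decode the JSON
print(data['result']['count']['total'])        # -> 4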
7. Capture the news ID:
1 news_url = url  # the article URL requested earlier
2 print(news_url.split('/')[-1].rstrip('.shtml').lstrip('doc-i'))
When we captured the number of comments in step 6, we found a newsid in that link, and the same ID also appears in the link of the news page (news_url above). That tells us where the news ID sits.
Split the URL on '/', take the last segment, remove '.shtml' on the right with rstrip, and remove 'doc-i' on the left with lstrip to get the news ID we want.
Besides this method, we can also use a regular expression. This needs the re library, which we already imported at the beginning, so it can be used directly here.
group(1) returns what the (.+) group matched, which is exactly the news ID we want.
1 news_id = re.search('doc-i(.+).shtml',news_url)
2 # print(news_id.group(0))
3 print(news_id.group(1))
Operating environment: Python 3.6; PyCharm 2016.2; computer: Mac.
----- End -----
Author: Du Wangdan, public account: Du Wangdan; Internet product manager.