Crawling AFCCL News from Sina Sports
1. Task objectives:
Crawl AFCCL (AFC Champions League) articles from Sina News, extracting each article's title, time, source, content, and comment count.
2. Target webpage:
http://sports.sina.com.cn/z/AFCCL/
3. Webpage analysis:
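On the list page, each story sits inside a div with class news-item, containing an h2 title, a link, and a span with class time; the script below relies on exactly this structure. The fragment here is a made-up miniature of that layout (the URL and texts are illustrative, not real Sina content), just to show how the BeautifulSoup selectors behave:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the structure the crawler assumes on the
# list page: each story is a div.news-item with an <h2>, an <a> link,
# and a span.time.
html_doc = '''
<div class="news-item">
  <h2><a href="http://sports.sina.com.cn/doc-ifyk0001.shtml">Match report</a></h2>
  <span class="time">September 3</span>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
for news in soup.select('.news-item'):
    # Some .news-item blocks on the real page carry no <h2>, so guard first
    if len(news.select('h2')) > 0:
        title = news.select('h2')[0].text.strip()
        link = news.select('a')[0]['href']
        time = news.select('.time')[0].text.strip()
        print(time, title, link)
```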
4. Source Code:
#!/usr/bin/env python
# coding: utf-8
import json
import re

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://sports.sina.com.cn/z/AFCCL/'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'html.parser')

    a_list = []
    # Crawl the news time, title, and link from the list page
    for news in soup.select('.news-item'):
        if len(news.select('h2')) > 0:
            h2 = news.select('h2')[0].text
            a = news.select('a')[0]['href']
            time = news.select('.time')[0].text
            # print(time, h2, a)
            a_list.append(a)

    # Crawl each article page
    for i in range(len(a_list)):
        url = a_list[i]
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'html.parser')
        # Obtain the title, time, source, and content of the article
        title = soup.select('#j_title')
        if title:
            title = title[0].text.strip()
            time = soup.select('.article-a_time')[0].text.strip()
            source = soup.select('.article-a_source')[0].text.strip()
            content = soup.select('.article-a_content')[0].text.strip()
            # The comment count is loaded by Ajax; build that URL dynamically, e.g.:
            # http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=ty&newsid=comos-fykiuaz1429964&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1504416797470_64712661
            pattern_id = r'(fyk\w*).s?html'
            # print(re.search(pattern_id, url).group(1))
            id = re.search(pattern_id, url).group(1)
            url = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
                   '&channel=ty&newsid=comos-' + id +
                   '&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')
            comments = requests.get(url)
            jd = json.loads(comments.text.strip('var data='))
            commentCount = jd['result']['count']['total']  # comment count
            print(time, title, source, content)
            print(commentCount)
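The key step above is recovering the news id from the article URL and splicing it into the comment-service URL. This can be checked without any network access; the article URL below is a hypothetical example in the same shape as Sina's real URLs (doc-i<id>.shtml), not a link taken from the site:

```python
import re

# Hypothetical article URL in the same shape as Sina's, for illustration only
article_url = 'http://sports.sina.com.cn/china/afc/2017-09-03/doc-ifykiuaz1429964.shtml'

# Same pattern as in the script: capture the id starting with 'fyk'
# that is followed by '.shtml'
pattern_id = r'(fyk\w*).s?html'
news_id = re.search(pattern_id, article_url).group(1)
print(news_id)  # fykiuaz1429964

# Splice the id into the Ajax URL for the comment service
ajax_url = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
            '&channel=ty&newsid=comos-' + news_id +
            '&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')
```

Note that the pattern keys on the 'fyk' prefix seen in this batch of articles; ids with a different prefix would need a more general pattern.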
5. Running result:
6. Summary:
Resources returned directly in the initial page request are straightforward to crawl. For resources loaded by asynchronous (Ajax) requests, open the browser's developer tools, find the request that actually delivers the resource, and crawl that request directly.
E.g.: the comment count and comments crawled above.
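The comment-service response is not plain JSON but JavaScript of the form var data={...}, so the prefix must be removed before parsing. The payload below is a made-up example in the same shape as the real reply, kept minimal for illustration:

```python
import json

# Made-up response in the same 'var data={...}' shape as the real reply
raw = 'var data={"result": {"count": {"total": 57}}}'

# str.strip('var data=') strips *characters* from both ends, not the literal
# prefix; it happens to work for this payload, but splitting on the first '='
# is more robust.
payload = raw.split('=', 1)[1]
jd = json.loads(payload)
comment_count = jd['result']['count']['total']
print(comment_count)  # 57
```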