Open Sina News and you will see the page below. We first crawl the first-level URLs, which are the part circled in blue in the picture.
As the second picture shows, the list is paginated, so the crawler has to request each page in turn.
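The date-plus-page URL scheme can be sketched on its own before looking at the full crawler. This is a minimal Python 3 sketch, assuming the URL pattern that appears in the source listing (base path, then the date, an underscore, the page number, and `.shtml`). The names `date_range` and `page_urls` and the default of four pages per day are illustrative choices, not part of the original script.

```python
from datetime import date, timedelta

# Base path taken from the crawler's source listing
BASE_URL = "http://roll.mil.news.sina.com.cn/col/zgjq/"


def date_range(start, end):
    """Return ISO dates from start (inclusive) to end (exclusive)."""
    first = date.fromisoformat(start)
    days = (date.fromisoformat(end) - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(days)]


def page_urls(day, pages=4):
    """Build one list-page URL per page number for a single day (hypothetical helper)."""
    return [f"{BASE_URL}{day}_{page}.shtml" for page in range(1, pages + 1)]


# Two days with four pages each
urls = [url for day in date_range("2018-01-04", "2018-01-06") for url in page_urls(day)]
print(len(urls))
print(urls[0])
```

Generating the URLs up front keeps the paging logic separate from the network and parsing code, so the list can be checked without sending a single request.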
Source:
# coding: utf-8
import re
import sys
import datetime

import requests
from bs4 import BeautifulSoup

# Python 2 only: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding('utf8')

session = requests.session()


# Generate the list of dates for one year
def DateRange(start, end, step=1, format="%Y-%m-%d"):
    strptime, strftime = datetime.datetime.strptime, datetime.datetime.strftime
    days = (strptime(end, format) - strptime(start, format)).days
    return [strftime(strptime(start, format) + datetime.timedelta(i), format)
            for i in xrange(0, days, step)]


def spider():
    date_list = DateRange("2017-01-01", "2018-01-06")[::-1]
    print date_list
    for date in date_list:
        for page in range(1, 5):
            # Build the URL from the date and the page number
            url = "http://roll.mil.news.sina.com.cn/col/zgjq/" + str(date) + "_" + str(page) + ".shtml"
            # Disguise the request headers as a browser
            headers = {
                "Host": "roll.mil.news.sina.com.cn",
                "Cache-Control": "max-age=0",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "zh-CN,zh;q=0.8",
                "If-Modified-Since": "Sat, Jan 2018 09:57:24 GMT",
            }
            result = session.get(url=url, headers=headers).content
            # The page is encoded as gb2312; let BeautifulSoup handle the decoding
            soup = BeautifulSoup(result, 'html.parser')
            # Find the news list
            result_div = soup.find_all('div', attrs={"class": "fixlist"})[0]
            # Strip line breaks and tabs
            result_replace = str(result_div).replace('\n', '').replace('\r', '').replace('\t', '')
            # Match each list item with a regular expression
            result_list = re.findall('<li>(.*?)</li>', result_replace)
            for i in result_list:
                # Extract the news URL, title, and time
                news_url = re.findall('<a href="(.*?)" target=', i)[0]
                news_name = re.findall('target="_blank">(.*?)</a>', i)[0]
                news_time = re.findall(r'<span class="time">\((.*?)\)</span>', i)[0]
                print news_url
                print news_name
                print news_time


spider()
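The extraction step can also be tried offline against a small HTML fragment. The sketch below is written for Python 3 and uses only the standard library. The sample markup is made up to mirror the patterns the regular expressions above look for, so the real Sina markup may differ; `SAMPLE_HTML`, `ITEM_RE`, and `parse_news` are hypothetical names. It uses one pattern with named groups instead of three separate `findall(...)[0]` lookups, so an item that lacks a field is skipped rather than raising `IndexError`.

```python
import re

# Made-up fragment shaped like the list items the crawler expects
SAMPLE_HTML = """
<div class="fixlist">
<ul>
<li><a href="http://example.com/news/1.shtml" target="_blank">First headline</a><span class="time">(01-06 09:57)</span></li>
<li><a href="http://example.com/news/2.shtml" target="_blank">Second headline</a><span class="time">(01-06 08:30)</span></li>
</ul>
</div>
"""

# One pattern per list item: link, title, and the time inside parentheses
ITEM_RE = re.compile(
    r'<li>\s*<a href="(?P<url>[^"]*)" target="_blank">(?P<name>.*?)</a>'
    r'\s*<span class="time">\((?P<time>.*?)\)</span>\s*</li>',
    re.S,
)


def parse_news(html):
    """Return one dict per complete list item; incomplete items are skipped."""
    return [match.groupdict() for match in ITEM_RE.finditer(html)]


items = parse_news(SAMPLE_HTML)
for item in items:
    print(item["url"], item["name"], item["time"])
```

Skipping incomplete items silently is a design choice for this sketch. A crawler that must not lose entries might log the unmatched `<li>` blocks instead.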
Python Crawler Example (7): Crawling Sina Military News