The approach is the same as in the previous chapter, which scraped jokes; only the content being parsed has changed.
Code:
#-*- coding: utf-8 -*-
import urllib2
import re

def GetPageContent(page_url, heads):
    # Fetch the page and return it as a unicode string; return "" on any error.
    try:
        req = urllib2.Request(page_url, headers=heads)
        resp = urllib2.urlopen(req)
        return resp.read().decode('utf8')
    except Exception, e:
        print "Request [%s] error. -> " % (page_url), e
        return ""

def GetTopNotes(cont):
    # One capture group per field: author id, link, title, comments, likes.
    strRe = '.*?<li>.*?data-user-slug="(.*?)"'
    strRe += '.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'
    strRe += '.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'
    strRe += '.*?<a.*?id="like-note".*?</i>(.*?)</a>'
    pat = re.compile(strRe, re.S)
    items = re.findall(pat, cont)
    for item in items:
        for i in item:
            # Remove all whitespace inside the field before printing it.
            print "".join(i.split())
        print '==================================='

if __name__ == '__main__':
    url = 'http://www.jianshu.com/trending/now'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    cont = GetPageContent(url, headers)
    # Keep only the part of the page from the ranking list onward.
    cont = cont[cont.find('<ul class="top-notes ranking">')::]
    GetTopNotes(cont)
Output:
C:\Python27\python.exe F:/SrcCode/Python/GetNewlyJokes/JianShuSpider.py
4c4231dc6796/p/0aabe4120b78下水道的秘密4820
===================================
564d899d4d3c/p/8af1ad733670蟬鳴的夏季我想遇見你11771
===================================
a36e18ccb59d/p/f9e60eb98a28再見,愛過的人846
===================================
bcfca792018f/p/9fa6b6e58fd0我們曾相遇,想到就心酸(三十五)1927
===================================
2870cb3c6f77/p/8329df311356最佳情人39288
===================================
dc22650a4033/p/f7f39b72fdb2【連載】觸不到的女神(10)3121
===================================
Each record contains, in order: author id, article link, article title, comment count, and number of likes received.
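To see how the five capture groups map onto those five fields, the pattern can be tried offline on a small hand-written fragment. The following is only a sketch in Python 3: SAMPLE_HTML is a hypothetical fragment written to fit the pattern, not Jianshu's real markup, and the helper names slice_from_marker and parse_notes are made up for this illustration. It also guards against str.find returning -1 when the list marker is missing, and it collapses whitespace to single spaces instead of removing it, so that the English sample titles stay readable.

```python
import re

# Same pattern as in the script, one capture group per field.
NOTE_PATTERN = re.compile(
    r'.*?<li>.*?data-user-slug="(.*?)"'                   # author id
    r'.*?<h4>.*?<a.*?href="(.*?)".*?>(.*?)</a>'           # link, title
    r'.*?class="fa fa-comments-o".*?>.*?</i>(.*?)</a>'    # comment count
    r'.*?<a.*?id="like-note".*?</i>(.*?)</a>',            # like count
    re.S,
)

MARKER = '<ul class="top-notes ranking">'

# Hypothetical fragment (an assumption, not the real page source).
SAMPLE_HTML = '''
<div class="header">ignored</div>
<ul class="top-notes ranking">
  <li>
    <a class="avatar" data-user-slug="aaaaaaaaaaaa" href="/users/aaaaaaaaaaaa"></a>
    <h4><a href="/p/111111111111" target="_blank">First sample title</a></h4>
    <a href="/p/111111111111#comments"><i class="fa fa-comments-o"></i> 12 </a>
    <a id="like-note"><i class="fa fa-heart-o"></i> 34 </a>
  </li>
  <li>
    <a class="avatar" data-user-slug="bbbbbbbbbbbb" href="/users/bbbbbbbbbbbb"></a>
    <h4><a href="/p/222222222222" target="_blank">Second sample title</a></h4>
    <a href="/p/222222222222#comments"><i class="fa fa-comments-o"></i> 5 </a>
    <a id="like-note"><i class="fa fa-heart-o"></i> 0 </a>
  </li>
</ul>
'''


def slice_from_marker(html, marker):
    """Return html from the marker onward, or "" if the marker is absent."""
    start = html.find(marker)
    # str.find returns -1 on a miss; slicing from -1 would keep the last character.
    return html[start:] if start != -1 else ""


def parse_notes(html):
    """Return a list of (author_id, link, title, comments, likes) tuples."""
    notes = []
    for fields in NOTE_PATTERN.findall(html):
        # Collapse runs of whitespace inside each captured field.
        notes.append(tuple(" ".join(field.split()) for field in fields))
    return notes


for note in parse_notes(slice_from_marker(SAMPLE_HTML, MARKER)):
    for field in note:
        print(field)
    print("===================================")
```

Because every piece of the pattern is lazy (.*?), each match stops at the nearest closing tag, which is what keeps one <li> from swallowing the next. The pattern is still tied to the exact tag order and class names, so any change to the page layout would require adjusting it.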
Scraping Jianshu's Trending Articles with Python