Crawl jokes from Qiushibaike (the "embarrassing encyclopedia"), assuming the page URL is http://www.qiushibaike.com/8hr/page/1
Requirements:
Use requests to fetch the page and extract the data with XPath/re
For each post, get the user avatar link, username, joke content, number of votes, and number of comments
Save the results in a JSON file
Reference Code
# qiushibaike.py
import json

import requests
from lxml import etree

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}

try:
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    # Each post lives in a div whose id contains "qiushi_tag"
    result = html.xpath('//div[contains(@id, "qiushi_tag")]')

    items = []
    for site in result:
        item = {}
        # user avatar link and username
        item['imgUrl'] = site.xpath('./div/a/img/@src')[0]
        item['username'] = site.xpath('./div/a/@title')[0]
        # item['username'] = site.xpath('.//h2')[0].text
        # joke content
        item['content'] = site.xpath('.//div[@class="content"]/span')[0].text.strip()
        # vote count
        item['vote'] = site.xpath('.//i')[0].text
        # print(site.xpath('.//*[@class="number"]')[0].text)
        # comment count
        item['comments'] = site.xpath('.//i')[1].text
        print(item['imgUrl'], item['username'], item['content'],
              item['vote'], item['comments'])
        items.append(item)

    # save all posts to a JSON file
    with open('qiushibaike.json', 'w', encoding='utf-8') as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
except Exception as e:
    print(e)
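Because the live site's markup may change (and a network request is needed to test against it), the XPath extraction logic can be exercised offline against a minimal HTML snippet. The sample markup below is a hypothetical fragment modeled on the selectors in the reference code, not the real page:

```python
import json

from lxml import etree

# Hypothetical markup mimicking the structure the XPath selectors assume.
sample = '''
<div id="qiushi_tag_001">
  <div><a title="alice"><img src="http://example.com/avatar.png"/></a></div>
  <div class="content"><span> a short joke </span></div>
  <i>12</i><i>3</i>
</div>
'''

html = etree.HTML(sample)
items = []
for site in html.xpath('//div[contains(@id, "qiushi_tag")]'):
    items.append({
        'imgUrl': site.xpath('./div/a/img/@src')[0],
        'username': site.xpath('./div/a/@title')[0],
        'content': site.xpath('.//div[@class="content"]/span')[0].text.strip(),
        'vote': site.xpath('.//i')[0].text,
        'comments': site.xpath('.//i')[1].text,
    })

# ensure_ascii=False keeps Chinese text readable in the output
print(json.dumps(items, ensure_ascii=False))
```

Running the selectors on a fixed fragment like this makes it easy to verify each XPath expression before pointing the crawler at the real site.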
An example of crawling Qiushibaike with Python