On the second day of the New Year, while taking care of some things at home, I also took the chance to crawl the Douban Book Top250.
1. Construct the URL list: urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)] (a quick check of the generated URLs follows this list)
2. Use the requests module to fetch the page source, parse it with lxml, and extract the fields with XPath
3. Extract the information for each book
4. This could be wrapped in a function; here it is not encapsulated and is just run inline (a sketch of an encapsulated version follows the code below)
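As a quick sanity check of step 1 (the print lines are only illustrative, not part of the original script), the comprehension yields the 10 page URLs, start=0 through start=225:

# minimal check of the URL list built in step 1
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)]
print len(urls)   # 10 pages, 25 books per page
print urls[0]     # https://book.douban.com/top250?start=0
print urls[-1]    # https://book.douban.com/top250?start=225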
Python code:
# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import requests
from lxml import etree

urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)]
for url in urls:
    html = requests.get(url).content
    selector = etree.HTML(html)
    infos = selector.xpath('//tr[@class="item"]')  # one <tr class="item"> per book
    for info in infos:
        book_name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]
        published_infos = str(info.xpath('td/p/text()')[0])
        splitlistinfos = published_infos.split('/')  # author / publisher / date / price
        # print splitlistinfos
        published_date = str(splitlistinfos[-2])
        # print published_date
        price = str(splitlistinfos[-1])
        # print price
        rate = info.xpath('td/div/span[2]/text()')[0]
        # comment_nums = info.xpath('td/div/span[3]/text()')[0]
        # print comment_nums
        # the raw count text looks like "(12345人评价)": strip the brackets, whitespace
        # and suffix, then append the suffix back in a uniform form
        comment_nums = info.xpath('td/div/span[3]/text()')[0].strip('(').strip().strip(')').strip().strip('人评价').strip() + '人评价'
        introduceinfo = info.xpath('td/p/span/text()')
        print book_name, book_url, published_date, price, rate, comment_nums, introduceinfo[0] if len(introduceinfo) > 0 else ''
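Step 4 above notes that the crawl could be wrapped in a function. A minimal sketch of what that might look like, assuming a hypothetical crawl_page() helper that returns one tuple per book (the function name and return shape are my own choices, not from the original post):

# coding: utf-8
import requests
from lxml import etree


def crawl_page(url):
    """Fetch one Top250 page and return (name, url, date, price, rate) tuples."""
    html = requests.get(url).content
    selector = etree.HTML(html)
    books = []
    for info in selector.xpath('//tr[@class="item"]'):
        book_name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]
        publish_info = info.xpath('td/p/text()')[0].split('/')  # author / publisher / date / price
        published_date = publish_info[-2].strip()
        price = publish_info[-1].strip()
        rate = info.xpath('td/div/span[2]/text()')[0]
        books.append((book_name, book_url, published_date, price, rate))
    return books


if __name__ == '__main__':
    urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)]
    for url in urls:
        for book in crawl_page(url):
            print u' | '.join(book)

Keeping the per-page logic in one function makes it easier to reuse the parser, add error handling, or write the rows to a file later instead of printing them.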
Python 2.7_crawling Douban Book Top250 information with XPath syntax_20170129