1-Problem description
Scrape the book information (title, author, introduction, URL) from Douban's "New Book Express" page [1] and write the results to a .txt text file.
2-Approach [2]
STEP1 Read the HTML
STEP2 Use XPath to traverse the elements and attributes
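The two steps can be tried offline on a small hand-written snippet before touching the live page. The sketch below is only an illustration: the markup in SAMPLE_HTML is invented for this example and merely imitates the class names used in the program further down, and the example.com links are placeholders.

```python
from lxml import html

# Hypothetical markup, made up for this example (not copied from the real page).
SAMPLE_HTML = """
<ul class="cover-col-4 clearfix">
  <li>
    <a href="http://example.com/book/1"><img src="cover1.jpg"/></a>
    <div class="detail-frame">
      <h2> First Book </h2>
      <p class="color-gray"> Author A / 2015 </p>
      <p> A short introduction. </p>
    </div>
  </li>
  <li>
    <a href="http://example.com/book/2"><img src="cover2.jpg"/></a>
    <div class="detail-frame">
      <h2> Second Book </h2>
      <p class="color-gray"> Author B / 2015 </p>
      <p> Another introduction. </p>
    </div>
  </li>
</ul>
"""

# STEP1: read the HTML into an element tree
tree = html.fromstring(SAMPLE_HTML)

# STEP2: use XPath to pick out text nodes and attribute values
titles = [t.strip() for t in tree.xpath('//div[@class="detail-frame"]/h2/text()')]
intros = [t.strip() for t in tree.xpath('//div[@class="detail-frame"]/p[2]/text()')]
links = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li/a/@href')

for title, intro, link in zip(titles, intros, links):
    print(title, '|', intro, '|', link)
```

Selecting a/@href returns the attribute values directly, so no further processing of the link elements is needed.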
3-Tools used
Python, the lxml module, the requests module
4-Program Implementation
# -*- coding: utf-8 -*-
from lxml import html
import requests


page = requests.get('http://book.douban.com/latest?icn=index-latestbook-all')
tree = html.fromstring(page.text)

# If you saved the HTML file, you can use the following method instead
# page = open('/home/freyr/codehouse/python/512.htm', 'r').read()
# tree = html.fromstring(page)

# Extract the book information
bookname = tree.xpath('//div[@class="detail-frame"]/h2/text()')  # title
author = tree.xpath('//div[@class="detail-frame"]/p[@class="color-gray"]/text()')  # author
info = tree.xpath('//div[@class="detail-frame"]/p[2]/text()')  # introduction
url = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li/a[@href]')  # URL

# Strip surrounding whitespace; take the first attribute value of each <a> as the link
booknames = map(lambda x: x.strip(), bookname)
authors = map(lambda x: x.strip(), author)
infos = map(lambda x: x.strip(), info)
urls = map(lambda p: p.values()[0], url)

# Write one record per book, separated by a dashed line
with open('/home/freyr/codehouse/python/dbbook.txt', 'w+') as f:
    for book, author, info, url in zip(booknames, authors, infos, urls):
        f.write('%s\n\n%s\n\n%s' % (book.encode('utf-8'), author.encode('utf-8'), info.encode('utf-8')))
        f.write('\n\n%s\n' % url)
        f.write('\n\n-----------------------------------------\n\n\n')
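The program above encodes each field by hand before writing, which looks like a Python 2 habit. On Python 3 the output step would more likely open the file with an explicit encoding and write the strings as they are. A minimal sketch of that variant follows; the function name write_books, the sample record, and the temporary output path are all hypothetical and not part of the original program.

```python
import os
import tempfile


def write_books(path, records):
    """Write (title, author, intro, url) tuples to a UTF-8 text file."""
    separator = '-' * 41
    # The file object handles the encoding, so no per-field .encode() is needed.
    with open(path, 'w', encoding='utf-8') as f:
        for title, author, intro, url in records:
            f.write('%s\n\n%s\n\n%s' % (title, author, intro))
            f.write('\n\n%s\n' % url)
            f.write('\n\n%s\n\n\n' % separator)


# Hypothetical sample data standing in for the scraped results.
records = [('新书一', '作者甲', '简介', 'http://example.com/book/1')]
out_path = os.path.join(tempfile.gettempdir(), 'dbbook_demo.txt')
write_books(out_path, records)
```

Reading the file back with open(out_path, encoding='utf-8') should show the Chinese text unchanged.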
PS: 1. I have not really started learning web crawling yet; this is just a first, simple note.
2. The program runs into encoding issues [3].
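One common way Chinese text ends up garbled with lxml is handing it raw bytes and letting it guess the encoding. This may or may not be the exact problem described in [3]; the sketch below only illustrates two ways of making the encoding explicit, using a made-up UTF-8 byte string in place of a downloaded page.

```python
from lxml import html

# Made-up input: UTF-8 bytes, as one would get from a raw HTTP response body.
raw = '<div><h2>新书速递</h2></div>'.encode('utf-8')

# Option 1: decode the bytes yourself, then parse the resulting str
tree_a = html.fromstring(raw.decode('utf-8'))

# Option 2: pass the bytes and tell the parser which encoding they use
parser = html.HTMLParser(encoding='utf-8')
tree_b = html.fromstring(raw, parser=parser)

title_a = tree_a.xpath('//h2/text()')[0]
title_b = tree_b.xpath('//h2/text()')[0]
print(title_a, title_b)
```

With requests, the bytes correspond to page.content, while page.text is the body already decoded using the encoding that requests detected or was told to use via page.encoding.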
[1] Douban - New Book Express
[2] lxml and requests
[3] Garbled Chinese text with lxml
Python crawler: Douban - New Book Express - book parsing