One. Access the site with cookies
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}

# requests expects cookies as a name -> value dict, not one raw string.
# These values come from an old example session; replace them with the
# cookies of your own logged-in browser session.
cookies = {
    'bid': 'a3mhk2yepzw',
    'll': '"108296"',
    'ps': 'y',
    'ue': '"[email protected]"',
    '_pk_ref.100001.8cb4': '%5b%22%22%2c%22%22%2c1482650884%2c%22https%3a%2f%2fwww.so.com%2fs%3fie%3dutf-8%26shb%3d1%26src%3dhome_so.com%26q%3dpython%2b%25e8%25b1%2586%25e7%2593%25a3%25e6%25ba%2590%22%5d',
    '_gat_ua-7019765-1': '1',
    'ap': '1',
    '__utmt': '1',
    '_ga': 'ga1.2.1329310863.1477654711',
    'dbcl2': '"2625855:/v89oxs4wd4"',
    'ck': 'eepo',
    'push_noty_num': '0',
    'push_doumail_num': '0',
    '_pk_id.100001.8cb4': '40c3cee75022c8e1.1477654710.8.1482652441.1482639716.',
    '_pk_ses.100001.8cb4': '*',
    '__utma': '30149280.1329310863.1477654711.1482643456.1482650885.10',
    '__utmb': '30149280.19.10.1482650885',
    '__utmc': '30149280',
    '__utmz': '30149280.1482511651.7.6.utmcsr=blog.csdn.net|utmccn=(Referral)|utmcmd=referral|utmcct=/alanzjl/article/details/50681289',
    '__utmv': '30149280.262',
    '_vwo_uuid_v2': '64e0e442544cb2fe2d322c59f01f1115|026be912d24071903cb0ed891ae9af65',
}

url = 'http://www.douban.com'
r = requests.get(url, cookies=cookies, headers=headers)

# Save the raw response body so the logged-in page can be inspected.
with open('Douban_2.txt', 'wb+') as f:
    f.write(r.content)
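Browser developer tools usually show the cookie as one long "name=value; name=value" string, while the cookies argument of requests.get takes a dict. The following is a minimal sketch of one way to convert between the two. The helper name parse_cookie_header and the shortened sample string are illustrative assumptions, not part of the original code.

```python
def parse_cookie_header(raw_cookie):
    """Split a raw 'name=value; name=value' cookie string into a dict.

    Hypothetical helper: only the first '=' separates name from value,
    so values that themselves contain '=' are kept intact.
    """
    cookies = {}
    for pair in raw_cookie.split(';'):
        name, sep, value = pair.strip().partition('=')
        if sep and name:  # skip empty fragments and pairs without '='
            cookies[name] = value
    return cookies


# Shortened sample string, for illustration only.
RAW = 'bid=a3mhk2yepzw; ll="108296"; ps=y'
cookies = parse_cookie_header(RAW)
print(cookies)
```

The resulting dict can be passed as cookies=cookies to requests.get in the same way as a hand-written one.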
Two. Search with XPath
import requests
from lxml import etree

s = requests.Session()
# 250 entries at 25 per page: start = 0, 25, ..., 225
for start in range(0, 250, 25):
    print(start)
    url = 'https://movie.douban.com/top250/?start=' + str(start)
    r = s.get(url)
    r.encoding = 'utf-8'
    root = etree.HTML(r.content)
    # Select every movie entry by its tag path with XPath
    items = root.xpath('//ol/li/div[@class="item"]')
    # print(len(items))
    for item in items:
        # The first span with class "title" holds the Chinese name
        title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
        # title is a list; encoding to gb2312 with 'ignore' and decoding again
        # drops characters that would otherwise print as garbled text
        name = title[0].encode('gb2312', 'ignore').decode('gb2312')
        # rank = item.xpath('./div[@class="pic"]/em/text()')[0]
        rating = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
        print(name, rating)
Result: the names and ratings of the Top 250 movies are crawled successfully.
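The paging step of the crawl can be checked offline before sending any requests. This is a small sketch that builds the page URLs, assuming 250 entries in total and 25 per page (the step used in the loop). The function name page_urls and its parameters are illustrative.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250/'


def page_urls(total=250, page_size=25):
    """Return one URL per result page, using the 'start' query parameter."""
    return [
        BASE_URL + '?' + urlencode({'start': offset})
        for offset in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the first and last URL makes it easy to confirm that the offsets stop at 225 rather than requesting an empty page at 250.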
PS: You must know the structure of the web page before you can write the XPath expressions.
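To see how the page structure maps onto the XPath expressions, it can help to try them on a small local snippet first. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, in place of lxml so that it runs without extra installs. The sample markup is a simplified, hypothetical stand-in for one page: the class names follow the XPath expressions in the crawl code (in lowercase), and the movie names and ratings are made up.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; the real page contains much more.
SAMPLE = """
<div>
  <ol>
    <li>
      <div class="item">
        <div class="pic"><em>1</em></div>
        <div class="info">
          <div class="hd"><a href="#"><span class="title">Sample Movie A</span></a></div>
          <div class="bd"><span class="rating_num">9.0</span></div>
        </div>
      </div>
    </li>
    <li>
      <div class="item">
        <div class="pic"><em>2</em></div>
        <div class="info">
          <div class="hd"><a href="#"><span class="title">Sample Movie B</span></a></div>
          <div class="bd"><span class="rating_num">8.5</span></div>
        </div>
      </div>
    </li>
  </ol>
</div>
"""


def extract(root):
    """Return (rank, title, rating) for every div with class 'item'."""
    rows = []
    for item in root.findall(".//ol/li/div[@class='item']"):
        rank = item.find("./div[@class='pic']/em").text
        title = item.find(".//span[@class='title']").text
        rating = item.find(".//span[@class='rating_num']").text
        rows.append((rank, title, rating))
    return rows


rows = extract(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

If an expression returns nothing on the real page, comparing the page source with a reduced snippet like this usually shows which tag or class name differs.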
Python crawler knowledge point three: analyzing Douban Top 250 data