from lxml import etree
import requests

url = 'https://movie.douban.com/chart'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
response = requests.get(url, headers=headers)
html_str = response.content.decode()
# print(html_str)

# Use etree to parse the page
html = etree.HTML(html_str)

# Get the URL of each movie
url_list = html.xpath("//div[@class='indent']/div/table//div[@class='pl2']/a/@href")
# print(url_list)

# Get the poster image URL of each movie
img_list = html.xpath("//div[@class='indent']/div/table//a[@class='nbg']/img/@src")
# print(img_list)

# Build one dictionary per movie that holds that movie's data
# 1. Group: each <table> is one movie
# 2. Extract the fields from each group with relative XPath expressions
rets = html.xpath("//div[@class='indent']/div/table")
for table in rets:
    item = {}
    item['title'] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/", "").strip()
    item['href'] = table.xpath(".//div[@class='pl2']/a/@href")[0]
    item['img'] = table.xpath(".//a[@class='nbg']/img/@src")[0]
    item['comment_num'] = table.xpath(".//div[@class='pl2']/div//span[@class='pl']/text()")[0]
    item['rating_num'] = table.xpath(".//div[@class='pl2']/div//span[@class='rating_nums']/text()")[0]
    print(item)
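
The script above indexes every XPath result with [0]. Because xpath() returns a list, that raises IndexError whenever a node is missing, for example if a movie has no rating yet. A minimal sketch of a safer accessor follows; the helper name `first` and its `default` parameter are hypothetical and not part of lxml, and plain lists stand in here for XPath results.

```python
def first(nodes, default=None):
    """Return the first item of an XPath result list, or default if it is empty.

    Hypothetical helper, not an lxml API. String results are stripped of
    surrounding whitespace; other values (e.g. elements) are returned as-is.
    """
    if not nodes:
        return default
    value = nodes[0]
    return value.strip() if isinstance(value, str) else value


# Plain lists stand in for the lists that xpath() returns.
print(first([" 9.1 "]))            # 9.1
print(first([], default="N/A"))    # N/A
```

In the loop, a field could then be read as first(table.xpath(".//span[@class='rating_nums']/text()")), so a movie without a rating yields None rather than stopping the crawl.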
Python XPath example: crawling the Douban movie chart (desktop site)