12 Lines of Python to Brute-Force Crawl Douban Comments on "Black Panther"

Tags: sleep, xpath

Author: Huangga

Source: https://www.jianshu.com/p/ea0b56e3bd86


Spring is in the air, and in the blink of an eye it is March again, "crawler season".
Around this time, many students writing theses struggle to collect data and set off down the web-crawler path; many analysts likewise reach for crawlers when doing public-opinion monitoring or competitive analysis.


Today this article takes you through 12 simple lines of Python code for a first glimpse into the secret world of web crawlers.
Crawl Target


This article uses requests + XPath to crawl part of the short-comment section for the movie "Black Panther" on Douban. Without further ado, code first:

import requests; from lxml import etree; import pandas as pd; import time; import random; from tqdm import tqdm
name, score, comment = [], [], []
def danye_crawl(page):
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    response = etree.HTML(requests.get(url).content.decode('utf-8'))
    print('\n', 'page %s comment crawl succeeded' % (page)) if requests.get(url).status_code == 200 else print('\n', 'page %s crawl failed' % (page))
    for i in range(1, 21):
        name.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % (i))[0].text)
        score.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % (i))[0].attrib['class'][7])
        comment.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % (i))[0].text)
for i in tqdm(range(11)): danye_crawl(i); time.sleep(random.uniform(6, 9))
res = pd.DataFrame({'name': name, 'score': score, 'comment': comment}, columns=['name', 'score', 'comment']); res.to_csv('douban.csv')

Run the crawler script above, and you can witness the magic.


The crawler's results match the original page content exactly


Friendly progress display, courtesy of the tqdm module


Tool Preparation


Chrome browser (for analyzing HTTP requests and capturing packets)

Install Python 3 and the related modules (requests, lxml, pandas, time, random, tqdm)
requests: for making HTTP requests in a simple way
lxml: a parsing library that is faster and more robust than Beautiful Soup
pandas: a data-processing powerhouse
time: sets the crawler's access interval, to avoid getting blocked
random: random-number generation, used together with time (see the sketch after this list)
tqdm: an interactive tool that shows the progress of the running program
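
As a quick sketch of how the last three modules fit together (the 11 pages and the 6-9 second interval are the same values the main script uses):

import time
import random
from tqdm import tqdm

for page in tqdm(range(11)):          # tqdm draws a progress bar over the 11 pages
    # the per-page crawl would go here
    time.sleep(random.uniform(6, 9))  # pause a random 6-9 seconds between requests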


Basic Steps


Network request analysis

Web-page content parsing

Data reading and storage


Knowledge Points Involved


Crawler protocol

HTTP request analysis

Making requests with the requests library

XPath syntax

Basic Python syntax

pandas data processing


Crawler Protocol


The crawler protocol, i.e. the robots.txt file in the root directory of a website, tells crawlers what may and may not be scraped; its Crawl-delay field tells crawlers the access interval the site expects. (Be kind to the other side's servers and scrape politely: this article sets the crawler's access interval to a random 6-9 seconds.)


Douban's crawler protocol (robots.txt)
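
If you prefer to check the protocol programmatically rather than by eye, Python's standard library ships a robots.txt parser; the sketch below is illustrative and assumes the protocol file sits at the usual /robots.txt location:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://movie.douban.com/robots.txt')  # assumed location of the protocol file
rp.read()

# Is a generic user agent allowed to fetch the comments page?
print(rp.can_fetch('*', 'https://movie.douban.com/subject/6390825/comments'))
# The Crawl-delay the site asks for, or None if unspecified
print(rp.crawl_delay('*'))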


HTTP Request Analysis


Use the Chrome browser to visit the "Black Panther" short-comment page at https://movie.douban.com/subject/6390825/comments?sort=new_score&status=P, press F12 to open the Network panel, refresh the page to capture the requests again, and use Chrome's filtering tools to analyze them and find the target request.

Request analysis of the Douban short-comment page


Through request analysis, we find that the target URL is
'https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=', and that each time you page forward, the start parameter increases by 20.
(Through repeated paging attempts, we also found that page 11 requires login to view, and that even when logged in only the first 500 comments are shown. As a simple demo, this article crawls only the first 11 pages of content.)
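
To make the paging rule concrete, here is a minimal sketch that just prints the first few page URLs; only the start parameter changes, in steps of 20:

base = ('https://movie.douban.com/subject/6390825/comments'
        '?start=%s&limit=20&sort=new_score&status=P&percent_type=')

for page in range(3):
    print(base % (page * 20))  # start = 0, 20, 40, ...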


Requests Request


Send a GET request through the requests module, obtain the byte data via the content attribute and decode it as utf-8, then add a check that reports whether the resource was fetched successfully (status code 200).

Request Details Analysis


(Besides content there is also the text attribute, which returns a decoded string; using text directly can easily produce garbled output for Chinese pages when the encoding is guessed wrong.)
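
Putting this together, a minimal request-and-check sketch for the first page looks like this:

import requests

url = ('https://movie.douban.com/subject/6390825/comments'
       '?start=0&limit=20&sort=new_score&status=P&percent_type=')
response = requests.get(url)

if response.status_code == 200:              # 200 means the resource was fetched successfully
    html = response.content.decode('utf-8')  # content holds raw bytes; decode them explicitly
    print('page 0 comment crawl succeeded')
else:
    print('page 0 crawl failed')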


XPath Syntax parsing


After the data comes back, the web-page content has to be parsed. Options include regular expressions, Beautiful Soup, XPath and so on, of which XPath is quick and simple. Here we parse the resource with XPath to extract the user name, rating and comment text of the first 220 comments.
(You can use Chrome's handy "Copy XPath" feature; to learn XPath syntax, see http://www.runoob.com/xpath/xpath-tutorial.html)
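
As a minimal parsing sketch, reusing the main script's XPath expressions on the first comment block (the html variable is assumed to hold the decoded page source from the previous sketch):

from lxml import etree

tree = etree.HTML(html)  # html: the utf-8 decoded page source from the previous step

# user name, rating class character, and comment text of the first comment block
name = tree.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/a')[0].text
score = tree.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/span[2]')[0].attrib['class'][7]
comment = tree.xpath('//*[@id="comments"]/div[1]/div[2]/p')[0].text
print(name, score, comment)
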
Data Processing


After the data is collected, we build a dictionary from the lists, construct a DataFrame from the dictionary, and write the data out to a CSV file via the pandas module.
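
In sketch form, assuming the three lists have been filled by the crawl loop:

import pandas as pd

res = {'name': name, 'score': score, 'comment': comment}       # three lists -> one dict
res = pd.DataFrame(res, columns=['name', 'score', 'comment'])  # dict -> DataFrame
res.to_csv('douban.csv')                                       # DataFrame -> CSV on disk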


Conclusion and Easter Eggs


This example used the requests + XPath approach to successfully crawl part of the Douban short-comment data for the movie "Black Panther", laying a good data foundation for text analysis or other data-mining work.
As a demo, this article only shows a simple crawler workflow; richer topics such as request headers, request-body capture, cookies, simulated login, and distributed crawlers will be covered in later articles, so stay tuned.


Finally, here is the expanded, plain-style version of the code:

import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    # crawl a single page of comments; start increases by 20 per page
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    response = requests.get(url)
    if response.status_code == 200:
        print('\n', 'page %s comment crawl succeeded' % (page))
    else:
        print('\n', 'page %s crawl failed' % (page))
    response = etree.HTML(response.content.decode('utf-8'))

    for i in range(1, 21):
        name_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % (i))
        score_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % (i))
        comment_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % (i))

        name_element = name_list[0].text
        score_element = score_list[0].attrib['class'][7]
        comment_element = comment_list[0].text

        name.append(name_element)
        score.append(score_element)
        comment.append(comment_element)

for i in tqdm(range(11)):
    danye_crawl(i)
    time.sleep(random.uniform(6, 9))

res = {'name': name, 'score': score, 'comment': comment}
res = pd.DataFrame(res, columns=['name', 'score', 'comment'])
res.to_csv('douban.csv')

