Author: Huangga
Source: https://www.jianshu.com/p/ea0b56e3bd86
The grass is tall and the days are warm; in the blink of an eye it is March again, "crawler season".
Around this time, many students writing their theses struggle to get hold of data and set off down the web-crawler path;
many analysts likewise rely on crawlers for public-opinion monitoring or competitive analysis.
Today, this article uses 12 lines of simple Python code to give you a first glimpse into the secret world of web crawlers.
Crawl Target
This article uses requests + XPath to crawl part of the Douban short reviews for the movie Black Panther. Without further ado, code first:
import requests; from lxml import etree; import pandas as pd; import time; import random; from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    r = requests.get(url)
    response = etree.HTML(r.content.decode('utf-8'))
    print('\n', 'page %s comment crawl succeeded' % page) if r.status_code == 200 else print('\n', 'page %s crawl failed' % page)
    for i in range(1, 21):
        name.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % i)[0].text)
        score.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % i)[0].attrib['class'][7])
        comment.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % i)[0].text)

for i in tqdm(range(11)): danye_crawl(i); time.sleep(random.uniform(6, 9))

res = pd.DataFrame({'name': name, 'score': score, 'comment': comment}, columns=['name', 'score', 'comment']); res.to_csv('douban.csv')
Run the crawler script above, and we can witness the miracle:
The crawled results match the original page content exactly.
The tqdm module gives the run a friendly progress display.
Tool Preparation
Chrome browser (for analyzing and capturing HTTP requests)
Python 3 with the related modules installed (requests, lxml, pandas, time, random, tqdm; time and random are part of the standard library, the rest can be installed with pip)
requests: for sending simple HTTP requests
lxml: a parsing library that is faster and more powerful than Beautiful Soup
pandas: a powerful data-processing library
time: for setting the crawl interval so the crawler does not get blocked
random: random-number generation, used together with time
tqdm: an interactive tool that shows the program's progress
Basic Steps
Network request analysis
Web page content parsing
Reading and storing the data
Knowledge Points Involved
The robots protocol
HTTP request analysis
Sending requests with requests
XPath syntax
Basic Python syntax
Data processing with pandas
The Robots Protocol
The robots protocol is the robots.txt file in a website's root directory; it tells crawlers what may and may not be scraped, and its Crawl-delay field tells them the access interval the site expects. (To go easy on the other side's servers and scrape politely, this article sets the crawl interval to a random 6 to 9 seconds.)
Douban's robots protocol
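Incidentally, Python's standard library can read a robots.txt for you. Below is a minimal sketch, not part of the original article's code, that uses urllib.robotparser to check whether the short-review URL may be fetched; robots.txt is served per host, so we point it at movie.douban.com:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt of the host we plan to crawl
rp = RobotFileParser()
rp.set_url('https://movie.douban.com/robots.txt')
rp.read()

url = 'https://movie.douban.com/subject/6390825/comments'
print(rp.can_fetch('*', url))   # may the generic user agent fetch this URL?
print(rp.crawl_delay('*'))      # the site's requested Crawl-delay, or None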
HTTP Request Analysis
Use the Chrome browser to visit the Black Panther short-review page https://movie.douban.com/subject/6390825/comments?sort=new_score&status=P, press F12 to open the Network panel for request analysis, refresh the page to re-trigger the requests, and use Chrome's filtering and inspection tools to find the one we want.
Request analysis of the Douban short-review page
Through request analysis, we found the target URL: 'https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=', and each time we page forward, the start parameter increases by 20.
(Through several paging attempts, we found that viewing from page 11 on requires logging in, and that even when logged in only the first 500 short reviews are shown. As a simple demo, this article crawls only the first 11 pages of content.)
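To make the paging rule concrete, here is a minimal sketch that lets requests build the query string; the three-page range is arbitrary and just for illustration:

import requests

base = 'https://movie.douban.com/subject/6390825/comments'
for page in range(3):
    params = {
        'start': page * 20,        # increases by 20 with each page
        'limit': 20,
        'sort': 'new_score',
        'status': 'P',
        'percent_type': '',
    }
    r = requests.get(base, params=params)
    print(r.url, r.status_code)    # inspect the generated URL and its status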
Sending Requests with requests
We send a GET request with the requests module, take the byte data from the content attribute and decode it as utf-8, then add a little interactivity that checks whether the resource was fetched successfully (status code 200) and prints the result.
Analysis of the request details
(Besides content there is also the text attribute, which returns a Unicode string; using text directly on Chinese pages easily produces garbled characters.)
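As a standalone sketch of just this step (the URL is the first-page address found during request analysis):

import requests

url = ('https://movie.douban.com/subject/6390825/comments'
       '?start=0&limit=20&sort=new_score&status=P&percent_type=')
res = requests.get(url)

if res.status_code == 200:
    html = res.content.decode('utf-8')   # raw bytes, decoded explicitly as utf-8
    print('fetch succeeded:', len(html), 'characters')
else:
    print('fetch failed with status', res.status_code)

# res.text also returns a string, but it relies on requests' guessed encoding,
# which is what tends to garble Chinese text.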
XPath Parsing
Once the data is fetched, the web page content has to be parsed. This can be done with regular expressions, Beautiful Soup, XPath, and so on; among these, XPath is fast and convenient. Here we use XPath to parse out the username, rating, and review text of the first 220 short reviews.
(You can use Chrome's built-in ability to copy an element's XPath; for XPath syntax, see the tutorial at http://www.runoob.com/xpath/xpath-tutorial.html)
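To show the XPath calls in isolation, here is a self-contained sketch run against a tiny HTML fragment shaped like the comment markup the article's XPaths expect; the fragment and its values are invented for illustration:

from lxml import etree

# A made-up fragment mimicking the structure the article's XPaths rely on
html = etree.HTML('''
<div id="comments">
  <div class="comment-item">
    <div class="avatar"></div>
    <div class="comment">
      <h3>
        <span class="comment-vote"></span>
        <span class="comment-info">
          <a href="#">some_user</a>
          <span class="comment-time">2018-03-09</span>
          <span class="allstar40 rating"></span>
        </span>
      </h3>
      <p>A short review goes here.</p>
    </div>
  </div>
</div>''')

name = html.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/a')[0].text
star = html.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/span[2]')[0].attrib['class'][7]
text = html.xpath('//*[@id="comments"]/div[1]/div[2]/p')[0].text
print(name, star, text)   # -> some_user 4 A short review goes here.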
Data Processing
After obtaining the data, we build a dictionary from the lists, construct a DataFrame from the dictionary, and write the data out to a CSV file with the pandas module.
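In isolation, that step looks like this; the two rows are dummy stand-ins for the crawled lists:

import pandas as pd

# Dummy stand-ins for the name/score/comment lists filled by the crawler
name = ['user_a', 'user_b']
score = ['4', '5']
comment = ['Great visuals.', 'Loved the soundtrack.']

res = pd.DataFrame({'name': name, 'score': score, 'comment': comment},
                   columns=['name', 'score', 'comment'])
res.to_csv('douban.csv')   # writes an index column by default; pass index=False to omit it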
Conclusion and Easter Eggs
Through the requests + XPath approach, this example successfully crawled part of the Douban short-review data for the movie Black Panther, laying a good data foundation for text analysis or other data-mining work.
As a demo, this article only shows a simple crawler workflow. For more goodies such as request headers, request-body capture, cookies, simulated login, and distributed crawlers, stay tuned for future articles.
Finally, here is the expanded, plain-spoken version of the code:
import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    # Crawl one page of 20 short reviews ("danye" means "single page")
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    res = requests.get(url)
    if res.status_code == 200:
        print('\n', 'page %s comment crawl succeeded' % page)
    else:
        print('\n', 'page %s crawl failed' % page)
    response = etree.HTML(res.content.decode('utf-8'))
    for i in range(1, 21):
        name_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % i)
        score_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % i)
        comment_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % i)
        name_element = name_list[0].text
        score_element = score_list[0].attrib['class'][7]  # e.g. 'allstar40 rating' -> '4'
        comment_element = comment_list[0].text
        name.append(name_element)
        score.append(score_element)
        comment.append(comment_element)

for i in tqdm(range(11)):
    danye_crawl(i)
    time.sleep(random.uniform(6, 9))  # polite 6-9 second pause between pages

res = {'name': name, 'score': score, 'comment': comment}
res = pd.DataFrame(res, columns=['name', 'score', 'comment'])
res.to_csv('douban.csv')