Author: Huangga
Source: https://www.jianshu.com/p/ea0b56e3bd86
The grass is tall and the days are warm; in the blink of an eye it is March again, "crawler season".
Around this time, many students writing their theses struggle to get hold of data and set off down the web-crawler path;
many analysts likewise rely on crawlers for public-opinion monitoring or competitive analysis.
Today, this article uses 12 lines of simple Python code to give you a first glimpse into the secret world of web crawlers.
Crawl Target
This article uses requests + XPath to crawl part of the Douban short reviews for the movie Black Panther. Without further ado, code first:
import requests; from lxml import etree; import pandas as pd; import time; import random; from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    r = requests.get(url)
    response = etree.HTML(r.content.decode('utf-8'))
    print('\n', 'page %s comment crawl succeeded' % page) if r.status_code == 200 else print('\n', 'page %s crawl failed' % page)
    for i in range(1, 21):
        name.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % i)[0].text)
        score.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % i)[0].attrib['class'][7])
        comment.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % i)[0].text)

for i in tqdm(range(11)): danye_crawl(i); time.sleep(random.uniform(6, 9))

res = pd.DataFrame({'name': name, 'score': score, 'comment': comment}, columns=['name', 'score', 'comment']); res.to_csv('douban.csv')
Run the crawler script above, and we can witness the miracle:
The crawled results match the original page content exactly.
The tqdm module gives the run a friendly progress display.
Tool Preparation
Chrome browser (for analyzing and capturing HTTP requests)
Python 3 with the related modules installed (requests, lxml, pandas, time, random, tqdm; time and random are part of the standard library, the rest can be installed with pip)
requests: for sending simple HTTP requests
lxml: a parsing library that is faster and more powerful than Beautiful Soup
pandas: a powerful data-processing library
time: for setting the crawl interval so the crawler does not get blocked
random: random-number generation, used together with time
tqdm: an interactive tool that shows the program's progress
Basic Steps
Network request analysis
Web page content parsing
Reading and storing the data
Knowledge Points Involved
The robots protocol
HTTP request analysis
Sending requests with requests
XPath syntax
Basic Python syntax
Data processing with pandas
The Robots Protocol
The robots protocol is the robots.txt file in a website's root directory; it tells crawlers what may and may not be scraped, and its Crawl-delay field tells them the access interval the site expects. (To go easy on the other side's servers and scrape politely, this article sets the crawl interval to a random 6 to 9 seconds.)
Douban's robots protocol
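Incidentally, Python's standard library can read a robots.txt for you. Below is a minimal sketch, not part of the original article's code, that uses urllib.robotparser to check whether the short-review URL may be fetched; robots.txt is served per host, so we point it at movie.douban.com:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt of the host we plan to crawl
rp = RobotFileParser()
rp.set_url('https://movie.douban.com/robots.txt')
rp.read()

url = 'https://movie.douban.com/subject/6390825/comments'
print(rp.can_fetch('*', url))   # may the generic user agent fetch this URL?
print(rp.crawl_delay('*'))      # the site's requested Crawl-delay, or None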
HTTP Request Analysis
Use the Chrome browser to visit the Black Panther short-review page https://movie.douban.com/subject/6390825/comments?sort=new_score&status=P, press F12 to open the Network panel for request analysis, refresh the page to re-trigger the requests, and use Chrome's filtering and inspection tools to find the one we want.
Request analysis of the Douban short-review page
Through request analysis, we found the target URL: 'https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=', and each time we page forward, the start parameter increases by 20.
(Through several paging attempts, we found that viewing from page 11 on requires logging in, and that even when logged in only the first 500 short reviews are shown. As a simple demo, this article crawls only the first 11 pages of content.)
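To make the paging rule concrete, here is a minimal sketch that lets requests build the query string; the three-page range is arbitrary and just for illustration:

import requests

base = 'https://movie.douban.com/subject/6390825/comments'
for page in range(3):
    params = {
        'start': page * 20,        # increases by 20 with each page
        'limit': 20,
        'sort': 'new_score',
        'status': 'P',
        'percent_type': '',
    }
    r = requests.get(base, params=params)
    print(r.url, r.status_code)    # inspect the generated URL and its status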
Sending Requests with requests
We send a GET request with the requests module, take the byte data from the content attribute and decode it as utf-8, then add a little interactivity that checks whether the resource was fetched successfully (status code 200) and prints the result.
Analysis of the request details
(Besides content there is also the text attribute, which returns a Unicode string; using text directly on Chinese pages easily produces garbled characters.)
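As a standalone sketch of just this step (the URL is the first-page address found during request analysis):

import requests

url = ('https://movie.douban.com/subject/6390825/comments'
       '?start=0&limit=20&sort=new_score&status=P&percent_type=')
res = requests.get(url)

if res.status_code == 200:
    html = res.content.decode('utf-8')   # raw bytes, decoded explicitly as utf-8
    print('fetch succeeded:', len(html), 'characters')
else:
    print('fetch failed with status', res.status_code)

# res.text also returns a string, but it relies on requests' guessed encoding,
# which is what tends to garble Chinese text.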
XPath Parsing
Once the data is fetched, the web page content has to be parsed. This can be done with regular expressions, Beautiful Soup, XPath, and so on; among these, XPath is fast and convenient. Here we use XPath to parse out the username, rating, and review text of the first 220 short reviews.
(You can use Chrome's built-in ability to copy an element's XPath; for XPath syntax, see the tutorial at http://www.runoob.com/xpath/xpath-tutorial.html)
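To show the XPath calls in isolation, here is a self-contained sketch run against a tiny HTML fragment shaped like the comment markup the article's XPaths expect; the fragment and its values are invented for illustration:

from lxml import etree

# A made-up fragment mimicking the structure the article's XPaths rely on
html = etree.HTML('''
<div id="comments">
  <div class="comment-item">
    <div class="avatar"></div>
    <div class="comment">
      <h3>
        <span class="comment-vote"></span>
        <span class="comment-info">
          <a href="#">some_user</a>
          <span class="comment-time">2018-03-09</span>
          <span class="allstar40 rating"></span>
        </span>
      </h3>
      <p>A short review goes here.</p>
    </div>
  </div>
</div>''')

name = html.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/a')[0].text
star = html.xpath('//*[@id="comments"]/div[1]/div[2]/h3/span[2]/span[2]')[0].attrib['class'][7]
text = html.xpath('//*[@id="comments"]/div[1]/div[2]/p')[0].text
print(name, star, text)   # -> some_user 4 A short review goes here.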
Data Processing
After obtaining the data, we build a dictionary from the lists, construct a DataFrame from the dictionary, and write the data out to a CSV file with the pandas module.
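In isolation, that step looks like this; the two rows are dummy stand-ins for the crawled lists:

import pandas as pd

# Dummy stand-ins for the name/score/comment lists filled by the crawler
name = ['user_a', 'user_b']
score = ['4', '5']
comment = ['Great visuals.', 'Loved the soundtrack.']

res = pd.DataFrame({'name': name, 'score': score, 'comment': comment},
                   columns=['name', 'score', 'comment'])
res.to_csv('douban.csv')   # writes an index column by default; pass index=False to omit it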
Conclusion and Easter Eggs
Through the requests + XPath approach, this example successfully crawled part of the Douban short-review data for the movie Black Panther, laying a good data foundation for text analysis or other data-mining work.
As a demo, this article only shows a simple crawler workflow. For more goodies such as request headers, request-body capture, cookies, simulated login, and distributed crawlers, stay tuned for future articles.
Finally, here is the expanded, plain-spoken version of the code:
import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    # Crawl one page of 20 short reviews ("danye" means "single page")
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page * 20)
    res = requests.get(url)
    if res.status_code == 200:
        print('\n', 'page %s comment crawl succeeded' % page)
    else:
        print('\n', 'page %s crawl failed' % page)
    response = etree.HTML(res.content.decode('utf-8'))
    for i in range(1, 21):
        name_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % i)
        score_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % i)
        comment_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % i)
        name_element = name_list[0].text
        score_element = score_list[0].attrib['class'][7]  # e.g. 'allstar40 rating' -> '4'
        comment_element = comment_list[0].text
        name.append(name_element)
        score.append(score_element)
        comment.append(comment_element)

for i in tqdm(range(11)):
    danye_crawl(i)
    time.sleep(random.uniform(6, 9))  # polite 6-9 second pause between pages

res = {'name': name, 'score': score, 'comment': comment}
res = pd.DataFrame(res, columns=['name', 'score', 'comment'])
res.to_csv('douban.csv')