This article compares BeautifulSoup and Selenium by crawling the Douban Top250 movie list. The two approaches are essentially the same: both analyze the DOM tree structure of the web page to locate elements, and then crawl the specific movie information. Comparing the two pieces of code should further deepen your impression of Python crawlers. The article also links to my earlier introductions to crawler basics, which should be convenient for beginners.
In short, I hope the article is helpful to you; if anything is wrong or could be better, please bear with me~
I. DOM Tree Structure Analysis
Douban Top250 movie URL: https://movie.douban.com/top250?format=text
In the Chrome browser, right-click and choose "Inspect" (or "Inspect Element") to locate a specific element, as shown in the figure:
Each movie in the figure corresponds to one block of HTML:
<li><div class="item">......</div></li>
BeautifulSoup can fetch all of these blocks with the soup.find_all(attrs={"class": "item"}) function and then locate the specific content inside each one: <span class="title"> gives the title, and <div class="star"> gives the rating and the number of ratings.
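As a minimal illustration of this idea (a Python 3 sketch, while the article's own code below uses Python 2; the requests package and the User-Agent header are assumptions added here, since Douban may reject bare requests):

# Sketch: locate each movie item and read its title and star block.
# Class names ("item", "title", "star") are those seen in the page source.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://movie.douban.com/top250',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all(attrs={'class': 'item'}):
    title = item.find(attrs={'class': 'title'}).get_text()            # first <span class="title">
    star = item.find(attrs={'class': 'star'}).get_text()              # rating + number of ratings
    print(title, star.strip().replace('\n', ' '))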
The next thing to note is that crawling also needs to handle paging, usually in one of two ways:
1. Click "next page", analyze the resulting URLs, and work out the rule between them;
2. Use Selenium to grab the page-number buttons and click them to jump automatically.
As shown below, clicking different page numbers gives these URLs:
Page 2 URL: https://movie.douban.com/top250?start=25&filter=
Page 3 URL: https://movie.douban.com/top250?start=50&filter=
So each page holds 25 movies and the start offset follows a regular pattern; writing a loop then fetches all the movie information.
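Since the start parameter simply advances by 25 per page, the ten page URLs can be generated directly. A minimal sketch:

# The "start" offset grows by 25 per page: 0, 25, 50, ..., 225.
for i in range(10):
    url = 'https://movie.douban.com/top250?start=%d&filter=' % (i * 25)
    print(url)  # page i+1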
II. BeautifulSoup Crawling Douban Information
Getting started: [Python knowledge] Crawler basics: BeautifulSoup library installation and brief introduction
The specific code is as follows:
# -*- coding: utf-8 -*-
"""
Created on 2016-12-29 22:50
@author: eastmount
"""
import urllib2
import re
from bs4 import BeautifulSoup
import codecs

# Crawler function
def crawl(url):
    page = urllib2.urlopen(url)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")
    print u'Douban Top250 movies: number\tname\trating\tnumber of ratings'
    infofile.write(u"Douban Top250 movies: number\tname\trating\tnumber of ratings\r\n")
    print u'The crawled information is as follows:\n'
    for tag in soup.find_all(attrs={"class": "item"}):
        # Crawl the ranking number
        num = tag.find('em').get_text()
        print num
        # Crawl the movie name
        name = tag.find(attrs={"class": "hd"}).a.get_text()
        name = name.replace('\n', ' ')
        print name
        infofile.write(num + " " + name + "\r\n")
        # Movie titles: the first span is Chinese, the second is English
        title = tag.find_all(attrs={"class": "title"})
        i = 0
        for n in title:
            text = n.get_text()
            text = text.replace('/', '')
            text = text.lstrip()
            if i == 0:
                print u'[Chinese title]', text
                infofile.write(u"[Chinese title] " + text + "\r\n")
            elif i == 1:
                print u'[English title]', text
                infofile.write(u"[English title] " + text + "\r\n")
            i = i + 1
        # Crawl the rating and the number of ratings
        info = tag.find(attrs={"class": "star"}).get_text()
        info = info.replace('\n', ' ')
        info = info.lstrip()
        print info
        mode = re.compile(r'\d+\.?\d*')
        print mode.findall(info)
        i = 0
        for n in mode.findall(info):
            if i == 0:
                print u'[Rating]', n
                infofile.write(u"[Rating] " + n + "\r\n")
            elif i == 1:
                print u'[Reviews]', n
                infofile.write(u"[Reviews] " + n + "\r\n")
            i = i + 1
        # Get the one-line review
        info = tag.find(attrs={"class": "inq"})
        if info:  # movie No.132 ("Lost Lover") has no review
            content = info.get_text()
            print u'[Review]', content
            infofile.write(u"[Review] " + content + "\r\n")
        print ''

# Main function
if __name__ == '__main__':
    infofile = codecs.open("Result_Douban.txt", 'a', 'utf-8')
    url = 'http://movie.douban.com/top250?format=text'
    i = 0
    while i < 10:
        print u'Page', (i + 1)
        num = i * 25  # 25 movies per page; the URL offset grows by 25
        url = 'https://movie.douban.com/top250?start=' + str(num) + '&filter='
        crawl(url)
        infofile.write("\r\n\r\n\r\n")
        i = i + 1
    infofile.close()
The output is as follows:
Douban Top250 movies: number / name / rating / number of ratings
1 The Shawshank Redemption
[English title] The Shawshank Redemption [Rating] 9.6 [Reviews] 761249 [Review] Hope sets people free.
2 Léon (This Killer Is Not Too Cold)
[English title] Léon [Rating] 9.4 [Reviews] 731250 [Review] The story of a strange uncle and a little girl.
3 Farewell My Concubine
[Rating] 9.5 [Reviews] 535808
4 Forrest Gump
[English title] Forrest Gump [Rating] 9.4 [Reviews] 633434 [Review] A modern American history.
5 Life Is Beautiful (La vita è bella)
[English title] La vita è bella [Rating] 9.5 [Reviews] 364132 [Review] The most beautiful lie.
6 Spirited Away
[English title] Spirited Away [Rating] 9.2 [Reviews] 584559 [Review] The best Miyazaki, the best Hisaishi.
The results are also written to the file Result_Douban.txt.
III. Selenium Crawling Information and Chrome Driver Introduction
To get started, see my previous article: [Python crawler] Selenium automatic login and element locating introduction
The code looks like this:
# -*- coding: utf-8 -*-
"""
Created on 2016-12-29 22:50
@author: eastmount
"""
import time
import re
import sys
import codecs
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Crawler function
def crawl(url):
    driver.get(url)
    print u'Douban Top250 movies: number\tname\trating\tnumber of ratings'
    infofile.write(u"Douban Top250 movies: number\tname\trating\tnumber of ratings\r\n")
    print u'The crawled information is as follows:\n'
    content = driver.find_elements_by_xpath("//div[@class='item']")
    for tag in content:
        print tag.text
        infofile.write(tag.text + "\r\n")
    print ''

# Main function
if __name__ == '__main__':
    driver = webdriver.Firefox()
    infofile = codecs.open("Result_Douban.txt", 'a', 'utf-8')
    url = 'http://movie.douban.com/top250?format=text'
    i = 0
    while i < 10:
        print u'Page', (i + 1)
        num = i * 25  # 25 movies per page; the URL offset grows by 25
        url = 'https://movie.douban.com/top250?start=' + str(num) + '&filter='
        crawl(url)
        infofile.write("\r\n\r\n\r\n")
        i = i + 1
    infofile.close()
This code automatically launches the Firefox browser and then crawls the content, calling crawl() once per page.
At the same time, you can also write the crawled text to a file, as shown, or locate specific sub-nodes directly instead of dumping each whole item; the approach is similar.
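For instance, each sub-node can be located with a relative XPath. A hedged sketch using the same old-style Selenium API as the code above (the class name rating_num is an assumption taken from Douban's page source):

# Sketch: read title and rating per item instead of the whole tag.text.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://movie.douban.com/top250')
for item in driver.find_elements_by_xpath("//div[@class='item']"):
    # A relative XPath (".//") searches inside the current item only.
    title = item.find_element_by_xpath(".//span[@class='title']").text
    rating = item.find_element_by_xpath(".//span[@class='rating_num']").text
    print(title, rating)
driver.quit()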
Calling the Chrome browser instead requires placing a chromedriver.exe driver file under the path:
C:\Program Files (x86)\Google\Chrome\Application
and then calling it. The core code is:
import os
from selenium import webdriver

chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
However, if the ChromeDriver version does not match the installed Chrome version, an error occurs; you need to keep the two versions consistent.
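For reference, newer Selenium releases (4.x) changed this API: the driver path is passed through a Service object rather than as a positional argument, and since Selenium 4.6 the bundled Selenium Manager can download a matching driver automatically. A minimal sketch for that case:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# The path is an example; point it at your own chromedriver.exe.
service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
driver = webdriver.Chrome(service=service)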
To summarize the advantages and disadvantages of the two approaches: BeautifulSoup is faster and its code structure is cleaner, but crawling sites such as CSDN blogs may return a Forbidden error; Selenium drives a real browser, so it handles dynamic pages and automated mouse and keyboard actions more conveniently, but it is slower, especially when the browser is launched repeatedly.
Too many things have piled up near the end of the semester recently, so I have had little time to write blog posts, which is honestly a bit sad. Fortunately I met her; even in a busy schedule she lets me taste some sweetness and keeps me company while I work.
Needless to say, we can feel the love and warmth in each other's words and deeds. Follow you~
(By: Eastmount 2016-12-30 night 12:30, http://blog.csdn.net/eastmount/)