Getting started with Python crawlers | Crawl Douban movie information

This is a free Python crawler course for complete beginners, with only seven sections. It first gives you a basic understanding of crawlers starting from zero, then has you crawl real resources by following along with the course content. Read the article, open your computer, and practice hands-on; each section takes about 45 minutes on average. If you want to, you can step through the crawler's door today~

Well then, let's formally begin our second lesson, "Crawl Douban movie information"! Cheer up and look at the blackboard.

1. Crawler principles

1.1 Crawler fundamentals

Having heard so much about crawlers, what exactly is a crawler? How does a crawler work? Let's start with the crawler principles. A crawler, also known as a web spider, is a program or a script. The key point is that it can automatically obtain web information according to certain rules.

The general framework of a crawler is as follows:
1. Select the seed URLs;
2. Place these URLs in the queue of URLs to be crawled;
3. Take a URL from that queue, download the page, and store it in the downloaded-pages library; also move the URL into the already-crawled queue;
4. Analyze the pages in the already-crawled queue, place newly found URLs into the to-be-crawled queue, and enter the next loop.
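To make the loop above concrete, here is a minimal sketch in Python (the seed URL and the omitted parsing step are placeholders for illustration, not the course's own code):

```python
from collections import deque

import requests

seed_urls = ['https://movie.douban.com/subject/1292052/']  # step 1: seed URLs
to_crawl = deque(seed_urls)   # step 2: queue of URLs to be crawled
crawled = set()               # the already-crawled queue
pages = {}                    # the downloaded-pages "library"

while to_crawl:
    url = to_crawl.popleft()              # step 3: take a URL from the queue...
    if url in crawled:
        continue
    pages[url] = requests.get(url).text   # ...download and store the page
    crawled.add(url)
    # step 4: parse pages[url] for new URLs (omitted here) and append
    # them to to_crawl, entering the next loop

print(len(pages), 'page(s) downloaded')
```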
Cough, cough~
Still too abstract? Let's use a concrete example to illustrate!
1.2 A crawler example
A crawler obtains web information on exactly the same principle as a person doing it manually. For example, suppose we want to get a movie's rating information:

Manual steps: open the movie's page, locate (find) the rating information, then copy and save the rating data we want.

Crawler steps: request and download the movie page, parse it and locate the rating information, then save the rating data.

Doesn't that feel very similar?

1.3 The basic crawler workflow

Simply put, after we send a request to the server, we get the returned page; by parsing the page we can extract the part of the information we want and store it in a specified document or database. In this way, we have "crawled" the information we want.

2. Crawl Douban movies with Requests + XPath

There are many Python packages for crawling: urllib, requests, bs4... We start with Requests + XPath because it is so easy to get started with! After studying it, you may find that BeautifulSoup is actually a little harder. Here we use Requests + XPath to crawl Douban movies.

2.1 Install the Python packages: requests and lxml

If this is your first time using Requests + XPath, you need to install two packages first: requests and lxml. Enter the following two lines in the terminal (the installation method was covered in section 1):
```
pip install requests
pip install lxml
```
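If you want to make sure both packages installed correctly, a quick optional sanity check (not part of the original lesson) is:

```python
# If both imports succeed, the installation worked.
import requests
from lxml import etree

print(requests.__version__)  # e.g. '2.31.0'
print(etree.__version__)     # lxml's etree version string
```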
2.2 Import the Python modules we need

We write the code in Jupyter, and first import the two modules we need:

```python
import requests
from lxml import etree
```

In Python, "import + library name" imports a library directly, while "from + library name + import + name" imports a specific name from inside the library. Here we use requests to download the web page, and lxml.etree to parse it.

2.3 Fetch and analyze the Douban movie page

The information we want to crawl is on the Douban page for the movie "The Shawshank Redemption", whose address is: https://movie.douban.com/subject/1292052/
Given the URL, use the requests.get() method to obtain the text of the page, and parse the downloaded page data with etree.HTML():

```python
url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
```
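One caveat: Douban may reject requests that don't look like they come from a browser. If data comes back empty or as an error page, a common workaround (an assumption on my part, not part of the original lesson; the header value is illustrative) is to send a browser-like User-Agent and check the status code:

```python
# Workaround sketch: send a browser-like User-Agent header.
headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative value
resp = requests.get(url, headers=headers)
print(resp.status_code)                  # 200 means the download succeeded
data = resp.text
s = etree.HTML(data)
```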
2.4 Get the movie name

Get the element's XPath information, then obtain its text:

```python
film = s.xpath("element's XPath information/text()")
```

The "element's XPath information" here needs to be obtained manually. The way to get it: locate the target element on the page, right-click > Inspect (or use the shortcut Shift+Ctrl+C), and move the mouse over the element to see the corresponding page code:
Then we copy the element's XPath information:

```
//*[@id="content"]/h1/span[1]
```

Append /text() to it, pass it to s.xpath(), and print the result:

```python
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)
```
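Note that s.xpath() returns a list of matching text nodes, so film is a list. If you want the bare string (an optional extra step, not in the original lesson), take the first element and strip the whitespace:

```python
# xpath() returns a list; index into it to get the bare string.
film_name = film[0].strip()
print(film_name)
```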
2.5 Code and running results

The complete code is as follows:

```python
import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
print(film)
```
Run the full code in Jupyter and the results are as follows:
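(The original screenshot is missing; assuming the page is unchanged, the output should look roughly like this:)

```
['肖申克的救赎 The Shawshank Redemption']
```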
At this point, we have finished writing the code that crawls the "movie name" from the Douban page of "The Shawshank Redemption"; you can run it in Jupyter.

2.6 Get other element information

Besides the movie name, we can also obtain the director, the stars, the running time and other information in a similar way. The code is as follows:

```python
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')   # director
actor1 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[1]/text()')  # star 1
actor2 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[2]/text()')  # star 2
actor3 = s.xpath('//*[@id="info"]/span[3]/span[2]/a[3]/text()')  # star 3
time = s.xpath('//*[@id="info"]/span[13]/text()')                # running time
```
Looking at the code above, when getting different "star" entries, only the "x" in "a[x]" differs. In fact, to get all the "star" information at once, just use "a" without an index. The code is as follows:

```python
actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')  # all stars
```
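Since actor now holds every star's name as a list, you can (an optional touch, not in the original lesson) join it into one readable string:

```python
# actor is a list of names; join it for nicer printing.
print('Starring: ' + ' / '.join(actor))
```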
The complete code is as follows:

```python
import requests
from lxml import etree

url = 'https://movie.douban.com/subject/1292052/'
data = requests.get(url).text
s = etree.HTML(data)
film = s.xpath('//*[@id="content"]/h1/span[1]/text()')
director = s.xpath('//*[@id="info"]/span[1]/span[2]/a/text()')
actor = s.xpath('//*[@id="info"]/span[3]/span[2]/a/text()')
time = s.xpath('//*[@id="info"]/span[13]/text()')

print('Movie name:', film)
print('Director:', director)
print('Starring:', actor)
print('Running time:', time)
```
Run the full code in Jupyter and the results are as follows:
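(The original screenshot is missing; assuming the page content is unchanged, the output should look roughly like this, with the star list abbreviated:)

```
Movie name: ['肖申克的救赎 The Shawshank Redemption']
Director: ['弗兰克·德拉邦特']
Starring: ['蒂姆·罗宾斯', '摩根·弗里曼', ...]
Running time: ['142分钟']
```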
- About Requests

The official introduction of the Requests library contains this sentence: "Requests is the only Non-GMO HTTP library for Python, safe for human consumption." This statement directly and domineeringly declares that Requests is Python's best HTTP library. Why does it have such confidence? If you are interested, please read the official Requests documentation.

The seven commonly used Requests methods:
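The table from the original is missing here; the seven request-sending methods that the requests library exposes are shown in this short sketch (httpbin.org is an echo service used purely for illustration):

```python
import requests

url = 'https://httpbin.org/anything'

requests.request('GET', url)           # the base method underlying all the others
requests.get(url)                      # fetch a resource (what we used above)
requests.head(url)                     # fetch only the response headers
requests.post(url, data={'k': 'v'})    # submit data to the server
requests.put(url, data={'k': 'v'})     # replace a resource
requests.patch(url, data={'k': 'v'})   # partially modify a resource
requests.delete(url)                   # delete a resource
```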
- About the parsing artifact XPath

XPath is the XML Path Language, a language for addressing parts of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in the document tree. Initially, the intention was for XPath to serve as a common syntax model between XPointer and XSL, but developers quickly adopted it as a small query language in its own right. You can read this document for more information on XPath.

The process of parsing a web page with XPath:
1. First obtain the web page data through the requests library;
2. Parse the page to get the data you want, or new links;
3. Parsing can be done with XPath or other parsing tools; XPath is a very handy one.
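As a tiny self-contained illustration (the HTML fragment here is made up), XPath expressions address nodes in the parsed tree:

```python
from lxml import etree

# A made-up HTML fragment, only to demonstrate XPath addressing.
html = '<div id="info"><span>Director</span><a href="/p/1">Frank Darabont</a></div>'
tree = etree.HTML(html)

print(tree.xpath('//div[@id="info"]/span/text()'))  # ['Director']
print(tree.xpath('//a/@href'))                      # ['/p/1']
print(tree.xpath('//a/text()'))                     # ['Frank Darabont']
```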
A comparison of common web page parsing methods:
- Regular expressions: harder to use, higher learning cost;
- BeautifulSoup: slower and somewhat harder than XPath, but useful in certain scenarios;
- XPath: easy to use and fast (XPath parsing is part of lxml), the best choice for getting started.
All right, that's it for this lesson!