I. Simple crawler architecture
Crawler scheduler: starts the crawler, stops the crawler, and monitors its operation.
URL manager: manages the URLs to be crawled and the URLs already crawled, and can hand a URL that is due to be crawled to the web page downloader.
Web page downloader: downloads the page at the given URL, stores it as a string, and passes it to the web parser.
Web parser: parses a web page to extract ① the valuable data and ② the URLs the page contains that point to other pages; these are added back to the URL manager, and the cycle continues.
II. Dynamic operation flow of the simple crawler architecture
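A minimal sketch of this flow, assuming a url_manager object with the operations described in section III below, and download/parse functions standing in for the downloader and parser components (all names here are illustrative, not from the source):

# -*- coding: utf-8 -*-
# Sketch of the crawl loop: take a URL from the manager, download the page,
# parse out data and new URLs, and feed the new URLs back into the manager.
def craw(root_url, url_manager, download, parse):
    url_manager.add_new_url(root_url)
    while url_manager.has_new_url():
        new_url = url_manager.get_new_url()              # URL moves to the crawled set
        html_cont = download(new_url)                    # downloader: page content as a string
        new_urls, new_data = parse(new_url, html_cont)   # parser: valuable data plus new URLs
        url_manager.add_new_urls(new_urls)               # the cycle continues
        # new_data would be collected or stored here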
III. Crawler URL management
URL manager: manages the set of URLs to be crawled and the set of URLs already crawled, to prevent duplicate crawling.
Its operations: add a new URL to the to-crawl set, check whether a URL to be added is already in the container, check whether there are still URLs left to crawl, get a URL to crawl, and move a URL from the to-crawl set to the crawled set.
IV. How to implement the crawler URL manager
Three ways to implement the URL manager: in memory, in a relational database, or in a cache database.
In memory, Python's set() is used, which automatically removes duplicate elements; in MySQL, an is_crawled column marks whether a URL has been crawled; the Redis database likewise uses its set type.
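A minimal sketch of the in-memory variant, implementing the operations listed above with two Python sets (the class and method names are illustrative, not from the source):

# -*- coding: utf-8 -*-
# In-memory URL manager sketch: one set for URLs still to crawl and one for
# URLs already crawled, so no URL is fetched twice.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # add a single URL only if it is in neither set
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # add a batch of URLs parsed out of a page
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # take one URL to crawl and move it to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url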
V. Crawler web downloader
Python web downloaders include:
1) urllib2 (Python's official basic module; in Python 3.x the urllib and urllib2 libraries were merged into the urllib library)
2) requests (a third-party package, more powerful)
Three ways to download a web page with urllib2:
Method one: pass the URL directly to the urlopen() method of the urllib2 module.
Method two: add data and an HTTP header.
Pass the three parameters url, data, and header to urllib2's Request class to create a Request object, then pass that Request object to the urlopen() method to send the web request.
Method three: add a handler for special scenarios.
Some web pages require the user to log in before they can be accessed; this calls for cookie handling, which HTTPCookieProcessor provides.
Some web pages can only be reached through a proxy; ProxyHandler handles this.
Some web pages use encrypted HTTPS access; HTTPSHandler handles this.
Some URLs in web pages automatically redirect to one another; HTTPRedirectHandler handles this.
These handlers are passed to the build_opener() method to create an opener object, the opener is installed into urllib2 with install_opener(), and then urlopen() is used to send the web request and download the page.
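As one illustration, a minimal sketch of the proxy case (the proxy address is a placeholder, not from the source):

# -*- coding: utf-8 -*-
# Sketch: install a ProxyHandler so that urllib2 requests go through a proxy.
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})  # placeholder proxy
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)          # from now on urlopen() uses the proxy
response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()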
Example: enhancing urllib2 with cookie handling (method three below).
VI. Web downloader urllib2: example code for the three methods
Method One:
# -*- coding: utf-8 -*-
# Method one: pass the URL directly to urlopen()
import urllib2

url = 'http://www.baidu.com'
response1 = urllib2.urlopen(url)
print response1.getcode()      # print the status code to check whether the request succeeded
print len(response1.read())    # print the length of the returned page content
Method Two:
# -*- coding: utf-8 -*-
import urllib2

url = 'http://www.baidu.com'
request = urllib2.Request(url)
# add HTTP header information to the Request object and disguise the crawler as a Mozilla browser
request.add_header("User-Agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())
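The description of method two also mentions a data parameter; a minimal sketch of sending form data through it (the URL and field names are placeholders, not from the source):

# -*- coding: utf-8 -*-
# Sketch: passing form data via the Request object's data parameter,
# which makes urllib2 send a POST request. URL and fields are placeholders.
import urllib
import urllib2

url = 'http://www.example.com/login'
data = urllib.urlencode({'username': 'foo', 'password': 'bar'})
request = urllib2.Request(url, data)
request.add_header("User-Agent", "Mozilla/5.0")
response = urllib2.urlopen(request)
print response.getcode()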
Method Three:
First, create a cookie container cj (a cookielib CookieJar).
Then create an opener: pass the container to the urllib2.HTTPCookieProcessor() method to create a handler, and pass that handler to the build_opener() method.
Next, install the opener into urllib2; at this point urllib2 has the enhanced ability to handle cookies.
Finally, use the urlopen() method to access the URL as usual.
The contents of the cookie and the content of the web page can then be printed out.
# -*- coding: utf-8 -*-
import urllib2
import cookielib

url = 'http://www.baidu.com'
cj = cookielib.CookieJar()                                      # cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # handler -> opener
urllib2.install_opener(opener)                                  # give urllib2 cookie handling
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj                   # print the contents of the cookie
print response3.read()     # print the page content
VII. Crawler web parser
Web parser: a tool for extracting valuable data from web pages.
Four ways to parse web pages in Python:
Fuzzy matching:
1. Regular expressions (the most intuitive way: treat the web page or document as a string and extract the valuable data by fuzzy matching; this becomes cumbersome when the document is complex. A short sketch follows this list.)
Structured parsing (the entire page is loaded into a DOM tree, which is then traversed and accessed level by level):
2. html.parser (a built-in Python module)
3. BeautifulSoup (a third-party package that can use html.parser or lxml as its parser)
4. lxml (a third-party package that can parse HTML or XML pages)
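A minimal sketch of the regular-expression approach, extracting link targets from an HTML string (the HTML snippet is made up for illustration):

# -*- coding: utf-8 -*-
# Sketch: fuzzy matching with a regular expression, treating the page as a
# plain string. The HTML snippet here is only an illustration.
import re

html = '<a href="/view/123.htm">Python</a> <a href="/view/456.htm">Crawler</a>'
links = re.findall(r'href="(/view/\d+\.htm)"', html)   # capture the href values
print links    # ['/view/123.htm', '/view/456.htm']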
VIII. Using BeautifulSoup4
BeautifulSoup is a third-party Python library for extracting data from HTML or XML.
Installing BeautifulSoup4:
1) Open a cmd command window
2) Change to the Scripts folder under the Python installation directory:
cd C:\Python27\Scripts
3) Enter dir to confirm that pip.exe is installed
4) Enter pip install beautifulsoup4 to install BeautifulSoup4
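Optionally, a quick check that the installation worked can be run from a Python prompt:

import bs4
print bs4.__version__   # prints the installed BeautifulSoup4 version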
Syntax for BeautifulSoup:
1) Create a BeautifulSoup object based on the downloaded HTML page, and when the object is created, the entire document is loaded into a DOM tree.
2) Nodes can then be searched for in this DOM tree. There are two search methods:
the find_all method and the find method. The two methods take the same parameters; find_all returns all nodes that satisfy the criteria, while find returns only the first matching node.
When searching for a node, you can match on the node's name, attributes, or text.
3) Once you have the node, you can access the node's name, attributes, and content text.
Example: an <a> link can be searched for and accessed in three different ways.
The code is as follows:
1. Create the BeautifulSoup object and load the DOM tree at the same time
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(html_doc,               # HTML document string: a page fetched by the downloader or a local HTML file, assigned beforehand
                     'html.parser',          # HTML parser
                     from_encoding='utf-8')  # encoding of the HTML document
2. Search for nodes (find_all, find)
# method: find_all(name, attrs, string)
import re   # needed for the regular-expression match below

# find all nodes with tag a
soup.find_all('a')

# find all a tags whose link is of the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))   # a regular expression can be used to match

# find all div nodes whose class is abc and whose text is Python
# (class is a Python keyword, so class_ is used to avoid the conflict)
soup.find_all('div', class_='abc', string='Python')
3. Accessing the node's information
# example: the node <a href='1.html'>Python</a>

# get the tag name of the node that was found
node.name
# get the href attribute of the a node that was found
node['href']
# get the link text of the a node that was found
node.get_text()
BeautifulSoup Example Demo:
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup

html_doc = """
"""   # the HTML document string to parse (its content is omitted in the source)

print 'Get all the a links:'
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get the link to Lacie:'
link_node1 = soup.find('a', href='http://example.com/lacie')
print link_node1.name, link_node1['href'], link_node1.get_text()

print 'Match with a regular expression:'
link_node2 = soup.find('a', href=re.compile(r"ill"))
print link_node2.name, link_node2['href'], link_node2.get_text()

print 'Get the text of the specified p paragraph:'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()
Output Result: