Python Development: Simple Crawler (I)


First, the simple crawler architecture:

Crawler scheduler: starts and stops the crawler and monitors its operation.
URL manager: manages the URLs to be crawled and the URLs already crawled, and hands a URL that is due to be crawled to the web page downloader.
Web page downloader: downloads the page at the given URL, stores it as a string, and passes it to the web parser.
Web parser: parses a web page to extract ① the valuable data and ② the URLs the page contains that point to other pages; those URLs can be added back to the URL manager, and the cycle continues.

Second, the dynamic operating flow of the simple crawler architecture
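A minimal sketch of this operating flow, with simplified stand-ins for the three components (the names download_page, parse_page, craw and max_pages are invented for the sketch; the real components are described in the sections below):

# -*- coding: utf-8 -*-
# Rough sketch of the scheduler loop driving URL manager, downloader and parser.
import urllib2
import re

def download_page(url):
    # web page downloader stand-in: fetch the page and return it as a string
    return urllib2.urlopen(url).read()

def parse_page(page_url, html_cont):
    # web parser stand-in: here it only pulls out absolute http links
    new_urls = set(re.findall(r'href="(http://[^"]+)"', html_cont))
    data = (page_url, len(html_cont))        # "valuable data": just the page size here
    return new_urls, data

def craw(root_url, max_pages=5):
    new_urls, old_urls = set([root_url]), set()    # URL manager: to-crawl and crawled sets
    results = []
    while new_urls and len(old_urls) < max_pages:
        url = new_urls.pop()                       # take one URL to crawl
        old_urls.add(url)                          # mark it as crawled
        html_cont = download_page(url)             # download the page as a string
        more_urls, data = parse_page(url, html_cont)   # parse out data and further URLs
        new_urls |= (more_urls - old_urls)         # feed unseen URLs back to the manager
        results.append(data)                       # collect the valuable data
    return results

print craw('http://www.baidu.com')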

Third, crawler URL management

URL manager: manages the collection of URLs to be crawled and the collection of URLs already crawled, to prevent duplicate and circular crawling.

Its tasks: add a new URL to the to-be-crawled collection; determine whether a URL being added is already in either container; determine whether there are still URLs left to crawl; get a URL to crawl; move a URL from the to-be-crawled collection to the crawled collection.

Fourth, how to implement the crawler URL manager

Three ways to implement the URL Manager: memory, relational database, cache database

Storing in memory uses Python's set(), which automatically removes duplicate elements; in a relational database such as MySQL, an is_crawled column marks whether a URL has been crawled; a cache database such as Redis likewise uses its set type.
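A minimal in-memory sketch of such a URL manager, built on two set() collections (the class and method names are assumptions made for the example, not taken from the original article):

# -*- coding: utf-8 -*-
# Minimal in-memory URL manager sketch using two set() collections.

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()      # URLs waiting to be crawled
        self.old_urls = set()      # URLs already crawled

    def add_new_url(self, url):
        # add a URL only if it has not been seen before (deduplication via set)
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # take one URL to crawl and move it to the crawled collection
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

manager = UrlManager()
manager.add_new_url('http://www.baidu.com')
print manager.has_new_url()    # True
print manager.get_new_url()    # http://www.baidu.com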

Fifth, the crawler web downloader

Python web downloaders include:
1) urllib2 (the official basic Python module). In Python 3.x, the urllib and urllib2 libraries are merged into a single urllib library.

2) requests (a third-party package, more powerful)
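For reference, since the examples in this article use Python 2's urllib2, the same basic request in Python 3's merged urllib library would look roughly like this:

# -*- coding: utf-8 -*-
# Python 3 equivalent of the basic urllib2.urlopen() call used in the examples below.
from urllib import request

url = 'http://www.baidu.com'
response = request.urlopen(url)
print(response.getcode())        # status code of the response
print(len(response.read()))      # length of the returned page content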

Three ways for the crawler to download a web page with urllib2:

Method one: pass the URL directly to the urlopen() method of the urllib2 module

Method two: add data and an HTTP header

You can pass the three parameters url, data and header to urllib2's Request class to build a Request object, then pass that Request object to the urlopen() method to send the web request.

Method Three: Add a processor for a special scenario

Some web pages require the user to log in before they can be accessed; in that case cookies must be handled, using HTTPCookieProcessor.

Some web pages require a proxy for access; these are handled with ProxyHandler.

Some web pages are accessed over the encrypted HTTPS protocol; these are handled with HTTPSHandler.

Some URLs automatically redirect to other URLs; these are handled with HTTPRedirectHandler.

These handlers are passed to the build_opener() method to create an opener object; the opener is then installed into urllib2, and the urlopen() method is used as before to send the web request and download the page. A short ProxyHandler sketch follows.
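A minimal sketch of this handler mechanism, assuming a placeholder proxy address (the address below is made up for the example):

# -*- coding: utf-8 -*-
# Sketch: sending a request through a proxy with ProxyHandler.
# The proxy address below is a placeholder, not a real server.
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = urllib2.build_opener(proxy_handler)   # build an opener that uses the proxy
urllib2.install_opener(opener)                 # install it so urlopen() goes through the proxy
response = urllib2.urlopen('http://www.baidu.com')
print response.getcode()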

Example: enhancing urllib2 with cookie handling (shown in full as method three below)

Sixth, example code for the three urllib2 download methods

Method One:

# -*- coding: utf-8 -*-
# Method one: pass the URL directly to urlopen()
import urllib2

url = 'http://www.baidu.com'
response1 = urllib2.urlopen(url)
print response1.getcode()        # print the status code to check whether the request succeeded
print len(response1.read())      # print the length of the returned page content

Method Two:

# -*- coding: utf-8 -*-
# Method two: send the request through a Request object with an HTTP header
import urllib2

url = 'http://www.baidu.com'
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")   # add HTTP header information to disguise the crawler as a Mozilla browser
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())
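The example above only adds a header; the data parameter mentioned in method two can be supplied as well. A rough sketch, with made-up form fields:

# -*- coding: utf-8 -*-
# Sketch: passing data (a POST body) to urllib2.Request, as described in method two.
# The form fields below are made-up placeholders.
import urllib
import urllib2

url = 'http://www.baidu.com'
data = urllib.urlencode({'key': 'value'})     # encode the form data
request = urllib2.Request(url, data=data)     # a Request with a data body is sent as POST
request.add_header("User-Agent", "Mozilla/5.0")
response = urllib2.urlopen(request)
print response.getcode()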

Method Three:

First create a cookie container cj.

Then create an opener: pass the container as a parameter to urllib2's HTTPCookieProcessor() method to create a handler, and pass that handler to the build_opener() method.

Install the opener into urllib2; urllib2 now has the enhanced ability to handle cookies.

Finally, use the urlopen() method to access the URL as usual.

At the end, the contents of the cookie and the content of the web page can be printed out.

# -*- coding: utf-8 -*-
# Method three: add cookie handling before sending the request
import urllib2
import cookielib

url = 'http://www.baidu.com'
cj = cookielib.CookieJar()                                       # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))   # create an opener with cookie handling
urllib2.install_opener(opener)                                   # install the opener into urllib2
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj                  # print the contents of the cookie container
print response3.read()    # print the page content

Seventh, the crawler web parser

Web parser: a tool for extracting valuable data from web pages

Four ways to parse web pages in Python:
Fuzzy matching:
1. Regular expressions (the most direct way: the web page or document is treated as a string and the valuable data is extracted by fuzzy matching, which becomes cumbersome when the document is complex; a short sketch follows this list)
Structured parsing (the whole page is loaded into a DOM tree, which is traversed and accessed level by level):
2. html.parser (a built-in Python module)
3. BeautifulSoup (a third-party package that can use html.parser or lxml as its parser)
4. lxml (a third-party package that can parse HTML or XML pages)
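As a quick illustration of the fuzzy-matching approach, the sketch below pulls link targets out of a small, made-up HTML string with a regular expression; the structured BeautifulSoup approach is covered in the following sections.

# -*- coding: utf-8 -*-
# Fuzzy matching: extract href values from an HTML string with a regular expression.
# The sample HTML and the pattern are illustrative only.
import re

html = '<a href="/view/123.htm">Python</a> <a href="/view/456.htm">Crawler</a>'
links = re.findall(r'href="(/view/\d+\.htm)"', html)
print links        # ['/view/123.htm', '/view/456.htm']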

Eighth, using BeautifulSoup4

BeautifulSoup is a third-party Python library for extracting data from HTML or XML.

Installing BeautifulSoup4:

1) Open a cmd command window

2) Change into the Scripts directory under the Python installation directory:

cd C:\Python27\Scripts

3) Enter dir to confirm that pip.exe is installed

4) Enter pip install beautifulsoup4 to install BeautifulSoup4

Syntax for BeautifulSoup:

1) Create a BeautifulSoup object from the downloaded HTML page; when the object is created, the entire document is loaded into a DOM tree.

2) Nodes can then be searched for in this DOM tree. There are two search methods:

the find_all method and the find method. The two methods take the same parameters; find_all returns all nodes that satisfy the conditions, while find returns only the first node that satisfies them.

When searching for a node, you can search by the node's name, attributes, or text.

3) Once a node has been obtained, you can access its name, attributes, and text content.

Example: a given a link can thus be found and accessed in these three ways (by name, by attributes, or by text).

The code is as follows:

1. Create the BeautifulSoup object and load the DOM tree at the same time

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# create a BeautifulSoup object from an HTML page string
soup = BeautifulSoup(html_doc,                # the HTML document string, e.g. a page fetched by the downloader or read from a local file, assigned beforehand
                     'html.parser',           # the HTML parser to use
                     from_encoding='utf-8')   # encoding of the HTML document

2. Search for nodes (find_all, find)

# method: find_all(name, attrs, string)

# find all nodes with tag a
soup.find_all('a')

# find all a nodes whose link is of the form /view/123.htm
soup.find_all('a', href='/view/123.htm')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))   # a regular expression can be used to match

# find all div nodes whose class is abc and whose text is Python
soup.find_all('div', class_='abc', string='Python')      # class is a Python keyword, so class_ is used to avoid the conflict

3. Accessing the node's information

# example: the node <a href='1.html'>Python</a> has been found as node

node.name          # get the tag name of the node
node['href']       # get the href attribute of the a node
node.get_text()    # get the link text of the a node

BeautifulSoup Example Demo:

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup

html_doc = """ ... """   # the sample HTML document string (its content was omitted in the source page)

print 'get all the a links:'
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'get the link to Lacie:'
link_node1 = soup.find('a', href='http://example.com/lacie')
print link_node1.name, link_node1['href'], link_node1.get_text()

print 'use regular expression matching:'
link_node2 = soup.find('a', href=re.compile(r"ill"))
print link_node2.name, link_node2['href'], link_node2.get_text()

print 'get the text of the specified p paragraph:'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()

Output: each search above prints the node name, the href attribute (where present), and the node text.
