A tutorial on creating a crawler with the Python urllib and urllib2 modules
Urllib
After working through the basics of Python I still felt lost and short on practice, so I decided to use web crawlers as exercises. After finishing the Sparta Python crawler course, I organized my notes as follows for later review. The whole note consists of the following parts:
- 1. Create a simple Crawler
- 2. Try it out: capture images from Baidu Tieba (Post Bar)
- 3. Summary
1. Create a simple Crawler
Environment Description
- Device: Mac (2012), OS X Yosemite 10.10.1
- Python: python 2.7.9
- Editor: Sublime Text 3
There is nothing to say about this. Just go to the code!
```python
'''
@ urllib is Python's built-in network library
@ urlopen is a method of urllib; it opens a connection and fetches the page,
  and read() then assigns the page content to a variable
'''
import urllib

# lifevc has been sending me very annoying messages lately, so it gets picked on here
url = "http://www.lifevc.com"
html = urllib.urlopen(url)
content = html.read()
html.close()
# print the webpage content
print content
```
It's very simple, and that is the charm of Python: just a few lines of code get the job done.
Of course, merely fetching a web page has no real value on its own, so next we will do something more meaningful.
2. Try it out
Capture images from Baidu Tieba (Post Bar)
This is actually quite simple too: to capture the images, you first need to analyze the source code of the web page (basic HTML knowledge is assumed; Chrome is used as the example browser). The steps are briefly described below for reference.
- Open the web page, right-click, and select "Inspect Element" (the bottom item)
- Click the question-mark icon at the far left of the panel that pops up at the bottom; the icon turns blue
- Move the mouse over and click the image we want to capture (a cute girl)
- The image can then be located in the source code
- Copy that piece of source code
After analysis and comparison (omitted here), we can identify several features of the images to be captured:
- They sit inside img tags
- They carry the class BDE_Image
- The image format is jpg
More on regular expressions will be added in a later update; please stay tuned.
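As a rough illustration of how these features translate into a pattern, here is a quick sketch that tests the expression against a hypothetical img tag; the sample string and its URL are made up for illustration:

```python
import re

# A made-up fragment of page source showing the three features listed above
sample = '<img class="BDE_Image" src="http://example.com/pic/cute_girl.jpg" width="560">'

# Match: class BDE_Image, an src attribute, and a .jpg file name
img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
print re.findall(img_tag, sample)  # ['http://example.com/pic/cute_girl.jpg']
```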
Based on the above analysis, let's go straight to the code.
```python
'''
@ This program is used to download the images in a Baidu Tieba post
@ re is the regular expression library
'''
import urllib
import re

# Fetch the web page html
url = "http://tieba.baidu.com/p/2336739808"
html = urllib.urlopen(url)
content = html.read()
html.close()

# Use a regular expression to match the image features and get the image links
img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
img_links = re.findall(img_tag, content)

# Download the images; img_counter serves as the image counter (file name)
img_counter = 0
for img_link in img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link, "/Users/Sean/Downloads/tieba/%s" % img_name)
    img_counter += 1
```
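One practical note: urlretrieve fails with an IOError if the target directory does not exist, so as a small sketch (assuming the same download path as above) it can be created first:

```python
import os

# Create the download directory if it is missing (same path as in the code above)
download_dir = "/Users/Sean/Downloads/tieba"
if not os.path.isdir(download_dir):
    os.makedirs(download_dir)
```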
Just like that, you can grab all of those images (you know the ones).
3. Summary
From the above two sections we can see that fetching web pages and downloading images is quite easy.
TIPS: When you run into a library or method you are not familiar with, you can get a first impression with the following (see the sketch below):
- dir(urllib)  # list the names defined in the current library
- help(urllib.urlretrieve)  # show the documentation and parameters of the method, straight from the official source
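For example, a quick interactive session might look like this (the exact names printed depend on your Python version):

```python
import urllib

# dir() lists the names defined in the module, e.g. 'urlopen', 'urlretrieve', ...
print dir(urllib)

# help() prints the docstring and parameters of a function
help(urllib.urlretrieve)
```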
You can also look things up at https://docs.python.org/2/library/index.html.
Of course Baidu works too, but it is rather inefficient; http://xie.lu is recommended for related searches (you know, absolutely satisfying).
So far we have covered how to fetch web pages and download images. Next, we will look at how to crawl websites that restrict crawlers.
Urllib2
The previous part described how to fetch web pages and download images; this part deals with sites that restrict crawling.
First, we still use the method from the previous lesson and fetch a website that everyone uses as an example, blog.csdn.net. This article mainly consists of the following parts:
- 1. Capture restricted webpages
- 2. Optimize the code
1. Capture restricted webpages
First, test the knowledge we learned in the previous section:
```python
'''
@ This program is used to fetch the blog.csdn.net web page
'''
import urllib

url = "http://blog.csdn.net/FansUnion"
html = urllib.urlopen(url)
# The getcode() method returns the HTTP status code
print html.getcode()
html.close()

# Output:
# 403
```
The output here is 403, which means access is forbidden; similarly, 200 means the request completed successfully and 404 means the URL was not found.
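As a small sketch that reuses the URL above, the status code can be checked before doing anything else:

```python
import urllib

html = urllib.urlopen("http://blog.csdn.net/FansUnion")
code = html.getcode()
html.close()

# Branch on the common status codes mentioned above
if code == 200:
    print "200: request completed successfully"
elif code == 403:
    print "403: access is forbidden"
elif code == 404:
    print "404: URL not found"
else:
    print "other status code:", code
```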
It is clear that CSDN blocks crawlers, so the page cannot be fetched with the first method. Here we need to bring in a new library: urllib2.
However, we also know that a browser can open the page normally, so if we simulate what the browser does, we can still get the page content.
The old method is as follows:
- Open the web page, right-click, and select "Inspect Element" (the bottom item)
- Click the Network tab of the panel that pops up at the bottom
- Refresh the page; the Network tab will capture a lot of information
- Pick one of the entries and expand the Header of its request
The following is the sorted Header information.
```
Request Method: GET
Host: blog.csdn.net
Referer: http://blog.csdn.net/?ref=toolbar_logo
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36
```
Then, based on the extracted Header information, use urllib2's Request method to simulate a browser submitting the request to the server. The code is as follows:
```python
# coding=utf-8
'''
@ This program is used to fetch a restricted webpage (blog.csdn.net)
@ User-Agent: client browser version
@ Host: server address
@ Referer: the referring (jump-from) address
@ GET: the request method is GET
'''
import urllib2

url = "http://blog.csdn.net/FansUnion"

# Customize the Header to simulate the browser submitting the request to the server
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
req.add_header('Host', 'blog.csdn.net')
req.add_header('Referer', 'http://blog.csdn.net')
req.add_header('GET', url)

# Download the webpage html and print it
html = urllib2.urlopen(req)
content = html.read()
print content
html.close()
```
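As a quick optional check, continuing from the snippet above (req already carries the simulated headers), the status code can be printed to confirm the server no longer refuses the request:

```python
# Re-open the request and check the status code
html = urllib2.urlopen(req)
print html.getcode()  # 200 is expected if the simulated browser headers are accepted
html.close()
```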
If you restrict me, I will just get around your restriction. It is said that as long as a browser can open a page, a crawler can fetch it too.
2. Optimize the code
Simplify the Header submission method
I found that writing so many req.add_header calls every time is a torment. Is there a way to just copy the headers once and reuse them? The answer is yes.
```python
# Input:
help(urllib2.Request)

# Output (only the __init__ method is shown):
# __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

# From this we can see headers={}, i.e. the header information can be submitted as a dictionary. Let's try it!!

# Only the key code is shown
csdn_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Host": "blog.csdn.net",
    "Referer": "http://blog.csdn.net",
    "GET": url,
}
req = urllib2.Request(url, headers=csdn_headers)
```
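The request built from the dictionary is then opened exactly as before; a minimal sketch (assuming url is the same address as above):

```python
# Open the request built from the header dictionary and print the page
html = urllib2.urlopen(req)
print html.read()
html.close()
```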
See? It really is that simple. Here I would like to thank Sparta for the selfless teaching.
Provide dynamic header information
In many cases, the submitted header information is too simple and uniform, and the server rejects the request as coming from a web crawler.
Is there a smarter way to submit some dynamic header data? The answer is certainly yes, and it is very simple. Go straight to the code!
```python
'''
@ This program is used to dynamically submit Header information
@ random: dynamic library, for details see ...
'''
```
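The body of that snippet is sketched below under the assumption that "dynamic" simply means picking the User-Agent at random from a small pool for each request; the pool, URL and header values here are illustrative:

```python
# coding=utf-8
import urllib2
import random

url = "http://blog.csdn.net/FansUnion"

# A small pool of browser User-Agent strings (illustrative values)
user_agents = [
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36',
]

# Pick one at random so that successive requests do not all look identical
csdn_headers = {
    "User-Agent": random.choice(user_agents),
    "Host": "blog.csdn.net",
    "Referer": "http://blog.csdn.net",
}

req = urllib2.Request(url, headers=csdn_headers)
html = urllib2.urlopen(req)
print html.getcode()
html.close()
```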
It is actually very simple, and with that we have completed a bit of code optimization.