Use the Python urllib and urllib2 modules to create a crawler: an example tutorial



Urllib
After working through the basics of Python I was still confused: close my eyes and my mind went blank. What was missing was practice, so I am using crawlers to train my hands. After finishing the Sparta Python crawler course, I have organized my experiences as follows for later review. The whole note consists of the following parts:

  • 1. Create a simple Crawler
  • 2. Try it out: capture pictures from a Baidu Tieba post
  • 3. Summary

1. Create a simple Crawler
Environment Description

  • Device: MacBook (2012), OS X Yosemite 10.10.1
  • Python: python 2.7.9
  • Editor: Sublime Text 3

There is nothing to say about this. Just go to the code!

'''
@ urllib is Python's built-in network library
@ urlopen is a method of urllib; it opens a connection and fetches a webpage,
  and read() then assigns the page content to a variable
'''
import urllib

url = "http://www.lifevc.com"  # lifevc was chosen because it has been rather annoying lately
html = urllib.urlopen(url)
content = html.read()
html.close()
# Print the webpage content
print content

It's very simple, and that is the charm of Python: just a few lines of code get the job done.
Of course, merely fetching a webpage has little practical value on its own, so next we will do something more meaningful.

2. Try it out
Capture pictures from a Baidu Tieba post
This is actually also quite simple: to capture images, you first need to analyze the source code of the webpage
(some basic HTML knowledge is assumed here; Chrome is used as the example browser).
The steps are briefly described below for reference.

Open the webpage, right-click, and select "Inspect Element" (the bottom item).
Click the question mark on the far left of the panel that pops up below; it turns blue.
Move the mouse over and click the image we want to capture (a cute girl).
Then we can position the image in the source code.

Copy the source code

 

After analyzing and comparing a few of these snippets (omitted here), we can identify several features of the images to be captured:

  • They are inside an img tag
  • They carry the class BDE_Image
  • The image format is jpg

I will explain the regular expression in more detail in a later update; please note.
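To make the pattern concrete before the full program, here is a quick check of that regular expression against a made-up img tag (the tag and image URL below are illustrative, not copied from the actual post):

import re

# An illustrative img tag in the style of the ones found in the page source
sample = '<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/pic/item/example.jpg" width="560">'

img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
print re.findall(img_tag, sample)
# Output: ['http://imgsrc.baidu.com/forum/pic/item/example.jpg']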

Based on the above observations, let's go straight to the code:

'''
@ This program downloads the images from a Baidu Tieba post
@ re is the regular expression library
'''
import urllib
import re

# Fetch the webpage html
url = "http://tieba.baidu.com/p/2336739808"
html = urllib.urlopen(url)
content = html.read()
html.close()

# Use a regular expression to match the image features and collect the image links
img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
img_links = re.findall(img_tag, content)

# Download the images; img_counter is the image counter (used as the file name)
img_counter = 0
for img_link in img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link, "/Users/Sean/Downloads/tieba/%s" % img_name)
    img_counter += 1
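One practical note of my own (not from the original course): urllib.urlretrieve raises an IOError if the target directory does not exist, so it is safer to create the download folder first. A minimal sketch, using the same path as above:

import os

download_dir = "/Users/Sean/Downloads/tieba"
# Create the target directory first so urlretrieve has somewhere to write to
if not os.path.exists(download_dir):
    os.makedirs(download_dir)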

And with that, the images you want are captured.

3. Summary
From the above two sections, we can see that it is easy to fetch webpages and download images.
TIPS: if you run into a library or method you are not familiar with, the following can give you a quick first look (a short interactive session is shown after the list):

  • dir(urllib) # list the names (methods and attributes) defined in the library
  • help(urllib.urlretrieve) # show the documentation and parameters of the method; this is the official, authoritative reference
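For example, a quick interactive session might look like the following (the dir() output is abbreviated here; only the idea matters):

>>> import urllib
>>> dir(urllib)                 # lists the names defined in the urllib module
['ContentTooShortError', 'FancyURLopener', 'URLopener', ..., 'urlopen', 'urlretrieve']
>>> help(urllib.urlretrieve)    # shows the docstring and parameters of urlretrieve
Help on function urlretrieve in module urllib:

urlretrieve(url, filename=None, reporthook=None, data=None)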

You can also search for related information at https://docs.python.org/2/library/index.html.

Of course Baidu search works too, but it is not very efficient. I recommend http://xie.lu for this kind of search (you know why; absolutely satisfying).
That covers how to fetch webpages and download images. Next, we will look at how to crawl websites that restrict crawlers.

Urllib2
The previous section described how to fetch webpages and download images. In this section we will first use the same method to fetch <blog.csdn.net>, a site everyone knows, as an example. This part consists of the following sections:

  • 1. Capture restricted webpages
  • 2. Optimize the code

1. Capture restricted webpages

First, test the knowledge we learned in the previous section:

'''
@ This program tries to fetch the blog.csdn.net webpage
'''
import urllib

url = "http://blog.csdn.net/FansUnion"
html = urllib.urlopen(url)
# The getcode() method returns the HTTP status code
print html.getcode()
html.close()

# Output:
403

Here the output is 403, which means access is denied; by comparison, 200 means the request completed successfully and 404 means the URL was not found.
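As a small aside of my own (not part of the original code), a check on getcode() can turn those numbers into readable messages:

import urllib

url = "http://blog.csdn.net/FansUnion"
html = urllib.urlopen(url)
code = html.getcode()
if code == 200:
    print "200: the request completed successfully"
elif code == 403:
    print "403: access is denied"
elif code == 404:
    print "404: the URL was not found"
else:
    print "Got HTTP status code %d" % code
html.close()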
Clearly csdn blocks crawlers, so the method from the first section cannot fetch the page; this is where a new library comes in: urllib2.
However, the browser can still open the page, so if we simulate what the browser does, we should be able to get the page content.
The familiar routine is as follows:

  • Open the webpage, right-click, and select "Inspect Element" (the bottom item).
  • Click the Network tab of the panel that pops up below.
  • Refresh the page; the Network tab captures a lot of requests.
  • Pick any one of the requests and expand its Request Headers.

The cleaned-up header information is as follows.

Request Method: GET
Host: blog.csdn.net
Referer: http://blog.csdn.net/?ref=toolbar_logo
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36

Then, based on the extracted header information, we use urllib2's Request method to simulate a browser submitting the request to the server. The code is as follows:

# coding=utf-8
'''
@ This program is used to fetch a restricted webpage (blog.csdn.net)
@ User-Agent: the client browser version
@ Host: the server address
@ Referer: the referring (jump-from) address
@ GET: the request method is GET
'''
import urllib2

url = "http://blog.csdn.net/FansUnion"

# Build a custom header to simulate a browser submitting the request to the server
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
req.add_header('Host', 'blog.csdn.net')
req.add_header('Referer', 'http://blog.csdn.net')
req.add_header('GET', url)

# Download the webpage html and print it
html = urllib2.urlopen(req)
content = html.read()
print content
html.close()
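One detail worth noting (my own addition, not from the original tutorial): unlike urllib.urlopen, urllib2.urlopen raises an exception when the server answers with an error status, so wrapping the call in try/except makes failures easier to diagnose. A minimal sketch, reusing the req object built above:

import urllib2

try:
    html = urllib2.urlopen(req)          # req is the Request object built above
    print html.read()
    html.close()
except urllib2.HTTPError as e:
    # The server answered, but with an error status such as 403 or 404
    print "The server returned HTTP status %d" % e.code
except urllib2.URLError as e:
    # The server could not be reached at all
    print "Failed to reach the server:", e.reason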

If you restrict me, I will just get around your restriction. As they say, anything a browser can access, a crawler should be able to crawl.

2. Optimize the code
Simplify the Header submission method
Writing all those req.add_header() calls every time is torture. Is there a way to just paste the headers in once and reuse them? The answer is yes.

# Input:
help(urllib2.Request)

# Output (only the __init__ method is shown):
# __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

# From headers={} we can see that the header information can be submitted as a dictionary.
# Let's try it! (only the key code is shown)
csdn_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Host": "blog.csdn.net",
    "Referer": "http://blog.csdn.net",
    "GET": url
}
req = urllib2.Request(url, headers=csdn_headers)

See, isn't that simple? Here I would like to thank Sparta for the selfless inspiration.

Provide dynamic header information
In many cases the submitted header information is too uniform and gets rejected by the server as coming from a web crawler.
Is there a smarter way to submit some dynamic header data? The answer is yes, and it is very simple, so let's go straight to the code!

'''
@ This program is used to dynamically submit header information
@ random is the standard randomization library; see help(random) for details
'''
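A minimal sketch of such a program, assuming that "dynamic" here simply means picking a random User-Agent from a small pool for each request (the User-Agent pool below is my own illustrative choice, not the course's original code):

import random
import urllib2

url = "http://blog.csdn.net/FansUnion"

# A small, illustrative pool of User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; rv:34.0) Gecko/20100101 Firefox/34.0',
]

# Pick one User-Agent at random for this request
dynamic_headers = {
    'User-Agent': random.choice(user_agents),
    'Host': 'blog.csdn.net',
    'Referer': 'http://blog.csdn.net',
}
req = urllib2.Request(url, headers=dynamic_headers)
html = urllib2.urlopen(req)
print html.read()
html.close()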

It really is that simple. With that, we have finished this round of code optimization.

