Writing crawlers in Python with the urllib, urllib2, and cookielib modules

This article introduces crawler practice with Python's urllib, urllib2, and cookielib modules, with examples that crawl Douban local events and log in to a library system to query borrowed books.
The Hypertext Transfer Protocol (HTTP) forms the foundation of the World Wide Web. It uses URIs (Uniform Resource Identifiers) to identify data on the Internet; a URI that specifies the address of a document is called a URL (Uniform Resource Locator). Common URLs point to files, directories, or objects that perform complex tasks (such as database searches and Internet searches). In essence, a crawler accesses and operates on these URLs to obtain the desired content. For crawlers without demanding business requirements, the urllib, urllib2, and cookielib modules are enough.
urllib2 is not simply an upgraded version of urllib. Although both are URL-related modules and you should prefer the urllib2 interfaces where possible, urllib2 cannot fully replace urllib: processing URL resources sometimes requires functions that only urllib provides (such as urllib.urlencode) to prepare data. The general idea, however, is the same: both wrap the low-level interfaces so that we can read URLs almost like local files.
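As a minimal sketch of how the two modules cooperate (the target URL and form fields here are made up for illustration), urllib.urlencode prepares the POST body while urllib2 builds and sends the request:

import urllib
import urllib2

# urllib.urlencode turns a dict into an application/x-www-form-urlencoded string
data = urllib.urlencode({'q': 'python', 'page': '1'})   # hypothetical form fields

# urllib2 builds and sends the request; passing data makes it a POST
req = urllib2.Request('http://www.example.com/search', data)
resp = urllib2.urlopen(req)
print resp.read()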
The following code obtains the content of the Baidu home page:

import urllib2

connect = urllib2.Request('http://www.baidu.com')
url1 = urllib2.urlopen(connect)
print url1.read()

After just these four lines are run, the source code of the Baidu page is printed. What is the mechanism behind this?
When we call urllib2.Request, we build an HTTP request for the Baidu URL ("www.baidu.com") and assign the request object to the variable connect. Passing connect to urllib2.urlopen returns a response object, which we assign to url1; we can then operate on url1 much like a local file, for example calling its read() method to read the page's source code.
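Besides read(), the response object returned by urlopen offers a few other handy methods; a small sketch, again against Baidu's home page:

import urllib2

resp = urllib2.urlopen('http://www.baidu.com')
print resp.geturl()    # the final URL, after any redirects
print resp.getcode()   # the HTTP status code, e.g. 200
print resp.info()      # the response headers
print resp.read(200)   # read only the first 200 bytes, like a file object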
In this way, we can write a simple crawler of our own. Below is a crawler I wrote to grab a serialized novel thread from the Tianya forum:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%E6%8B%89%E9%A3%8E%E7%86%8A%E7%8C%AB"
for i in range(1, 481):
    a = urllib2.Request(url1 + str(i) + url3)
    b = urllib2.urlopen(a)
    path = str("D:/noval/Skyeye Uploader" + str(i) + ".html")
    c = open(path, "w+")
    code = b.read()
    c.write(code)
    c.close()
    print "current number of downloaded pages:", i

In fact, passing the URL string directly to urlopen achieves the same effect as the code above:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%E6%8B%89%E9%A3%8E%E7%86%8A%E7%8C%AB"
for i in range(1, 481):
    # a = urllib2.Request(url1 + str(i) + url3)
    b = urllib2.urlopen(url1 + str(i) + url3)
    path = str("D:/noval/Tianyan descendant" + str(i) + ".html")
    c = open(path, "w+")
    code = b.read()
    c.write(code)
    c.close()
    print "current number of downloaded pages:", i

So why build a Request for the URL first? Here we need to introduce the concept of an opener. When urllib2 processes a URL, it actually works through an instance of urllib2.OpenerDirector, which calls on various handlers to do the work: selecting the protocol, opening the URL, handling cookies, and so on. The urlopen method uses a default opener, which is simple and crude; it is not enough once we need to post data, set headers, configure a proxy, and so on.
Therefore, for slightly more demanding tasks, we need urllib2.build_opener() to create our own opener. I will cover this part in detail in the next blog post; a quick taste is sketched below.
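As a minimal sketch (the proxy address and User-Agent string here are placeholders, not values from the original article), a custom opener can combine a proxy handler with preset headers:

import urllib2

# hypothetical local proxy, for illustration only
proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087'})

opener = urllib2.build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]   # headers sent with every request

# install it so that plain urllib2.urlopen() also goes through this opener
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.baidu.com').read()[:100]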

For websites without special requirements, the urllib modules alone are enough to fetch the information we want. But some sites require a simulated login or special permissions, and for those we must handle cookies before we can grab anything. This is where the cookielib module comes in. The cookielib module handles cookies; the commonly used class is CookieJar(), which automatically stores the cookies generated by HTTP requests and automatically adds them to outgoing HTTP requests. As mentioned above, using it requires creating a new opener:

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

After this processing, the cookie problem is solved.
To print the stored cookies, run print cj._cookies.values().
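A CookieJar is also iterable, so here is a small sketch of inspecting the cookies after a request (the target site is just an example):

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('http://www.baidu.com')

# each item is a cookielib.Cookie object
for cookie in cj:
    print cookie.name, '=', cookie.value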

Crawling Douban local events, and logging in to the library system to query borrowed books
Having learned the usage of these urllib-family modules, the next step is practice.

(1) Crawling the Douban website for local events

This link points to the list of Douban local events; we initiate a request to it:

# encoding=utf-8
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

url = "http://beijing.douban.com/events/future-all?start=0"
req = urllib2.Request(url)
event = urllib2.urlopen(req)
str1 = event.read()

We will find that, in addition to the information we need, the returned HTML contains a lot of page-layout code.
We only need the information about each event; to extract it, we need a regular expression.
Regular expressions are a cross-platform tool for string processing; with them we can easily extract what we want from a string.
I will not go into detail here; personally I recommend Yu Sheng's introductory guide to regular expressions, which is suitable for beginners. A small example of the basic syntax follows.
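A minimal sketch of capture groups with Python's re module (the sample string and pattern here are my own, just to illustrate the idea of grouping):

import re

html = '<li class="summary">Poetry reading</li><strong>Free</strong>'   # toy snippet

# each (...) is a capture group; [\d\D]*? lazily matches anything in between
pattern = re.compile(r'summary">([\d\D]*?)</li>[\d\D]*?<strong>([\d\D]*?)</strong>')

m = pattern.search(html)
if m:
    print m.group(1)   # Poetry reading
    print m.group(2)   # Free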

Here, I use capture groups to split each event into its four elements (name, time, location, and cost); parts of the resulting expression, including the literal HTML tags between the groups, did not survive the article's formatting.

The code is as follows:


regex = re.compile(r'summary">([\d\D]*?)[\d\D]*?class="hidden-xs">([\d\D]*?)[\d\D]*?strong>([\d\D]*?)')

In this way, several parts can be extracted.

The overall code is as follows:

# -*- coding: utf-8 -*-
# program: Douban local events crawler
# author: GisLu
# data: 2014-02-08
# -----------------------------------------
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# regular-expression extraction
# NOTE: the HTML tag literals inside the original pattern were lost in formatting,
# so the pattern below is shown incomplete (three groups survive, while the code
# refers to four).
def search(str1):
    regex = re.compile(r'summary">([\d\D]*?)[\d\D]*?class="hidden-xs">([\d\D]*?)[\d\D]*?strong>([\d\D]*?)')
    for i in regex.finditer(str1):
        print "activity name:", i.group(1)
        a = i.group(2)
        b = a.replace(' ', '')
        print b.replace('\n', '')
        print 'activity location:', i.group(3)
        c = i.group(4).decode('utf-8')
        print 'fee:', c

# fetch each page of the event list
for i in range(10):   # iterate over result pages; the original upper bound was lost in formatting
    url = "http://beijing.douban.com/events/future-all?start="
    url = url + str(i * 10)
    req = urllib2.Request(url)
    event = urllib2.urlopen(req)
    str1 = event.read()
    search(str1)

Pay attention to the encoding issue here. Since I am still on Python 2.x, internal Chinese strings have to be converted back and forth; for example, i.group(4).decode('utf-8') must be used to decode the UTF-8 byte string captured in group 4 into a Unicode string, otherwise the output is garbled.
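A tiny sketch of what that decode step does in Python 2 (the sample bytes are mine, the UTF-8 encoding of a Chinese word):

# -*- coding: utf-8 -*-
s = '\xe5\x85\x8d\xe8\xb4\xb9'        # UTF-8 bytes for a Chinese word
u = s.decode('utf-8')                 # byte string -> unicode object
print type(s), type(u)                # <type 'str'> <type 'unicode'>
print u.encode('gb2312')              # re-encode for a GB2312/GBK console if needed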
In Python, the regex module re provides two common global search methods: findall and finditer. findall returns all matches at once as a list, which costs more memory; finditer returns an iterator and searches lazily, which I personally recommend.
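A small sketch of the difference (sample data made up for illustration):

import re

text = "id=1; id=2; id=3"
pattern = re.compile(r'id=(\d+)')

print pattern.findall(text)          # ['1', '2', '3'] -- the whole list is built at once

for m in pattern.finditer(text):     # matches are produced one at a time
    print m.group(1)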

The final result is as follows ~

(2) Simulating login to the library system to query borrowed books

Since we can send requests to a specified website from Python to fetch information, we can naturally also simulate a browser to perform login and similar operations.
The key to the simulation is that the information we send to the target server must have the same format as what the browser sends.
So we need to analyze how the target website receives information, which usually means capturing the packets the browser exchanges with it.
Wireshark is a popular and quite powerful packet-capture tool, but for beginners the developer tools built into IE, Firefox, or Chrome are enough.

Here is an example of my school's library system ~

We can log in to the library management system to check the status of the books we have borrowed. First, we need to capture packets and analyze what information must be sent to successfully simulate the browser's login.

I use the Chrome browser. On the login page, press F12 to open Chrome's built-in developer tools, select the Network tab, and then enter the student ID and password to log in.
Observing the network activity during the login process, we find the suspicious element:

Analyzing this POST request confirms that it is the one that sends the login information (account, password, and so on).
Fortunately, our school's site is rather basic and this packet is not encrypted at all, so the rest is simple: write down the headers and the post data.
The headers contain a lot of useful information. Some websites check the User-Agent to decide whether you are a crawler and whether to allow access; the Referer is often used to prevent hotlinking, so if the Referer in a request does not match the rules set by the administrator, the server refuses to send the resource.
The post data is the information the browser posts to the login server during login. It usually includes the account and password, but often also other fields, such as hidden form values, that we never notice while using a browser; without them, however, the server will not respond to us. A sketch of sending both is shown below.
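A hedged sketch of sending custom headers and post data together with urllib2 (the URL, header values, and form fields below are placeholders, not the actual ones captured from the library system):

import urllib
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0',                  # pretend to be a normal browser
    'Referer': 'http://www.example.com/login'     # hypothetical referer check
}
postdata = urllib.urlencode({'user': 'student_id', 'pw': 'password'})

req = urllib2.Request('http://www.example.com/login.asp', postdata, headers)
print urllib2.urlopen(req).read()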

Now that we know the format of the post data and headers, simulated login is easy:

# -*- coding: utf-8 -*-
# @program: book borrowing query
# @author: GisLu
# @data: 2014-02-08
# -----------------------------------------
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)

# log in to get the cookies
postdata = urllib.urlencode({
    'user': 'x1234564',
    'pw': 'macialala',
    'imagefield.x': '0',
    'imagefield.y': '0'
})
rep = urllib2.Request(url='http://210.45.210.6/dzjs/login.asp', data=postdata)
result = urllib2.urlopen(rep)
print result.geturl()


urllib.urlencode takes care of converting the post data into the right format, while opener.addheaders attaches our preset headers to every subsequent request made through our opener.
After testing, we find that the login succeeds.
What remains is to find the URL of the page that lists the borrowed books, and then use a regular expression to extract the information we need.
The overall code is as follows:

# -*- coding: utf-8 -*-
# @program: book borrowing query
# @author: GisLu
# @data: 2014-02-08
# -----------------------------------------
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)

# log in to get the cookies
postdata = urllib.urlencode({
    'user': 'x11134564',
    'pw': 'macialala',
    'imagefield.x': '0',
    'imagefield.y': '0'
})
rep = urllib2.Request(url='http://210.45.210.6/dzjs/login.asp', data=postdata)
result = urllib2.urlopen(rep)
print result.geturl()

# query the borrowing records
postdata = urllib.urlencode({'cxfs': '1', 'submit1': 'search'})
aa = urllib2.Request(url='http://210.45.210.6/dzjs/jhcx.asp', data=postdata)
bb = urllib2.urlopen(aa)
cc = bb.read()
# NOTE: the closing HTML tag literal in the original pattern was lost in formatting
zhangmu = re.findall('tdborder4>(.*?)', cc)
for i in zhangmu:
    i = i.decode('gb2312')
    i = i.encode('gb2312')
    print i.strip(' ')

The following is the program running result ~
