Writing a crawler in Python with the urllib, urllib2 and cookielib modules

The Hypertext Transfer Protocol (HTTP) forms the foundation of the World Wide Web. It uses URIs (Uniform Resource Identifiers) to identify data on the Internet, and a URI that specifies the address of a document is called a URL (Uniform Resource Locator). A URL commonly points to a file, a directory, or an object that performs a complex task (such as a database lookup or an Internet search). A crawler is essentially a program that accesses and manipulates these URLs to get the content we want. For those of us without special business requirements, most crawling needs can be met with three modules: urllib, urllib2 and cookielib.
The first thing to note is that urllib2 is not an upgraded version of urllib, although both are modules for working with URLs. I personally recommend using the urllib2 interface, but urllib2 cannot completely replace urllib: handling URL resources sometimes requires functions that only urllib provides, such as urllib.urlencode for encoding data. The idea behind both, however, is the same: they wrap the low-level details behind an interface that lets us read URLs as if they were local files.
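
As a quick illustration of why urllib is still needed, here is a minimal sketch of urllib.urlencode turning a dictionary into a query string (the field names and values are made up for the example):

import urllib

params = urllib.urlencode({'wd': 'python', 'page': 1})   # hypothetical form fields
print params                          # e.g. wd=python&page=1 (key order may vary)
print urllib.quote('hello world/')    # percent-encoding a single value: hello%20world/
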
Here is a piece of code that fetches the content of the Baidu homepage:

import urllib2
connect = urllib2.Request('http://www.baidu.com')
url1 = urllib2.urlopen(connect)
print url1.read()

Just four lines, and after running it, it prints the source code of the Baidu page. What is the mechanism behind it?
When we call urllib2.Request, we build an HTTP request for the URL ("http://www.baidu.com") and assign the request object to the variable connect. When we pass connect to urllib2.urlopen, the return value is assigned to url1, and from then on we can operate on url1 much like a local file; for example, here we use the read() function to read the source code behind the URL.
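
The object returned by urlopen supports several file-like and informational methods. A minimal sketch (the URL is just the one from the example above):

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.geturl()            # the final URL (after any redirects)
print response.info()              # the response headers
first_line = response.readline()   # read just one line, like a file object
rest = response.read()             # read the remainder of the body
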
With this, we can write a simple crawler of our own. The following is a crawler I wrote to grab a thread from the Tianya forum:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%e6%8b%89%e9%a3%8e%e7%86%8a%e7%8c%ab"
for i in range(1, 481):
    a = urllib2.Request(url1 + str(i) + url3)
    b = urllib2.urlopen(a)
    path = str("d:/noval/Day-Eye" + str(i) + ".html")
    c = open(path, "w+")
    code = b.read()
    c.write(code)
    c.close()

In fact, the code above can achieve the same effect with urlopen alone, skipping the explicit Request step:

import urllib2

url1 = "http://bbs.tianya.cn/post-16-835537-"
url3 = ".shtml#ty_vip_look[%e6%8b%89%e9%a3%8e%e7%86%8a%e7%8c%ab"
for i in range(1, 481):
    # a = urllib2.Request(url1 + str(i) + url3)
    b = urllib2.urlopen(url1 + str(i) + url3)
    path = str("d:/noval/Day-Eye" + str(i) + ".html")
    c = open(path, "w+")
    code = b.read()
    c.write(code)
    c.close()

Why, then, did we build a Request for the URL first? Here we need to introduce the concept of an opener. When we use urllib2 to handle URLs, we are actually going through an instance of urllib2.OpenerDirector, which invokes the various resources and operations for us: protocol handling, opening the URL, cookie handling, and so on. The urlopen method simply uses the default opener, which is quick and crude but not enough once we want to post data, set headers or set up a proxy.
Therefore, for slightly more demanding tasks we need urllib2.build_opener() to create an opener of our own. I will cover that part in detail in the next post ~
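
For a taste of what a custom opener looks like, here is a minimal sketch that routes requests through a proxy and sets a default User-Agent (the proxy address is a made-up example and will only work if such a proxy is actually running):

import urllib2

proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})   # hypothetical local proxy
opener = urllib2.build_opener(proxy)
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)   # from now on urllib2.urlopen() uses this opener

html = urllib2.urlopen('http://www.baidu.com').read()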

For some sites without special requirements, the urllib and urllib2 modules alone are enough to get the information we want. But for sites that require a simulated login or certain permissions, we have to handle cookies before we can successfully grab the information, and that is where the cookielib module comes in. The cookielib module is designed specifically for cookie handling; its most commonly used class is CookieJar(), which manages cookies automatically: it stores the cookies generated by HTTP responses and automatically attaches them to outgoing HTTP requests. As mentioned earlier, to use it you need to create a new opener:
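
A minimal sketch, following the same pattern as the full examples later in this article:

import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)   # cookies in cj are now stored and sent automatically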

After this step, the cookie problem is solved ~
If you want to print the cookies, you can use the command print cj._cookies.values().
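
Since _cookies is an internal attribute, an alternative (just a sketch, using the cj object created above) is to iterate over the CookieJar directly; each item is a Cookie object:

for cookie in cj:
    print cookie.name, '=', cookie.value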

Crawling Douban local events, and logging in to the library system to check book loans and returns
Having mastered the relevant usage of the urllib family of modules, the next step is to put them into practice ~

(i) Crawling Douban local events

The Douban Beijing events link points to the list of local events on Douban; we send a request to that link:

# encoding=utf-8
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
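
With the opener in place, fetching one page of the event list looks like this (a sketch that mirrors the loop in the full program further below):

url = "http://beijing.douban.com/events/future-all?start=0"
req = urllib2.Request(url)
event = urllib2.urlopen(req)
str1 = event.read()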

We will find that the returned HTML, besides the information we need, is interspersed with a lot of page-layout code:

As shown, we only need the event information in the middle, and to extract it we need regular expressions ~
Regular expressions are a cross-platform string-processing tool; with a regular expression we can easily extract the content we want from a string ~
I will not give a detailed introduction here; for beginners I personally recommend teacher Yu's guide to regular expressions, which is well suited for getting started. The rough syntax of regular expressions is shown below:

Here I use capture groups to target the four elements of each event (name, time, location, cost), which gives the following expression:

The code is as follows:


# Note: the HTML-tag fragments inside this pattern were partly lost when the article was
# converted to text, so this is an approximate reconstruction; the four capture groups
# correspond to name, time, location and cost as described above.
regex = re.compile(r'summary">([\d\D]*?)<[\d\D]*?class="hidden-xs">([\d\D]*?)<[\d\D]*?title="([\d\D]*?)"[\d\D]*?strong>([\d\D]*?)<')

This lets us extract each of the four parts.
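
To see how capture groups and finditer work together, here is a tiny self-contained sketch on a made-up HTML fragment (the tags and values are invented for the illustration):

import re

sample = '<li class="item"><b>PyCon sprint</b><span>2014-03-01</span></li>'
pattern = re.compile(r'<b>([\d\D]*?)</b><span>([\d\D]*?)</span>')
for m in pattern.finditer(sample):
    print 'name:', m.group(1)    # PyCon sprint
    print 'time:', m.group(2)    # 2014-03-01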

The overall code is as follows:

# -*- coding: utf-8 -*-
# ---------------------------------------
#   program: Douban local events crawler
#   author: GisLu
#   data: 2014-02-08
# ---------------------------------------
import urllib
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# regular-expression extraction
def search(str1):
    # the pattern is the approximate reconstruction discussed above
    regex = re.compile(r'summary">([\d\D]*?)<[\d\D]*?class="hidden-xs">([\d\D]*?)<[\d\D]*?title="([\d\D]*?)"[\d\D]*?strong>([\d\D]*?)<')
    for i in regex.finditer(str1):
        print "Event name:", i.group(1)
        a = i.group(2)
        b = a.replace(" ", "")
        print b.replace('\n', '')
        print "Event location:", i.group(3)
        c = i.group(4).decode('utf-8')
        print "Cost:", c

# fetch the pages
for i in range(0, 5):
    url = "http://beijing.douban.com/events/future-all?start="
    url = url + str(i * 10)
    req = urllib2.Request(url)
    event = urllib2.urlopen(req)
    str1 = event.read()
    search(str1)
Note the encoding issue here: because I am using Python 2.x, internal strings containing Chinese characters need to be converted back and forth. For example, when printing the "cost" item at the end, you must use i.group(4).decode('utf-8') to decode the UTF-8 byte string captured in group 4 into a Unicode string; otherwise you will find that the output is garbled.
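
A minimal illustration of the byte-string vs. Unicode distinction in Python 2 (the byte literal below is just the UTF-8 encoding of the Chinese word for "free"):

raw = '\xe5\x85\x8d\xe8\xb4\xb9'   # UTF-8 bytes
text = raw.decode('utf-8')          # now a unicode object
print type(raw), type(text)         # <type 'str'> <type 'unicode'>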
In Python, the re module offers two commonly used functions for global searching: findall and finditer. findall builds the entire result list in one go, which costs more memory, while finditer returns an iterator that yields matches one at a time; I personally recommend finditer.
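
A small sketch of the difference (the pattern and text are toy examples):

import re

pattern = re.compile(r'\d+')
text = 'a1 b22 c333'

print pattern.findall(text)        # ['1', '22', '333'] -- the whole list at once
for m in pattern.finditer(text):   # lazily yields match objects
    print m.group()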

The final results are as follows.

(ii) Simulating login to the library system to check borrowed books

Since we can send requests to a given website through Python, it is also possible to simulate a browser login with Python.
The key to the simulation is that the information we send to the web server must be in exactly the same format as what the browser sends.
That means we need to analyse how the target site exchanges information with the browser, which usually requires capturing the packets of that exchange ~
As for packet-capture software, the most popular at the moment is Wireshark, which is quite powerful ~ but for novices like us, the tools built into IE, Firefox or Chrome are more than enough ~

Here I will use my school's library system as an example ~

We can simulate the login and finally enter the library management system to query the borrowing and return status of our books. The first thing to do is capture packets and analyse what information we need to send in order to simulate a browser login successfully.

I am using the Chrome browser. On the login page, press F12 to bring up Chrome's built-in developer tools, select the Network tab, then enter the account number and password and click Login.
Observing the network activity during login, I found the suspect:

After analysing this POST request, we can confirm that it is the key command that sends the login information (account number, password, and so on).
Fortunately, our school is rather poor and the site is quite basic, so this packet is not encrypted at all ~ the rest is then very simple ~ just note down the headers and the post data and we are done ~
There is a lot of useful information in the headers. Some sites may decide whether you are a bot, and hence whether to allow access, based on the User-Agent; and the Referer is often used by sites to prevent hotlinking: if the server receives a request whose Referer does not match the rules set by the administrator, it will refuse to serve the resource.
The post data is the information we POST to the login server during login; it usually contains the account, the password and so on. There is often other data as well, such as layout fields, that has to be sent along; we never notice it while operating a browser, but without it the server will not respond to us at all.
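
As a sketch of how such headers can be attached to a single request (the URL, Referer value and form fields here are hypothetical placeholders, not my school's actual ones):

import urllib
import urllib2

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Referer': 'http://example.com/login.asp',   # hypothetical referring page
}
postdata = urllib.urlencode({'user': 'someone', 'pw': 'secret'})   # made-up form fields
req = urllib2.Request('http://example.com/login.asp', data=postdata, headers=headers)
response = urllib2.urlopen(req)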

Now that we know the format of both the post data and the headers, simulating the login is very simple:

import urllib
# ---------------------------------------
#   @program: book borrowing query
#   @author: GisLu
#   @data: 2014-02-08
# ---------------------------------------
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)

# log in and obtain cookies
postdata = urllib.urlencode({
    'user': 'X1234564',
    'pw': 'Demacialalala',
    'imagefield.x': '0',
    'imagefield.y': '0'
})
rep = urllib2.Request(
    url='http://210.45.210.6/dzjs/login.asp',
    data=postdata
)
result = urllib2.urlopen(rep)


Here urllib.urlencode takes care of encoding the post data automatically, and opener.addheaders attaches our preset headers to every subsequent request made through the opener.
After testing, the login succeeds ~~
The rest is just to find the URL of the page that lists borrowed and returned books, and then use regular expressions to extract the information we need ~~
The overall code is as follows:

import urllib
# ---------------------------------------
# @program: book borrowing query
# @author: GisLu
# @data: 2014-02-08
# ---------------------------------------
import urllib2
import cookielib
import re

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')]
urllib2.install_opener(opener)

# log in and obtain cookies
postdata = urllib.urlencode({
    'user': 'X11134564',
    'pw': 'Demacialalala',
    'imagefield.x': '0',
    'imagefield.y': '0'
})
rep = urllib2.Request(
    url='http://210.45.210.6/dzjs/login.asp',
    data=postdata
)
result = urllib2.urlopen(rep)
print result.geturl()

# get the table of borrowed books
postdata = urllib.urlencode({
    'ncxfs': '1',
    'submit1': 'Search'
})
aa = urllib2.Request(
    url='http://210.45.210.6/dzjs/jhcx.asp',
    data=postdata
)
bb = urllib2.urlopen(aa)
cc = bb.read()
# the closing part of this pattern was garbled in the source text; '<' is an
# approximate stand-in for the HTML tag that originally ended the match
zhangmu = re.findall('tdborder4">(.*?)<', cc)
for i in zhangmu:
    i = i.decode('gb2312')
    i = i.encode('gb2312')
    print i.strip()

Here is the result of the program run ~
