

Python web crawler Learning Notes


by Zhonghuanlin



September 4, 2014 (updated September 4, 2014)


Article Directory


1. Introduction

2. Starting from a simple example

3. Transferring data to the server

4. HTTP headers: data that describes the data

5. Exceptions

5.1. URLError

5.2. HTTPError

5.3. Handling Exceptions

5.4. info() and geturl()

6. Opener and Handler

7. Basic Authentication

8. Proxies

9. Timeout setting

10. Cookies

11. Debug Log

References:



Introduction


The name "crawler" is quite evocative; the English term is "web spider". The image fits: a spider spins a web to catch food, and our crawler walks the web to fetch resources from the network. This blog is a record of my learning process. The language I use is Python 2.7, which has two modules, urllib and urllib2, that provide good network access functionality; the examples below should make this clearer. It is worth mentioning that in Python 3 these two modules were merged into a single urllib package. If you are interested, you can read more about that here.
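For readers on Python 3 (which this post does not otherwise use), here is a rough sketch of what the first example below would look like after that merge; everything else in this post stays on Python 2.7:


import urllib.request  # Python 3 only; in Python 2.7 use urllib2 as in the rest of this post

response = urllib.request.urlopen("http://www.zhonghuan.info")
html = response.read()           # bytes in Python 3
print(html.decode("utf-8"))      # decode before printing, assuming the page is UTF-8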



urllib and urllib2 are powerful networking libraries in Python that make network access feel like file access (for example, to access a file we first open() it, and the operations here are similar, as you will see in the examples below). This is convenient because internally these modules use the various network protocols to do the work (anyone who has studied the web knows that the seemingly simple act of fetching a page actually involves many protocols, such as HTTP and DNS); urllib and urllib2 encapsulate these protocols so that we do not have to deal with them directly and only need to call the modules' methods. At the same time, these modules provide some slightly more complex interfaces to handle situations such as user authentication, cookies, and proxies. Let's start learning them.


Starting from a simple example


As I said earlier, with these two modules, accessing a web page becomes as convenient as accessing a file. In most cases urllib2 is the better choice (it is more capable), but urllib is still needed for some things, as described below. So let's look at the simplest example of using urllib2.








import urllib2

response = urllib2.urlopen("http://www.zhonghuan.info")
html = response.read()
print html


In a terminal, run python test.py > zhonghuan.html;
opening the resulting file shows the HTML of my personal blog's homepage:






This is the simplest example of using urllib2 to access a web page. urllib2 uses the part of the URL before the ":" to decide which protocol to use; in the example above it is http:, but it could equally be ftp:, file:, and so on. We do not need to know how it encapsulates these network protocols internally.
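For example (a minimal sketch; the local path is only a placeholder and assumes such a file exists on your machine), the very same call can read a local file through the file: scheme:


import urllib2

# file: URLs go through the same API; /etc/hosts is just an illustrative path
response = urllib2.urlopen("file:///etc/hosts")
print response.read()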



urllib2 can use a Request object to represent our HTTP access; it indicates the URL address you want to visit. Let's take a look at the example below.








import urllib2

req = urllib2.Request('http://www.zhonghuan.info')
response = urllib2.urlopen(req)
the_page = response.read()
print(the_page)






The req variable is a Request object. It indicates exactly the URL you want to access (here, http://www.zhonghuan.info). Other forms of access, such as FTP and file, look similar; see [here][2] for details.



In fact, the Request object can do two additional things:


    1. You can send data to the server.
    2. You can send some extra information (also called metadata, that is, data that describes the data. Just as metaclasses in languages such as Python are classes that generate classes, metadata, as the name implies, describes data. What does it describe here? It describes the request itself, and these descriptions are sent out in the HTTP headers. For HTTP headers, you can see here.)
Transferring data to the server


Sometimes you need to send data to a server at a given URL; often that URL points to a CGI (Common Gateway Interface) script or some other web application. (For CGI scripts, see here; simply put, they are scripts that handle uploaded data.) In HTTP, data is usually sent with POST: when you fill out an HTML form and need to submit its data, a POST request is typically used. Of course, POST is also used in other situations, not just for forms.



Let's look at the following code first:






import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)  # The data needs to be encoded into a suitable format; urllib is used here because urllib2 has no encoding function
req = urllib2.Request(url, data)  # The data to upload is passed to the Request object as a parameter
response = urllib2.urlopen(req)
the_page = response.read()


For other types of data upload, you can see here



In addition to POST, you can also send data with the GET method. The obvious difference is that with GET the data is appended to, and visible at, the end of the URL. You can look at the following code:






import urllib
import urllib2

data = {}
data['name'] = 'Somebody Here'
data['location'] = 'Northampton'
data['language'] = 'Python'

url_values = urllib.urlencode(data)
print url_values  # The key order here is not guaranteed

url = 'http://www.example.com/example.cgi'
full_url = url + '?' + url_values
data = urllib2.urlopen(full_url)


You can print url_values to see the encoded form of the data.
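For illustration only (the actual key order depends on the dictionary and may differ), the printed url_values and the resulting full_url look roughly like this:


name=Somebody+Here&language=Python&location=Northampton
http://www.example.com/example.cgi?name=Somebody+Here&language=Python&location=Northampton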


HTTP headers: data that describes the data


Now let's discuss HTTP headers and see how to add one to your HTTP Request object.



Some websites do not like being accessed by programs (non-human visits only add to the load on their servers). Other, smarter sites even send different page content to different browsers.



However, by default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers; for me it is Python-urllib/2.7). This may confuse some sites, and if a site does not like being visited by programs, such a request may simply be ignored. So you can construct a different identity so that the site does not reject you. Look at the following example.






import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # user_agent identifies your browser; here we pretend to be a Mozilla/IE browser
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)

req = urllib2.Request(url, data, headers)

response = urllib2.urlopen(req)

the_page = response.read()

Exceptions


Exceptions happen often, so watch out! Think of ordinary file operations: what exceptions can occur? The file cannot be opened, permissions are insufficient, the file does not exist, and so on. Similarly, URL access runs into the same kinds of problems (and Python's own internal exceptions, such as ValueError and TypeError, may also occur).





URLError


First, URLError. When there is no network connection, or the server address does not exist, a URLError is raised. In that case the URLError exception has a "reason" attribute, a tuple containing an error code (an int) and a text error message (a string). See the following:







import urllib
import urllib2


req = urllib2.Request('http://www.pretend_server.org')
try: urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason


The output is the reason: [Errno 8] nodename nor servname provided, or not known





HTTPError


Every HTTP request receives a "status code" from the server. Often these status codes tell us that the server could not satisfy the request (put simply, the code is just data, but it represents the status of the current access, for example that the request was refused).



However, urllib2's default handlers can deal with some of these server responses for you. For example, if the URL you are visiting is redirected by the server, that is, the server gives you a new URL to access, the handler will follow the redirect and fetch the new URL for you automatically.



However, the default handlers are limited and cannot solve every problem for you: the page you visit may not exist (a 404 error, which we all see from time to time), your access may be forbidden (a 403 error, perhaps because you lack sufficient permission), or authentication may be required (a 401). Other errors are not covered in this article; you can see here for details.



Let's look at the program below and see what it outputs when an HTTPError occurs, in this case a 404, which means the page does not exist.







import urllib
import urllib2


req = urllib2.Request('http://www.zhonghuan.info/no_way')
try: urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print e.read()


Output:
404



<!DOCTYPE html>



...



<title>Page not found &middot; GitHub Pages</title>



...





Handling Exceptions


Suppose you want to catch both HTTPError and URLError; there are two basic approaches, and the second one is recommended!



The first type:







from urllib2 import Request, urlopen, URLError, HTTPError


req = Request('http://zhonghuan.info')
try:
    response = urlopen(req)
except HTTPError as e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError as e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass


In the first approach, the HTTPError clause must come before the URLError clause. The reason is the same as in the exception-handling mechanism of many other languages: HTTPError is a subclass of URLError, so if URLError were listed first, an HTTPError would be caught by that clause instead.
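A quick sanity check confirms the relationship:


from urllib2 import HTTPError, URLError

print issubclass(HTTPError, URLError)  # True, which is why the HTTPError clause must come first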



The second type:







from urllib2 import Request, urlopen, URLError


req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass


info() and geturl()


The response object has two useful methods, info() and geturl().



geturl(): this method returns the real URL of the page that was fetched. Its value lies in the fact that the page we visit may be redirected, so the URL that was actually fetched may differ from the one we entered. Look at the following example:







import urllib
import urllib2

url = 'http://weibo.com/u/2103243911'
req = urllib2.Request(url)
response = urllib2.urlopen(req)

print "URL:", url
print "After redirection:", response.geturl()


Taking my Weibo profile as an example, the access is actually redirected, and the real URL can be seen from the output:



URL: http://weibo.com/u/2103243911

After redirection: http://passport.weibo.com/visitor/visitor?a=enter&url=http%3A%2F%2Fweibo.com%2Fu%2F2103243911&_rand=1409761358.1794





info(): returns information describing the page as an httplib.HTTPMessage instance, which prints out rather like a dictionary. Look at the following code:







import urllib
import urllib2


url = 'http://zhonghuan.info'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
print response.info()
print response.info().__class__


Output:







Server: GitHub.com
Content-Type: text/html; charset=utf-8
Last-Modified: Tue, 02 Sep 2014 17:01:39 GMT
Expires: Wed, 03 Sep 2014 15:23:02 GMT
Cache-Control: max-age=600
Content-Length: 4784
Accept-Ranges: bytes
Date: Wed, 03 Sep 2014 16:38:29 GMT
Via: 1.1 varnish
Age: 5127
Connection: close
X-Served-By: cache-lax1433-LAX
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1409762309.465760,VS0,VE0
Vary: Accept-Encoding

Class: httplib.HTTPMessage
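Since the returned object behaves much like a dictionary of headers, you can also read individual headers directly; a small sketch (the header names are taken from the output above):


import urllib2

response = urllib2.urlopen('http://zhonghuan.info')
info = response.info()
print info.getheader('Content-Type')  # e.g. text/html; charset=utf-8
print info['Server']                  # dictionary-style access also works, e.g. GitHub.com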




Opener and Handler


Here we introduce opener and handler.



What is an opener? In fact, the examples above have been using an opener all along: urlopen uses the default opener. For many kinds of network access you can create a more suitable opener of your own.



What is a handler? The opener calls handlers to take care of the detailed work of the access, so handlers are important. A handler knows how to handle access for a particular protocol (such as FTP or HTTP); for example, it will handle redirection for you.



When making a request, you may have particular requirements for the opener: for example, you may want an opener that can handle cookies, or you may not want the opener to follow redirects for you.



How do we create the opener we need? (Personally, I find that building an opener here resembles the builder pattern from design patterns; it is not exactly the same, but understanding that pattern first is helpful, so readers who have not met it can read about the builder pattern.)



To create an opener, you can instantiate an OpenerDirector
and then call .add_handler(some_handler_instance) on it.



Alternatively, you can use build_opener, a more convenient function for creating opener objects that needs only a single call. build_opener adds several handlers by default and provides a quick way to add more or to override the defaults.
Other handlers you might want handle proxies, authentication, and other common but slightly special situations.
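As a rough sketch of the two routes (the handlers chosen here are only illustrative):


import urllib2

# Explicit route: instantiate an OpenerDirector and add handlers one by one
opener = urllib2.OpenerDirector()
opener.add_handler(urllib2.HTTPHandler())

# Convenient route: build_opener starts from the default handlers and stacks yours on top
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())

# Either opener can fetch a URL directly, or be installed as the global default
response = opener.open('http://www.zhonghuan.info')
urllib2.install_opener(opener)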



We just mentioned that a handler will follow redirects for us; but what if we do not want to be redirected? Customize a handler. Look at the following code:






import urllib
import urllib2


class RedirectHandler(urllib2.HTTPRedirectHandler):  # Inherits from HTTPRedirectHandler but overrides its methods to do nothing, so the redirection ability is lost
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        pass


webo = "http://weibo.com/u/2103243911"  # My Weibo page; under normal circumstances accessing it triggers a redirect
opener = urllib2.build_opener(RedirectHandler)  # Build a custom opener with our handler, which is consulted when a redirect happens
response = opener.open(webo)  # equivalent to urllib2.urlopen(webo) once the opener is installed
print response.geturl()
urllib2.install_opener(opener)  # Install the custom opener; future urllib2 calls will use it


The output is:
urllib2.HTTPError: HTTP Error 302: Moved Temporarily



HTTP Error 302 occurred because it should have been redirected when I visited my Weibo profile, but our own redirect handler did nothing and the result was an exception.
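If you would rather keep running than let that exception end the program, you can catch it; a minimal sketch reusing the RedirectHandler class defined above:


import urllib2

opener = urllib2.build_opener(RedirectHandler)  # RedirectHandler as defined in the example above
webo = "http://weibo.com/u/2103243911"
try:
    response = opener.open(webo)
    print response.geturl()
except urllib2.HTTPError as e:
    print "Redirect suppressed; the server answered:", e.code, e.msg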



You can look at the following class diagram of the urllib2 classes involved in creating a custom opener:








Basic Authentication


If a website offers registration and login, it generally has usernames and passwords. When you visit a protected page, the system asks you to provide a username/password; this process is called authentication and is actually carried out on the server side. It provides security for certain pages.



A basic authentication process goes like this:


    1. The client requests access to certain pages.
    2. The server returns an error saying that authentication is required.
    3. The client encodes its username/password (usually) and sends them to the server.
    4. The server checks whether the username/password is correct and returns either the requested page or an error.


The process may also take other forms; the above is just the most common one.



Usually the server returns a 401 error indicating that the requested page is not authorized, and the returned response headers contain a line of the form



WWW-Authenticate: SCHEME realm="REALM".



For example, if you try to access the cPanel management application, you will receive a header such as WWW-Authenticate: Basic realm="cPanel". (cPanel is one of the best-known commercial packages in the web hosting industry; it runs on Linux and BSD systems, is developed in PHP, is closed-source, and is mainly a customer-facing control panel.)



When we visit a page, the opener calls handlers to deal with the various situations. The handler that deals with authentication is urllib2.HTTPBasicAuthHandler, and it needs a password manager, urllib2.HTTPPasswordMgr.



Unfortunately, HTTPPasswordMgr has a small problem: you need to know the realm before fetching the page. Fortunately it has a cousin, HTTPPasswordMgrWithDefaultRealm, which does not need to know the realm in advance; you can pass None for the realm parameter, which makes it friendlier to use.



The following code shows how this is used:






import urllib2


url = 'http://www.weibo.com'  # the domain, plus an account and password
username = 'zhonghuan'
password = 'forget_it'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()  # Create a password manager
passman.add_password(None, url, username, password)  # Parameters: (realm, url, username, password)
authhandler = urllib2.HTTPBasicAuthHandler(passman)  # Create an authentication handler
opener = urllib2.build_opener(authhandler)
urllib2.install_opener(opener)  # As above: after install_opener, every call to urllib2.urlopen uses this opener
pagehandle = urllib2.urlopen(url)




Proxies


Sometimes our machine cannot reach the target directly and we need to go through a proxy server. urllib2's support for proxies is quite good: you simply instantiate a ProxyHandler whose parameter is a map in which each key is a protocol name and the corresponding value is the proxy's address. Look at the following code.







import urllib2


enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)




Timeout setting


In older versions of Python, the urllib2 API did not expose a timeout setting; to set a timeout you could only change the socket module's global timeout value.






import urllib2
import socket


socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way of writing the same thing


Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().






import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)




Cookies


urllib2's handling of cookies is also automatic. If you need to get the value of a particular cookie item, you can do this:







import urllib2
import cookielib


cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.google.com')
for item in cookie:
    if item.name == 'some_cookie_item_name':
        print item.value




Debug Log


When using urllib2, you can turn on a debug log as follows, so that the contents of what is sent and received are printed to the screen. This makes debugging easier and can sometimes save you from having to capture packets.







import urllib2


httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

References:
    1. Use of Python urllib2 (recommended)

    2. Python web crawler Introductory tutorial (recommended)

    3. Getting started with CGI scripts (recommended)

    4. A brief look at the urllib2 source

    5. Usage details of Python standard library URLLIB2 (recommended)

    6. Authentication with Python (recommended)

    7. http://en.wikipedia.org/wiki/List_of_HTTP_header_fields



