Python web crawler (i)

Source: Internet
Author: User
Tags: session id, python, web crawler

Urllib: basic usage for sending requests

The basic way to send a request is to use the urllib.request module. The Request class has the following constructor:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Fill in the parameters you need (URL, POST data, headers, HTTP method) when building the Request, then pass it to urlopen().
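As a minimal sketch (the URL, headers, and form field below are placeholders for illustration, not taken from the article), building a Request with custom headers and POST data and sending it could look like this:

from urllib import request, parse

# Placeholder target and headers, for illustration only.
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; example-crawler)',
    'Host': 'httpbin.org'
}
# POST data must be bytes: url-encode the dict and encode it as UTF-8.
data = parse.urlencode({'name': 'example'}).encode('utf-8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))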

Advanced usage

This part introduces the processor: the Handler. With Handlers you can deal with cookies, set up proxies, and handle almost anything in an HTTP request.
First there is the BaseHandler class in the urllib.request module, the parent of all other Handlers. It provides the most basic Handler methods, such as default_open(), protocol_request(), and so on. Various Handler subclasses inherit from BaseHandler, for example:

    • HTTPDefaultErrorHandler handles HTTP response errors; such errors are raised as exceptions of type HTTPError.
    • HTTPRedirectHandler handles redirects.
    • HTTPCookieProcessor handles cookies.
    • ProxyHandler sets a proxy; the default proxy is empty.
    • HTTPPasswordMgr manages passwords; it maintains a table of user names and passwords.
    • HTTPBasicAuthHandler manages authentication; it can be used when opening a link that requires authentication.

More Handlers are documented at https://docs.python.org/3/library/urllib.request.html#basehandler-objects
For an ordinary request, urlopen() is enough; to use a Handler you need an Opener, built with build_opener(). In fact, urlopen() is itself just an Opener that urllib provides for us.
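As a minimal sketch of the Handler/Opener relationship (the handler chosen here, HTTPCookieProcessor, is just one example; any Handler is plugged in the same way):

import urllib.request

# Build an Opener that, in addition to the default handlers, also processes cookies.
handler = urllib.request.HTTPCookieProcessor()
opener = urllib.request.build_opener(handler)

# Either call the Opener directly...
response = opener.open('http://www.baidu.com')

# ...or install it globally so that plain urlopen() uses it from now on.
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://www.baidu.com')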

Authentication

Some websites require a user name and password before the content will appear, for example a router's administrator login page (the page reached by entering 192.168.1.1 in the browser). The goal here is not to bypass the login page, but to simulate the request with credentials so that it does not return an error.

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)   # register the credentials for this URL
auth_handler = HTTPBasicAuthHandler(p)          # handler that performs basic authentication
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
Proxy

Setting up a proxy looks like this:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
Cookies

First, some background on cookies. When we log on to websites, we often find that some sites log us in automatically and stay logged in for a long time, while others require us to log in again (entering the user name and password). This involves sessions and cookies. HTTP is a stateless protocol: the client and server do not keep any record of previous interactions while exchanging data. For example, if the page for an image stops loading when only half has been transferred, the server does not remember the earlier transfer; unless the client adds something to the request, the server will simply send the complete image data all over again. So the statelessness of HTTP means the protocol cannot remember transactions, i.e. the server does not know what state the client is in. This avoids redundant information (the protocol does not have to store transaction state while transferring data), but for pages that require a logged-in user we certainly do not want to retransmit all the previous requests every time; that would waste resources. Hence the techniques for keeping HTTP connection state. The session lives on the server side (the website's server) and stores the user's session information; cookies live on the client. The server recognizes the cookie, identifies which user it belongs to, determines whether that user is logged in, and returns the corresponding response. So we can understand it this way: if we put the cookies obtained after a successful login directly into the request headers and send them to the server, we do not have to simulate the login again.
The next time the browser requests the website, it puts this cookie into the request headers and sends it to the server. The cookie carries the session ID, so the server checks the cookie, finds the corresponding session, and judges from that session whether the user is logged in. In other words, when we log on to a website, the server tells the client which cookies to set after a successful login; the client sends those cookies back with subsequent page requests; the server finds the matching session and, if the login-status variables in that session are still valid, concludes that the user is logged in and returns the content that only logged-in users may view, which the browser then renders.
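As a minimal sketch of that idea (the cookie string and URL are placeholders, not values from the article), sending a previously obtained cookie along with a request could look like this:

import urllib.request

# Placeholder: the Cookie header copied from a logged-in browser session.
cookie_string = 'SESSIONID=xxxxxxxx; OTHER=value'

req = urllib.request.Request(
    'http://www.example.com/profile',   # placeholder URL of a page that requires login
    headers={'Cookie': cookie_string}
)
response = urllib.request.urlopen(req)
print(response.getcode())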
Now for the cookie-related Handler in urllib. First, let's obtain the cookies set by a website:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()                    # declare a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)   # create the Handler
opener = urllib.request.build_opener(handler)          # build the Opener
response = opener.open('http://www.baidu.com')         # perform the request with open()
for item in cookie:
    print(item.name + "=" + item.value)

Running this produces output like the following:

BAIDUID=2E65A683F8A8BA3DF521469DF8EFF1E1:FG=1
BIDUPSID=2E65A683F8A8BA3DF521469DF8EFF1E1
H_PS_PSSID=20987_1421_18282_17949_21122_17001_21227_21189_21161_20927
PSTM=1474900615
BDSVRTM=0
BD_HOME=0

That is what the cookies look like in full. We can also save them to a file so they can be reused later:

filename = 'cookies.txt'
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

This time the CookieJar is replaced with a file-backed subclass. MozillaCookieJar (the commented-out line above) saves cookies in the format used by the Mozilla browser, while LWPCookieJar saves them in libwww-perl (LWP) format. Both are subclasses of CookieJar that handle cookies together with the related file operations, i.e. reading and saving cookies.
Since LWPCookieJar was used above, the resulting cookies.txt file looks like this:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="0CE9C56F598E69DB375B7C294AE5C591:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: BIDUPSID=0CE9C56F598E69DB375B7C294AE5C591; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: H_PS_PSSID=20048_1448_18240_17944_21089_21192_21161_20929; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1474902671; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2084-10-14 18:25:19Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
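For comparison, a short sketch of the MozillaCookieJar variant that the commented-out line above refers to; it writes cookies.txt in the Netscape/Mozilla browser cookie format instead:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)     # Mozilla-format cookie jar
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Keep session cookies (marked discard) and expired cookies as well.
cookie.save(ignore_discard=True, ignore_expires=True)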

Next, read the file back, in this case in LWPCookieJar format:

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
Urllib: handling exceptions

If the program hits an error halfway through fetching data and we have not written any exception handling, the data collected so far is lost. For example, while scraping the Douban movie Top 250, some movies had incomplete parameters and the crawler kept failing in the middle. And if the network conditions suddenly change, exception handling lets the program carry on once the network recovers. In short, writing exception handling is very important!

HTTPError (a subclass of URLError) has three useful attributes:
    • code returns the HTTP status code, for example 404 (page not found) or 500 (internal server error).
    • reason, like the parent class, returns the reason for the error.
    • headers returns the response headers.
from urllib import request, error

try:
    response = request.urlopen('http://没有这个页面.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
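Since HTTPError is a subclass of URLError, a common pattern (shown here as a sketch, not taken from the article) is to catch the more specific HTTPError first and fall back to URLError:

from urllib import request, error

try:
    response = request.urlopen('http://没有这个页面.com/index.htm', timeout=5)
except error.HTTPError as e:
    # The server answered, but with an HTTP error status.
    print('HTTP error:', e.code, e.reason)
except error.URLError as e:
    # The request never reached a server (DNS failure, refused connection, timeout, ...).
    print('URL error:', e.reason)
else:
    print('Request succeeded')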
