Python3 web crawler Learning-Basic Library Usage (2)


2. Request

Let's start with an example:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As before, this prints the content of the Python website, but this time we construct a Request object first: this lets us separate the request into an object of its own and flexibly configure its parameters.

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

    • The first argument, url, is required; all the others are optional.
    • The second parameter, data, must be passed as a byte stream (bytes) if data is to be transmitted. If it is a dictionary, it can first be encoded with urlencode() from the urllib.parse module.
    • The third parameter, headers, is a dictionary of request headers. Headers can also be added with add_header(). The User-Agent can be modified to disguise the request as a browser; for example, to pose as Firefox:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

    • The fourth parameter, origin_req_host, is the host name or IP address of the requester.
    • The fifth parameter, unverifiable, indicates whether the request is unverifiable; the default is False. For example, if we want to crawl an image inside a document but have no permission to fetch it automatically, this parameter should be True.
    • The sixth parameter, method, is a string indicating the request method, such as GET, POST, or PUT.
import urllib.request
from urllib import parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {'name': 'germey'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

===================== RESTART: F:\python\exercise\ok.py =====================
{
  "args": {},
  "data": "",
  "files": {},
  "form": {"name": "germey"},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "182.110.15.26",
  "url": "http://httpbin.org/post"
}
The API of the bytes() function is:
class bytes([source[, encoding[, errors]]])
    • If source is an integer, a zero-initialized array of length source is returned;
    • If source is a string, the string is converted to a sequence of bytes according to the specified encoding;
    • If source is an iterable, its elements must be integers in [0, 255];
    • If source is an object conforming to the buffer interface, that object can also be used to initialize the bytes object;
    • If no argument is passed, an empty (zero-length) array is returned by default.
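
A few quick checks of these cases (a sketch; the values are arbitrary):

# An integer source gives a zero-filled array of that length
print(bytes(4))                        # b'\x00\x00\x00\x00'
# A string source is encoded with the given encoding
print(bytes('abc', encoding='utf-8'))  # b'abc'
# An iterable source must contain integers in [0, 255]
print(bytes([97, 98, 99]))             # b'abc'
# No arguments gives an empty bytes object
print(bytes())                         # b''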
The urlencode() function in urllib.parse converts key-value pairs into the format we want, returning a string such as a=1&b=2.
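
For example (the keys and values here are arbitrary):

from urllib import parse

params = {'a': 1, 'b': 2}
print(parse.urlencode(params))  # prints: a=1&b=2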
A header can also be added to a Request object after construction with add_header(), as in the sketch below.
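
A minimal sketch of add_header(); note that it takes the header name and value as two separate arguments:

import urllib.request

req = urllib.request.Request('http://httpbin.org/get')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))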
Here are some more advanced usages, built on our tool, the Handler.
First, you need to know BaseHandler, the parent class of all handlers; it has many subclasses:
    • HTTPDefaultErrorHandler: handles HTTP response errors; errors raise an HTTPError exception
    • HTTPRedirectHandler: handles redirects
    • HTTPCookieProcessor: handles cookies
    • ProxyHandler: sets a proxy; the default proxy is empty
    • HTTPPasswordMgr: manages passwords, maintaining a table of usernames and passwords
    • HTTPBasicAuthHandler: manages authentication; if a link requires authentication when opened, it can be used to solve the authentication problem

There is also the OpenerDirector class, which we can call an Opener. The urlopen() used above is in fact an Opener provided by the library: urlopen() is equivalent to an encapsulated request method, while using the Opener class directly takes request configuration one step further.

An Opener provides an open() method, and the type it returns is the same as that of urlopen().
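
A minimal sketch of using an Opener in place of urlopen() (the URL is just an example):

import urllib.request

# build_opener() with no arguments returns an OpenerDirector with the default handlers
opener = urllib.request.build_opener()
# open() behaves like urlopen() and returns the same response type
response = opener.open('https://www.python.org')
print(response.status)

Handlers such as those listed above are passed to build_opener() to extend this default behavior.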

    • Authentication

HTTPBasicAuthHandler processor (web client authorization authentication)

Some sites require authentication when opened, prompting for a username and password:

import urllib.request
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:4000/'

# Build a password management object to hold the username and password to be processed
p = HTTPPasswordMgrWithDefaultRealm()
# Add account information. The first parameter, realm, is domain information related to
# the remote server; usually it is just None. The following three parameters are the
# web server, username, and password.
p.add_password(None, url, username, password)
# Build an HTTPBasicAuthHandler processor object for HTTP basic username/password
# authentication; the parameter is the password management object created above
auth_handler = HTTPBasicAuthHandler(p)
# Create a custom Opener object from this handler via the build_opener() method
opener = build_opener(auth_handler)
# Sending the request directly with the Opener is then equivalent to having already
# authenticated successfully
try:
    # install_opener() also makes this Opener the global default used by urlopen()
    urllib.request.install_opener(opener)
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

===================== RESTART: F:\python\exercise\ok.py =====================
[WinError 10061] No connection could be made because the target machine actively refused it.

(The error here simply means that nothing is listening at localhost:4000; against a real server that requires basic authentication, the page source would be printed instead.)



    • Proxy

ProxyBasicAuthHandler (proxy authorization authentication)

The ProxyHandler class handles proxies directly (a short sketch follows); for a private proxy that requires authorization, ProxyBasicAuthHandler is used instead, as in the longer example after it.
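
For an unauthenticated proxy, ProxyHandler alone is enough. A minimal sketch, assuming a made-up local proxy address:

import urllib.request
from urllib.error import URLError

# ProxyHandler takes a dict mapping protocol names to proxy addresses
# (127.0.0.1:9743 is a placeholder; substitute a real proxy)
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)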

import urllib.request

# Private proxy authorized account
user = "Mr_mao_hacker"
# Private proxy authorized password
passwd = "sffqry9r"
# Private proxy IP
proxyserver = "61.158.163.13:16816"

# 1. Build a password management object to hold the username and password to be processed
passwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2. Add account information. The first parameter, realm, is domain information related
#    to the remote server; generally nobody cares about it, so it is written as None.
#    The following three parameters are the proxy server, username, and password.
passwdmgr.add_password(None, proxyserver, user, passwd)
# 3. Build a ProxyBasicAuthHandler processor object for proxy basic username/password
#    authentication; the parameter is the password management object created above.
#    Note that the ordinary ProxyHandler class is no longer used here.
proxyauth_handler = urllib.request.ProxyBasicAuthHandler(passwdmgr)
# 4. Create a custom opener object from this handler with the build_opener() method
opener = urllib.request.build_opener(proxyauth_handler)
# 5. Construct the request. The Request class built here amounts to a custom request,
#    but the customization only sets its URL, so it is no different from passing the
#    URL directly.
request = urllib.request.Request("http://www.baidu.com/")
# 6. Send the request using the custom opener
response = opener.open(request)
# 7. Print the response content
print(response.read().decode('utf-8'))

For proxies, you can also refer to this article: 79074219

    • Cookies

First of all, what is a cookie? It is data (stored as a .txt file) that a server keeps on your computer so that it can identify your machine later. While you browse a website, the web server sends a small piece of information to your computer, and the cookie records the text you entered or the choices you made on the site. On your next visit to the same site, the web server first checks whether the cookie information from last time is present; if so, it identifies you from the cookie's contents and sends you page content specific to you.

First, an example of how to get cookies from a website:

import http.cookiejar, urllib.request

# A CookieJar object must be declared first
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

====================== RESTART: F:\python\exercise\1.py ======================
BAIDUID=76B5F4D5A2EF8ADD2571BABCBAA63F79:FG=1
BIDUPSID=76B5F4D5A2EF8ADD2571BABCBAA63F79
H_PS_PSSID=1993_1454_21093_26350_26922_22158
PSTM=1534469599
BDSVRTM=0
BD_HOME=0
delPer=0

Then, following a blogger's post found online, I did some further study: 69817490

Then I tried to save the cookie information to a file:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.build_opener(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Running it, the program produced an error.

I'll leave this pitfall here for now; if anyone knows where I went wrong, please point it out.
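
One likely cause: build_opener() expects handler objects, but the code above passes the MozillaCookieJar itself to the first build_opener() call. Wrapping the jar in HTTPCookieProcessor, exactly as in the earlier example, should fix it. A sketch of the corrected version, with loading the saved cookies back for good measure:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
# The CookieJar must be wrapped in an HTTPCookieProcessor handler
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Save cookies to the file, keeping session and expired cookies as well
cookie.save(ignore_discard=True, ignore_expires=True)

# The saved cookies can later be loaded back from the file
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)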

