Python3 web crawler Learning-Basic Library Usage (2)


2. Request

Let's start with an example:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As before, this prints the content of the Python website, but this time we construct a Request object first: this lets us separate the request into an object of its own and flexibly configure its parameters.

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

    • The first argument, url, is required; all the others are optional.
    • The second parameter, data, must be passed as a byte stream (bytes) if data is to be transmitted. If it is a dictionary, it can first be encoded with urlencode() from the urllib.parse module.
    • The third parameter, headers, is a dictionary of request headers. Headers can also be added with add_header(). The User-Agent can be modified to disguise the request as a browser; for example, to pose as Firefox:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

    • The fourth parameter, origin_req_host, is the host name or IP address of the requester.
    • The fifth parameter, unverifiable, indicates whether the request is unverifiable; the default is False. For example, if we want to crawl an image inside a document but have no permission to fetch it automatically, this parameter should be True.
    • The sixth parameter, method, is a string indicating the request method, such as GET, POST, or PUT.
import urllib.request
from urllib import parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {'name': 'germey'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

===================== RESTART: F:\python\exercise\ok.py =====================
{
  "args": {},
  "data": "",
  "files": {},
  "form": {"name": "germey"},
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "182.110.15.26",
  "url": "http://httpbin.org/post"
}
The API of the bytes() function is:
class bytes([source[, encoding[, errors]]])
    • If source is an integer, a zero-initialized array of length source is returned;
    • If source is a string, the string is converted to a sequence of bytes according to the specified encoding;
    • If source is an iterable, its elements must be integers in [0, 255];
    • If source is an object conforming to the buffer interface, that object can also be used to initialize the bytes object;
    • If no argument is passed, an empty (zero-length) array is returned by default.
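
A few quick checks of these cases (a sketch; the values are arbitrary):

# An integer source gives a zero-filled array of that length
print(bytes(4))                        # b'\x00\x00\x00\x00'
# A string source is encoded with the given encoding
print(bytes('abc', encoding='utf-8'))  # b'abc'
# An iterable source must contain integers in [0, 255]
print(bytes([97, 98, 99]))             # b'abc'
# No arguments gives an empty bytes object
print(bytes())                         # b''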
The urlencode() function in urllib.parse converts key-value pairs into the format we want, returning a string such as a=1&b=2.
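
For example (the keys and values here are arbitrary):

from urllib import parse

params = {'a': 1, 'b': 2}
print(parse.urlencode(params))  # prints: a=1&b=2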
A header can also be added to a Request object after construction with add_header(), as in the sketch below.
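
A minimal sketch of add_header(); note that it takes the header name and value as two separate arguments:

import urllib.request

req = urllib.request.Request('http://httpbin.org/get')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))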
Here are some more advanced usages, built on our tool, the Handler.
First, you need to know BaseHandler, the parent class of all handlers; it has many subclasses:
    • HTTPDefaultErrorHandler: handles HTTP response errors; errors raise an HTTPError exception
    • HTTPRedirectHandler: handles redirects
    • HTTPCookieProcessor: handles cookies
    • ProxyHandler: sets a proxy; the default proxy is empty
    • HTTPPasswordMgr: manages passwords, maintaining a table of usernames and passwords
    • HTTPBasicAuthHandler: manages authentication; if a link requires authentication when opened, it can be used to solve the authentication problem

There is also the OpenerDirector class, which we can call an Opener. The urlopen() used above is in fact an Opener provided by the library: urlopen() is equivalent to an encapsulated request method, while using the Opener class directly takes request configuration one step further.

An Opener provides an open() method, and the type it returns is the same as that of urlopen().
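
A minimal sketch of using an Opener in place of urlopen() (the URL is just an example):

import urllib.request

# build_opener() with no arguments returns an OpenerDirector with the default handlers
opener = urllib.request.build_opener()
# open() behaves like urlopen() and returns the same response type
response = opener.open('https://www.python.org')
print(response.status)

Handlers such as those listed above are passed to build_opener() to extend this default behavior.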

    • Authentication

HTTPBasicAuthHandler processor (web client authorization authentication)

Some sites require authentication when opened, prompting for a username and password:

import urllib.request
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:4000/'

# Build a password management object to hold the username and password to be processed
p = HTTPPasswordMgrWithDefaultRealm()
# Add account information. The first parameter, realm, is domain information related to
# the remote server; usually it is just None. The following three parameters are the
# web server, username, and password.
p.add_password(None, url, username, password)
# Build an HTTPBasicAuthHandler processor object for HTTP basic username/password
# authentication; the parameter is the password management object created above
auth_handler = HTTPBasicAuthHandler(p)
# Create a custom Opener object from this handler via the build_opener() method
opener = build_opener(auth_handler)
# Sending the request directly with the Opener is then equivalent to having already
# authenticated successfully
try:
    # install_opener() also makes this Opener the global default used by urlopen()
    urllib.request.install_opener(opener)
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

===================== RESTART: F:\python\exercise\ok.py =====================
[WinError 10061] No connection could be made because the target machine actively refused it.

(The error here simply means that nothing is listening at localhost:4000; against a real server that requires basic authentication, the page source would be printed instead.)



    • Proxy

ProxyBasicAuthHandler (proxy authorization authentication)

The ProxyHandler class handles proxies directly (a short sketch follows); for a private proxy that requires authorization, ProxyBasicAuthHandler is used instead, as in the longer example after it.
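
For an unauthenticated proxy, ProxyHandler alone is enough. A minimal sketch, assuming a made-up local proxy address:

import urllib.request
from urllib.error import URLError

# ProxyHandler takes a dict mapping protocol names to proxy addresses
# (127.0.0.1:9743 is a placeholder; substitute a real proxy)
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)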

import urllib.request

# Private proxy authorized account
user = "Mr_mao_hacker"
# Private proxy authorized password
passwd = "sffqry9r"
# Private proxy IP
proxyserver = "61.158.163.13:16816"

# 1. Build a password management object to hold the username and password to be processed
passwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# 2. Add account information. The first parameter, realm, is domain information related
#    to the remote server; generally nobody cares about it, so it is written as None.
#    The following three parameters are the proxy server, username, and password.
passwdmgr.add_password(None, proxyserver, user, passwd)
# 3. Build a ProxyBasicAuthHandler processor object for proxy basic username/password
#    authentication; the parameter is the password management object created above.
#    Note that the ordinary ProxyHandler class is no longer used here.
proxyauth_handler = urllib.request.ProxyBasicAuthHandler(passwdmgr)
# 4. Create a custom opener object from this handler with the build_opener() method
opener = urllib.request.build_opener(proxyauth_handler)
# 5. Construct the request. The Request class built here amounts to a custom request,
#    but the customization only sets its URL, so it is no different from passing the
#    URL directly.
request = urllib.request.Request("http://www.baidu.com/")
# 6. Send the request using the custom opener
response = opener.open(request)
# 7. Print the response content
print(response.read().decode('utf-8'))

For proxies, you can also refer to this article: 79074219

    • Cookies

First of all, what is a cookie? It is data (stored as a .txt file) that a server keeps on your computer so that it can identify your machine later. While you browse a website, the web server sends a small piece of information to your computer, and the cookie records the text you entered or the choices you made on the site. On your next visit to the same site, the web server first checks whether the cookie information from last time is present; if so, it identifies you from the cookie's contents and sends you page content specific to you.

First, an example of how to get cookies from a website:

import http.cookiejar, urllib.request

# A CookieJar object must be declared first
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

====================== RESTART: F:\python\exercise\1.py ======================
BAIDUID=76B5F4D5A2EF8ADD2571BABCBAA63F79:FG=1
BIDUPSID=76B5F4D5A2EF8ADD2571BABCBAA63F79
H_PS_PSSID=1993_1454_21093_26350_26922_22158
PSTM=1534469599
BDSVRTM=0
BD_HOME=0
delPer=0

Then, following a blogger's post found online, I did some further study: 69817490

Then I tried to save the cookie information to a file:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.build_opener(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Running it, the program produced an error.

I'll leave this pitfall here for now; if anyone knows where I went wrong, please point it out.
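
One likely cause: build_opener() expects handler objects, but the code above passes the MozillaCookieJar itself to the first build_opener() call. Wrapping the jar in HTTPCookieProcessor, exactly as in the earlier example, should fix it. A sketch of the corrected version, with loading the saved cookies back for good measure:

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
# The CookieJar must be wrapped in an HTTPCookieProcessor handler
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Save cookies to the file, keeping session and expired cookies as well
cookie.save(ignore_discard=True, ignore_expires=True)

# The saved cookies can later be loaded back from the file
cookie2 = http.cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)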

