[Python crawler topic] Parsing Method <1> Urllib library method summary, pythonurllib


What is urllib: a Python built-in HTTP request library. It mainly contains four modules:

urllib.request: the request module, used to simulate sending page requests
urllib.error: the exception handling module, which ensures the program is not unexpectedly terminated by runtime errors
urllib.parse: the URL parsing module, used as a tool module for processing URLs
urllib.robotparser: the robots.txt parsing module, used to parse a website's robots file (a small sketch follows below)
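The first three modules are covered in the sections below. As a quick aside, here is a minimal sketch of urllib.robotparser; the target site is only illustrative:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")  # illustrative site
rp.read()  # fetch and parse the robots.txt file
# check whether a given user agent is allowed to fetch a given path
print(rp.can_fetch("*", "http://www.baidu.com/s"))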

Urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

In practice you mainly use the first three parameters: url, data (the body to send with a POST request), and timeout.
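For instance, a minimal sketch of a plain GET request; the URL and the 5-second timeout are just illustrative choices:

import urllib.request

# simple GET request with a timeout; raises an error if the server
# does not respond within 5 seconds
response = urllib.request.urlopen("http://httpbin.org/get", timeout=5)
print(response.status)                  # HTTP status code, e.g. 200
print(response.read().decode("utf-8"))  # response body as text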

Request

When you need to send a more complicated request, for example one that carries custom headers, use the Request class.
Note: a Request object only describes the request; you still pass it to urlopen to actually send it and get the response back.
For example:
 

# Request settings
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Host": "httpbin.org"
}
dict = {'name': 'Custom'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = request.Request(url=url, headers=headers, method='POST', data=data)
response = request.urlopen(req)
print(response.read().decode("utf-8"))
Handler

There are many Handlers described in the official Python documentation. The first one to learn is the proxy handler (ProxyHandler).
By using a proxy you can change the IP address from which resources are requested, so if an IP gets banned you can switch to another one and keep running. It is best not to abuse this, though; it makes life hard for the people running the websites (→_→).
Using a proxy is fairly simple. Remember this routine:
  

import urllib.request

url = "http://www.baidu.com"
# note that a dictionary is passed in
proxy_handler = urllib.request.ProxyHandler({
    "http": 'http://127.45.97.1:9743',
    "https": 'http://127.45.97.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open(url)
print(response.read())
Cookie (data stored on the user's local terminal)

A cookie is a mechanism by which a server or script can maintain information on the client machine under the HTTP protocol. It is a small text file that the web server stores in the user's browser (the client) and that can contain user information; whenever the user connects to the server, the website can read the cookie.
The most common use of cookies is login information. When writing crawlers we can use cookies to keep a login session alive and avoid the login verification that some websites require. Likewise, if you delete the cookies yourself, the login state is lost.
  
Below is a demonstration with a small piece of code

# about cookies
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)  # this is also a handler
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + " = " + item.value)

We can also save cookies to a txt file, but the usage is slightly different:
in the code above, the cookie object should instead be declared as follows:
cookie = http.cookiejar.MozillaCookieJar(filename)
At the end you call cookie.save() to write the file, as sketched below.
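A minimal sketch of that variant; the filename is arbitrary:

# save cookies in Mozilla/Netscape format
import http.cookiejar, urllib.request

filename = 'cookies_mozilla.txt'  # arbitrary file name
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)  # write the cookies to disk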

Cookies can also be saved in other formats, for example the common LWP format. Sample code:

# cookies can also be saved in the 'LWP' format
import http.cookiejar, urllib.request

# save cookies to a local file
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# load cookies
cookie.load("cookies.txt", ignore_discard=True, ignore_expires=True)
print(response.read().decode('utf-8'))

In this way there is no problem accessing pages that can only be reached after logging in.

Exception Handling

Opening the official Python 3 documentation and looking up urllib.error, we find that urllib.request can raise three kinds of errors:
URLError, HTTPError, and ContentTooShortError. Handling them uses Python's standard exception mechanism.

from urllib import request, error

url = '######'
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print("request successfully!")
URL Parsing: urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
URL parsing, as the name implies, breaks a URL down into several distinct fragments. When the input urlstring carries no scheme (protocol), the scheme parameter supplies a default.
If allow_fragments is set to False, the fragment part of the urlstring is not split out separately; it is instead folded into the preceding component, such as path or query.
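A minimal sketch of what urlparse returns; the URL is chosen only for illustration:

from urllib.parse import urlparse

# split a URL into scheme, netloc, path, params, query and fragment
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#             params='user', query='id=5', fragment='comment')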
  

Urlunparse

This is essentially the inverse of urlparse: you pass in an iterable containing the URL components and it assembles them into a complete urlstring.
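A minimal sketch; the six components below are arbitrary illustrative values:

from urllib.parse import urlunparse

# the six components: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?id=5#comment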

Urlencode

urlencode converts a dictionary into GET request parameters. A GET request URL has the form "url?" + "parameter1" + "&" + "parameter2" + ...; by splicing a base URL with a urlencode-d dictionary of parameters we get a complete GET request URL.
Example:

from urllib.parse import urlencode

params = {
    'name': 'vincent',
    'age': '19',
    'occupation': 'student'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Running result:

http://www.baidu.com?name=vincent&age=19&occupation=student
The above methods are used a lot; review them often!
Copyright disclaimer: this article is an original article by the blogger and cannot be reproduced without the blogger's permission. http://blog.csdn.net/hiha_hero2333/article/details/79150848
