Python Web Crawler and Information Extraction -- 1. Introduction to the requests Library


1. More information: http://www.python-requests.org

2. Installation: on Windows, open cmd with "Run as administrator" and execute pip install requests
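
Once installed, a quick way to confirm that the library is importable (run in the same cmd window):

python -c "import requests; print(requests.__version__)"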

3. The seven main methods of the requests library:

requests.request()   constructs a request; the base method underlying all of the methods below
requests.get()       the main method for fetching an HTML page, corresponding to HTTP GET
requests.head()      fetches the header information of an HTML page, corresponding to HTTP HEAD
requests.post()      submits a POST request to an HTML page, corresponding to HTTP POST
requests.put()       submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch()     submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete()    submits a deletion request to an HTML page, corresponding to HTTP DELETE
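
The six HTTP-specific methods are thin wrappers around requests.request(). A minimal sketch of the correspondence (httpbin.org is used here only as an example endpoint):

import requests

# These two calls send the same HTTP GET request:
# requests.get() simply delegates to requests.request('GET', ...)
r1 = requests.get('http://httpbin.org/get')
r2 = requests.request('GET', 'http://httpbin.org/get')
print(r1.status_code, r2.status_code)  # typically: 200 200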

4. The get() method

(1) r = requests.get(url)

get(url) constructs a Request object that asks the server for the resource.

r is the Response object returned, which contains the resource sent back by the server.

(2) requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch
params: extra parameters appended to the URL, as a dictionary or byte stream; optional
**kwargs: 12 optional parameters that control access
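
For example, a dictionary passed as params is URL-encoded and appended to the link (httpbin.org is used here only as a test endpoint):

import requests

r = requests.get('http://httpbin.org/get', params={'key': 'value', 'page': '2'})
print(r.url)          # http://httpbin.org/get?key=value&page=2
print(r.status_code)  # 200 on success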

(3) Properties of the Response object:

r.status_code          the HTTP status code of the request; 200 means success, other codes (e.g. 404) mean failure
r.text                 the response body as a string, i.e. the content of the page at the URL
r.encoding             the response encoding guessed from the HTTP headers
r.apparent_encoding    the encoding inferred from the response content (a fallback)
r.content              the response body in binary form

r.encoding: if no charset field is present in the headers, the encoding is assumed to be ISO-8859-1
r.text: the page content, decoded according to r.encoding
r.apparent_encoding: the encoding inferred by analyzing the page content; can serve as a fallback for r.encoding
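
A short sketch of how these properties interact (the exact encodings printed depend on the server's response):

import requests

r = requests.get('http://www.baidu.com')
print(r.status_code)          # 200 indicates success
print(r.encoding)             # guessed from the HTTP headers; ISO-8859-1 if no charset is given
print(r.apparent_encoding)    # inferred from the page content, e.g. utf-8
r.encoding = r.apparent_encoding  # adopt the content-based guess so r.text decodes correctly
print(r.text[:200])           # the first 200 characters of the decoded page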

5. A general code framework for crawling web pages

(1) Exceptions raised by the requests library

requests.ConnectionError    network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError          HTTP error
requests.URLRequired        a URL is missing
requests.TooManyRedirects   the maximum number of redirects was exceeded
requests.ConnectTimeout     timed out while connecting to the remote server
requests.Timeout            the request timed out

(2) Exception handling on the Response object

r.raise_for_status()  raises requests.HTTPError if the status code is not 200; see the general code framework below

(3) The general code framework

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raises requests.HTTPError if the status is not 200
        r.encoding = r.apparent_encoding  # decode using the content-based encoding guess
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

6. The HTTP protocol

(1) The URL format is: http://host[:port][path]

host: a legal Internet host domain name or IP address
port: the port number; defaults to 80
path: the path of the requested resource
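
As an illustration, Python's standard urllib.parse module (independent of requests) splits a URL into exactly these components:

from urllib.parse import urlparse

u = urlparse('http://www.example.com:8080/path/to/page')
print(u.hostname)  # www.example.com
print(u.port)      # 8080
print(u.path)      # /path/to/page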

(2) Operations of the HTTP protocol on resources

GET      requests the resource at the URL location
HEAD     requests the response headers of the resource at the URL location, i.e. its header information
POST     appends new data to the resource at the URL location
PUT      stores a resource at the URL location, overwriting the resource that was there
PATCH    partially updates the resource at the URL location, i.e. changes part of that resource
DELETE   requests deletion of the resource stored at the URL location

POSTing a dictionary to a URL is automatically encoded as a form; POSTing a string is automatically encoded as raw data, as the sketch below shows.
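
A sketch of the two behaviors against httpbin.org (an assumed echo service that reports back what it received):

import requests

# A dictionary is encoded as a form (application/x-www-form-urlencoded)
r = requests.post('http://httpbin.org/post', data={'key1': 'value1', 'key2': 'value2'})
print(r.json()['form'])  # {'key1': 'value1', 'key2': 'value2'}

# A plain string is sent as raw request data instead
r = requests.post('http://httpbin.org/post', data='ABC')
print(r.json()['data'])  # ABC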

7. The main methods of the requests library in detail

(1) requests.request(method, url, **kwargs)

method: the request method, one of the 7 kinds such as GET, PUT, or POST
url: the URL of the page to fetch
**kwargs: 13 optional parameters that control access

method: the request method
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

**kwargs: optional parameters that control access

params            dictionary or byte sequence, appended to the URL as query parameters
data              dictionary, byte sequence, or file object, used as the body of the request
json              data in JSON format, used as the body of the request
headers           dictionary, custom HTTP headers
cookies           dictionary or CookieJar, the cookies to send with the request
auth              tuple, enables HTTP authentication
files             dictionary, used to upload files
timeout           the timeout in seconds
proxies           dictionary, sets access proxies; the proxy URL may carry login credentials
allow_redirects   True/False, default True; switch for following redirects
stream            True/False, default False; when True, the response body is not downloaded until accessed
verify            True/False, default True; switch for verifying the SSL certificate
cert              path to a local SSL client certificate
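
A sketch combining several of these parameters in a single call (httpbin.org is used only as an example endpoint; the header value and the commented-out proxy address are illustrative, not required):

import requests

r = requests.request(
    'GET',
    'http://httpbin.org/get',
    params={'q': 'python'},                 # appended to the URL as ?q=python
    headers={'User-Agent': 'Mozilla/5.0'},  # custom HTTP header
    timeout=10,                             # give up after 10 seconds
    allow_redirects=True,                   # follow redirects (the default)
    verify=True,                            # verify the SSL certificate (the default)
    # proxies={'http': 'http://user:pass@10.10.10.1:1080'}  # example proxy with login credentials
)
print(r.url)          # http://httpbin.org/get?q=python
print(r.status_code)  # 200 on success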

(2) requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch
params: extra parameters appended to the URL, as a dictionary or byte stream; optional
**kwargs: 12 optional parameters that control access (the count differs per method: any parameter named explicitly in the signature is taken out of the 13 control parameters of requests.request())

(3) requests.head(url, **kwargs)

url: the URL of the page to fetch
**kwargs: 13 optional parameters that control access

(4) requests.post(url, data=None, json=None, **kwargs)

url: the URL of the page to be updated
data: dictionary, byte sequence, or file; the body of the request
json: data in JSON format; the body of the request
**kwargs: 11 optional parameters that control access

(5) requests.put(url, data=None, **kwargs)

url: the URL of the page to be updated
data: dictionary, byte sequence, or file; the body of the request
**kwargs: 12 optional parameters that control access

(6) requests.patch(url, data=None, **kwargs)

url: the URL of the page to be updated
data: dictionary, byte sequence, or file; the body of the request
**kwargs: 12 optional parameters that control access

(7) requests.delete(url, **kwargs)

url: the URL of the page to delete
**kwargs: 13 optional parameters that control access
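
A short sketch exercising several of these methods against httpbin.org (an assumed test service); note that PUT resubmits the whole resource while PATCH sends only the changed fields:

import requests

# HEAD fetches only the headers; the body stays empty
r = requests.head('http://httpbin.org/get')
print(r.headers['Content-Type'])
print(len(r.text))  # 0: HEAD returns no body

# PUT replaces the resource, so every field must be resubmitted
r = requests.put('http://httpbin.org/put',
                 data={'name': 'alice', 'email': 'alice@example.com'})
print(r.status_code)

# PATCH submits only the field that changed, saving bandwidth
r = requests.patch('http://httpbin.org/patch', data={'email': 'new@example.com'})
print(r.status_code)

# DELETE requests removal of the resource at the URL
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)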
