Python web crawler (1)

Tags: http, authentication, ssl certificate

Install the requests library
1. Open CMD and run pip install requests, then press Enter to install.
2. Type python in the terminal to enter Python's interactive interpreter.
3. The following commands crawl the content of the Baidu homepage:

C:\Users\ftsdata-02>python  # type python to enter the interactive interpreter
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests  # import the requests library
>>> r = requests.get("http://www.baidu.com")  # fetch the content of the Baidu homepage with the get method of the requests library
>>> r  # inspect the returned value; 200 means success
<Response [200]>
>>> r.status_code  # status_code also shows whether the page was fetched successfully; it shows 200, so the fetch succeeded
200
>>> r.encoding = 'utf-8'  # convert the fetched Baidu page to utf-8 character encoding
>>> r.text  # display the crawled page content
'<!DOCTYPE html>\r\n<!--STATUS OK--> ...

Summary:
The 7 main methods of the requests library are:
requests.request()  constructs a request; it is the base method underlying the six methods below

requests.request(method, url, **kwargs)
method: the request method, one of the seven kinds (GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS)
url: the URL of the page to fetch
**kwargs: 13 optional parameters that control access

requests.request('GET', url, **kwargs)
requests.request('HEAD', url, **kwargs)
requests.request('POST', url, **kwargs)
requests.request('PUT', url, **kwargs)
requests.request('PATCH', url, **kwargs)
requests.request('DELETE', url, **kwargs)
requests.request('OPTIONS', url, **kwargs)
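
Each of these calls is equivalent to the corresponding convenience method described later (requests.get(), requests.head(), and so on). A minimal sketch of the equivalence:

>>> import requests
>>> r1 = requests.request('GET', 'http://www.baidu.com')  # the base method
>>> r2 = requests.get('http://www.baidu.com')             # shorthand that calls requests.request('GET', ...) internally
>>> r1.status_code == r2.status_code
True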

**kwargs: optional parameters that control access
params: dictionary or byte sequence, appended to the URL as query parameters
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2

data: dictionary, byte sequence, or file object, used as the body of the request
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
>>> body = 'subject content'
>>> r = requests.request('POST', 'http://python123.io/ws', data=body)

json: data in JSON format, used as the body of the request
>>> kv = {'key1': 'value1'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)

headers: dictionary, custom HTTP headers
>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)

cookies: dictionary or CookieJar, the cookies sent with the request
auth: tuple, enables HTTP authentication
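Neither parameter is demonstrated in the original; a minimal sketch of both, reusing the python123.io test URL from the examples above (the cookie and credential values are placeholders):
>>> cj = {'session_id': 'abc123'}  # a plain dictionary works in place of a CookieJar; values are placeholders
>>> r = requests.request('GET', 'http://python123.io/ws', cookies=cj)
>>> r = requests.request('GET', 'http://python123.io/ws', auth=('user', 'pass'))  # HTTP Basic authentication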
files: dictionary, for transferring files
>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

timeout: sets the timeout, in seconds
>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

proxies: dictionary, sets access proxies; login authentication can be added
>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects: True/False, default True; switch for following redirects
stream: True/False, default False; when True, the response body is not downloaded immediately but streamed on demand
verify: True/False, default True; switch for SSL certificate verification
cert: path to a local SSL client certificate
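
None of these four switches is demonstrated in the original; a minimal sketch (the HTTPS hosts and certificate path are placeholders):

>>> r = requests.request('GET', 'http://www.baidu.com', allow_redirects=False)  # do not follow redirects
>>> r = requests.request('GET', 'http://www.baidu.com', stream=True)            # defer downloading the body
>>> r = requests.request('GET', 'https://self-signed.example.com', verify=False)   # skip SSL certificate verification
>>> r = requests.request('GET', 'https://example.com', cert='/path/to/client.pem') # supply a local SSL client certificate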


requests.get()     the main method for fetching an HTML page, corresponding to HTTP GET
requests.head()    gets the header information of an HTML page, corresponding to HTTP HEAD
requests.post()    submits a POST request to an HTML page, corresponding to HTTP POST
requests.put()     submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch()   submits a local modification request to an HTML page, corresponding to HTTP PATCH
requests.delete()  submits a delete request to an HTML page, corresponding to HTTP DELETE


r = requests.get(url) constructs a Request object that asks the server for a resource, and returns a Response object containing the server's resources.
requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters added to the URL, in dictionary or byte-stream format, optional
**kwargs: 12 optional parameters that control access

2 important objects of the requests library
Request and Response objects

The Response object contains all the content that the crawler returned
>>> import requests  # import the requests library
>>> r = requests.get("http://www.baidu.com")
>>> print(r.status_code)
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, Jun 2018 11:48:31 GMT', 'Last-Modified': 'Mon, 23 Jan 13:27:36 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

Properties of the Response object:
r.status_code        the HTTP status of the request; 200 means the connection succeeded, 404 means failure
r.text               the string form of the HTTP response body, i.e. the page content at the URL
r.encoding           the response encoding guessed from the HTTP headers
r.apparent_encoding  the response encoding deduced from the content (fallback encoding method)
r.content            the binary form of the HTTP response body

>>> r.apparent_encoding
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text
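
r.content is not demonstrated above; a minimal sketch that uses it to save a binary resource to disk (the image URL and file name are placeholders):

>>> r = requests.get('http://www.example.com/sample.png')
>>> with open('sample.png', 'wb') as f:
...     f.write(r.content)  # r.content holds the raw bytes of the response body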

Understanding the encoding of a response
r.encoding: the response encoding guessed from the HTTP headers; if no charset is present in the headers, the encoding is assumed to be ISO-8859-1
r.apparent_encoding: the response encoding deduced by analysing the page content (fallback encoding method)
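
A minimal sketch of the fallback: the Baidu response headers shown earlier carry no charset ('Content-Type': 'text/html'), so r.encoding falls back to ISO-8859-1, while r.apparent_encoding recovers the real encoding from the body:

>>> r = requests.get('http://www.baidu.com')
>>> r.encoding            # guessed from the headers; no charset present, so the default applies
'ISO-8859-1'
>>> r.apparent_encoding   # deduced from the page content
'utf-8'
>>> r.encoding = r.apparent_encoding  # adopt the deduced encoding so r.text decodes correctly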

Understanding the exceptions of the requests library:
requests.ConnectionError   network connection error, such as a DNS lookup failure or a refused connection
requests.HTTPError         HTTP error
requests.URLRequired       URL missing
requests.TooManyRedirects  the maximum number of redirects was exceeded
requests.ConnectTimeout    timed out while connecting to the remote server
requests.Timeout           the request to the URL timed out
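
A minimal sketch that catches two of these exceptions explicitly, instead of the bare except used in the framework below (the tiny timeout is deliberate, to force the error):

>>> import requests
>>> try:
...     r = requests.get('http://www.baidu.com', timeout=0.001)
... except requests.Timeout:
...     print('the request timed out')
... except requests.ConnectionError:
...     print('the network connection failed')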

r.raise_for_status()  raises requests.HTTPError if the status code is not 200

>>> import requests
>>> def getHTMLText(url):
...     try:
...         r = requests.get(url, timeout=30)
...         r.raise_for_status()  # raise requests.HTTPError if the status is not 200
...         r.encoding = r.apparent_encoding
...         return r.text
...     except:
...         return
...
>>> if __name__ == '__main__':
...     url = "http://www.baidu.com"
...     print(getHTMLText(url))

HTTP protocol: the Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on the "request and response" model.
The HTTP protocol uses URLs as identifiers for locating network resources.
URL format: http://host[:port][path]
host: a legitimate Internet host domain name or IP address
port: the port number; the default is 80
path: the path of the requested resource

URLs are Internet paths for accessing resources through the HTTP protocol; each URL corresponds to one data resource.
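
The standard library can split a URL into exactly these components; a minimal sketch using urllib.parse:

>>> from urllib.parse import urlparse
>>> u = urlparse('http://www.baidu.com:80/index.html')
>>> u.hostname, u.port, u.path  # host, port, and path as defined above
('www.baidu.com', 80, '/index.html')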

Operations of the HTTP protocol on resources:
GET     requests the resource at the URL location
HEAD    requests the response header of the resource at the URL location, i.e. obtains that resource's header information
POST    appends new data to the resource at the URL location
PUT     stores a resource at the URL location, overwriting the resource originally there
PATCH   locally updates the resource at the URL location, i.e. changes part of that resource
DELETE  deletes the resource stored at the URL location

Understanding the difference between PATCH and PUT
Suppose the URL location holds a set of data, UserInfo, containing 20 fields such as UserID and UserName.
Requirement: the user changes UserName and nothing else.
With PATCH, only a local update request for UserName is submitted to the URL.
With PUT, all 20 fields must be submitted to the URL; any field not submitted is deleted.

The biggest advantage of PATCH: it saves network bandwidth.
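
A minimal sketch of the difference, using httpbin.org as a neutral test endpoint (the field names and values are placeholders standing in for the 20-field UserInfo):

>>> full = {'UserID': '1', 'UserName': 'new_name'}  # with PUT, the other 18 fields would have to be included too
>>> r = requests.put('http://httpbin.org/put', data=full)
>>> r = requests.patch('http://httpbin.org/patch', data={'UserName': 'new_name'})  # PATCH submits only the changed field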

The post() method of the requests library
Example: POST a dictionary to a URL; it is automatically encoded as a form:
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
}
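
When a string rather than a dictionary is posted (as with data=body earlier), httpbin reports it under "data" instead of "form"; a minimal sketch:

>>> r = requests.post('http://httpbin.org/post', data='subject content')
>>> print(r.text)
{
  ...
  "data": "subject content",
  "form": {},
}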

The put() method of the requests library
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
}
