HTTP protocol:
HTTP (Hypertext Transfer Protocol) is the protocol used to transfer hypertext between a client and a server. A URL is the Internet path used to access a resource over HTTP, and each URL corresponds to one data resource.
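To make the idea of "a URL is a path to a resource" concrete, here is a small sketch that splits a URL into its parts using only the Python standard library. The URL is a made-up placeholder, not one taken from this article.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "http://www.example.com:8080/docs/index.html"

parts = urlsplit(url)
print(parts.scheme)    # protocol used to reach the resource
print(parts.hostname)  # server that holds the resource
print(parts.port)      # port on that server
print(parts.path)      # path of the resource on the server
```

The scheme names the protocol, the host and port locate the server, and the path identifies the resource on that server.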
The HTTP protocol operates on resources:
The Requests library provides all the basic request methods for HTTP. Official Introduction:
The 6 main methods of the requests library are:
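As a rough sketch, and assuming the six methods meant here are the functions that mirror the HTTP methods one to one, they can be collected like this. The name SIX_METHODS and the example.com URL are placeholders for illustration; the request is only built, never sent.

```python
import requests

# Assumed list: the six functions that mirror HTTP methods one to one.
SIX_METHODS = {
    "GET": requests.get,        # fetch a resource
    "HEAD": requests.head,      # fetch only the response headers
    "POST": requests.post,      # append new data to a resource
    "PUT": requests.put,        # store a resource, replacing the old one
    "PATCH": requests.patch,    # update part of a resource
    "DELETE": requests.delete,  # delete a resource
}

# Build (but do not send) a HEAD request to a placeholder URL.
prepared = requests.Request("HEAD", "http://www.example.com/index.html").prepare()
print(prepared.method, prepared.url)
```

Each function takes the URL as its first argument, so requests.head(url) and requests.get(url) differ only in the HTTP method that goes on the wire.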
Exceptions in the Requests library:
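The exception classes below all exist in requests.exceptions. The sketch that follows is only one possible way to tell them apart; the helper name describe and the description strings are my own assumptions, not part of the library.

```python
import requests

# Checked in order, so the more specific ConnectTimeout comes before
# its parent classes Timeout and ConnectionError.
KNOWN_ERRORS = [
    (requests.exceptions.ConnectTimeout, "timed out while connecting to the server"),
    (requests.exceptions.Timeout, "the request timed out"),
    (requests.exceptions.ConnectionError, "network-level failure, e.g. DNS lookup or refused connection"),
    (requests.exceptions.HTTPError, "the server answered with an error status"),
    (requests.exceptions.TooManyRedirects, "the redirect limit was exceeded"),
    (requests.exceptions.URLRequired, "a valid URL is required"),
]

def describe(exc):
    """Return a short description for an exception (hypothetical helper)."""
    for exc_type, text in KNOWN_ERRORS:
        if isinstance(exc, exc_type):
            return text
    return "other error"

print(describe(requests.exceptions.ConnectTimeout()))
print(describe(requests.exceptions.HTTPError()))
```

All of these classes inherit from requests.exceptions.RequestException, so catching that one class is enough when the kind of failure does not matter.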
The Requests library has two important objects that correspond to each other: Request and Response. A Request object can represent a request made with any of the supported methods. The Response object contains everything the server returned, together with the information about the request that produced it.
Properties of the Response object:
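Here r.encoding is the encoding guessed from the HTTP headers: if no charset is present in the header of a text response, the encoding is assumed to be ISO-8859-1.
r.raise_for_status() checks the status code for you: it raises requests.HTTPError when the code signals an error (4xx or 5xx), so there is no need to compare r.status_code with 200 by hand.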
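Both behaviours can be tried without touching the network. This is a sketch under two assumptions: the headers are made-up examples, and the Response object is built by hand to stand in for a real server reply.

```python
import requests
from requests.utils import get_encoding_from_headers

# r.encoding is derived from the Content-Type header of the response.
with_charset = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})
without_charset = get_encoding_from_headers({"content-type": "text/html"})
print(with_charset, without_charset)

# Hand-built Response standing in for a "404 Not Found" reply.
r = requests.Response()
r.status_code = 404
try:
    r.raise_for_status()
    outcome = "ok"
except requests.HTTPError:
    outcome = "HTTPError"
print(outcome)
```

Because a missing charset falls back to ISO-8859-1, pages in other encodings can come out garbled, which is why the framework below assigns r.apparent_encoding (guessed from the content itself) to r.encoding before reading r.text.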
HTTP protocol vs. Requests library:
Common code framework for crawling Web pages:
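    import requests

    def gethtmltext(url):
        try:
            r = requests.get(url, timeout=5)
            # Raises requests.HTTPError if the status code signals an error (4xx or 5xx).
            r.raise_for_status()
            # Use the encoding guessed from the content rather than from the headers.
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return 'produces an exception'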
For example, fetch the PMCAFF home page:
    import requests

    def gethtmltext(url):
        try:
            r = requests.get(url, timeout=5)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return 'generate exception'

    if __name__ == '__main__':
        url = ''  # fill in the URL of the PMCAFF home page
        print(gethtmltext(url))
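One possible variation on the framework reports which kind of exception occurred instead of returning a fixed string. This is only a sketch: the function name fetch_or_report and the message format are my own assumptions, and the URL is a placeholder that deliberately lacks the "http://" prefix so that the failure happens before any network traffic.

```python
import requests

def fetch_or_report(url, timeout=5):
    """Return the page text, or a short message naming the failure (hypothetical helper)."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as exc:
        return "request failed: " + type(exc).__name__

# The scheme is missing on purpose, so Requests rejects the URL itself.
result = fetch_or_report("www.example.com/index.html")
print(result)
```

With the scheme left out, Requests raises MissingSchema, a subclass of RequestException, so the helper returns a message instead of crashing the crawler.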
Operating environment: Mac, Python 3.6, PyCharm 2016.2
Reference: China University MOOC course "Python Web Crawler and Information Extraction"
-----End-----
Author: Du Wangdan, Internet product manager. WeChat official account: Du Wangdan.