Python crawler: HTTP protocol and the Requests library
HTTP protocol:
HTTP stands for Hypertext Transfer Protocol. A URL is the Internet path used to access a resource over HTTP; each URL corresponds to one data resource.
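To make the relationship between a URL and a resource concrete, here is a minimal sketch that splits a URL into its parts with the standard library. The host is the PMCAFF address used later in this article; the path and query string are hypothetical and only there for illustration.

```python
from urllib.parse import urlsplit

# Host taken from this article; "/article/index?page=2" is a made-up example path.
url = "https://www.pmcaff.com/article/index?page=2"
parts = urlsplit(url)

print(parts.scheme)  # protocol used to reach the resource
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the resource on that host
print(parts.query)   # extra parameters sent with the request
```

The scheme says how to talk to the server, the host says which server, and the path plus query identify the resource itself.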
HTTP operations on resources:
The Requests library provides all of the basic HTTP request methods. Documentation: http://www.python-requests.org/en/master
Six main methods of the Requests Library:
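As a rough illustration, assuming the six methods meant here are the helper functions named after the HTTP methods GET, HEAD, POST, PUT, PATCH and DELETE, the sketch below looks them up and builds two requests without sending them. The example.com addresses and the parameter names are placeholders, not part of the original article.

```python
import requests

# Assumed list: the six HTTP methods and the same-named Requests helpers.
HTTP_METHODS = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]
helpers = {m: getattr(requests, m.lower()) for m in HTTP_METHODS}
print(sorted(helpers))

# Build (but do not send) two requests to see what would go on the wire.
# example.com, "q" and "key" are placeholders.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()
post_req = requests.Request(
    "POST", "https://example.com/submit", data={"key": "value"}
).prepare()

print(get_req.method, get_req.url)    # params end up in the URL
print(post_req.method, post_req.body) # data ends up in the request body
```

Preparing a request instead of sending it keeps the example offline; in a real crawler you would call requests.get(url) or requests.post(url, data=...) directly.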
Requests library exceptions:
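The sketch below lists a few of the exception classes that Requests defines and checks that they share the base class requests.RequestException. The selection of classes is my own and is not necessarily the list the original article showed. The last part triggers one of them on purpose, without touching the network, by leaving the scheme out of the URL.

```python
import requests

# A selection of Requests exceptions; all derive from RequestException.
EXCEPTIONS = [
    requests.exceptions.ConnectionError,   # network problem, e.g. DNS failure
    requests.exceptions.HTTPError,         # raised by raise_for_status()
    requests.exceptions.URLRequired,       # a valid URL is required
    requests.exceptions.TooManyRedirects,  # redirect limit exceeded
    requests.exceptions.ConnectTimeout,    # timed out while connecting
    requests.exceptions.Timeout,           # the request timed out
]
for exc in EXCEPTIONS:
    print(exc.__name__, issubclass(exc, requests.RequestException))

# Trigger an exception locally: the URL has no "https://" on purpose,
# so Requests rejects it before any connection is attempted.
caught = None
try:
    requests.get("www.pmcaff.com", timeout=30)
except requests.exceptions.MissingSchema as err:
    caught = err
print(type(caught).__name__)
```

Because the classes share one base, a crawler can catch requests.RequestException to handle all of them in one place.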
The Requests library has two important objects: Request and Response. The Request object supports multiple request methods. The Response object contains all the information returned by the server, together with the information about the request that produced it.
Attributes of the Response object:
r.encoding is the encoding guessed from the response headers. If the Content-Type header of a text response does not specify a charset, it falls back to ISO-8859-1.
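r.raise_for_status() checks r.status_code for you: it raises an HTTPError when the status code indicates an error (4xx or 5xx), so there is no need to compare r.status_code with 200 by hand.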
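The two behaviours described above can be checked offline. This is a sketch under two assumptions: it calls requests.utils.get_encoding_from_headers, the helper Requests uses to derive the encoding from headers, with hand-written header dictionaries, and it sets status_code on an empty Response object instead of contacting a server. The function name status_is_error is hypothetical.

```python
import requests
from requests.utils import get_encoding_from_headers

# 1) Encoding implied by the headers alone.
no_charset = get_encoding_from_headers({"content-type": "text/html"})
with_charset = get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"}
)
print(no_charset, with_charset)

# Why the fallback matters: UTF-8 bytes decoded as ISO-8859-1 come out garbled.
raw = "网络爬虫".encode("utf-8")
garbled = raw.decode("ISO-8859-1")
readable = raw.decode("utf-8")
print(garbled != readable)

# 2) Which status codes make raise_for_status() raise.
def status_is_error(code):
    """Return True if raise_for_status() raises for this status code."""
    r = requests.Response()  # hand-built response, no network involved
    r.status_code = code
    try:
        r.raise_for_status()
    except requests.HTTPError:
        return True
    return False

results = {code: status_is_error(code) for code in (200, 204, 404, 500)}
print(results)
```

This is why the crawling framework assigns r.apparent_encoding to r.encoding before reading r.text: the header-based guess is often wrong for pages that do not declare a charset.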
Comparison between HTTP protocol and Requests Library:
The general code framework for crawling webpages:
    import requests

    def getHtmlText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
            r.encoding = r.apparent_encoding  # encoding detected from the content
            return r.text
        except requests.RequestException:
            return 'exception occurred'
For example, to obtain the PMCAFF homepage information:
    import requests

    def getHtmlText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
            r.encoding = r.apparent_encoding  # encoding detected from the content
            return r.text
        except requests.RequestException:
            return 'exception occurred'

    if __name__ == '__main__':
        url = 'https://www.pmcaff.com/'
        print(getHtmlText(url))
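The framework above returns the same string for every kind of failure. As one possible variation, not part of the original course material, the sketch below reports why the request failed. The function name fetch and the error messages are hypothetical. The demo call leaves the scheme out of the URL on purpose, so it fails locally without any network access.

```python
import requests

def fetch(url, timeout=30):
    """Return (text, error); exactly one of the two is None."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text, None
    except requests.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        return None, f"bad status: {err}"
    except requests.Timeout:
        # No answer within the timeout.
        return None, "timed out"
    except requests.RequestException as err:
        # Any other Requests failure (invalid URL, connection error, ...).
        return None, f"request failed: {type(err).__name__}"

# "https://" is missing on purpose, so Requests rejects the URL immediately.
text, error = fetch("www.pmcaff.com")
print(error)
```

The more specific except clauses come first because HTTPError and Timeout are both subclasses of RequestException.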
Environment used for the code above: Mac, Python 3.6, and PyCharm 2016.2
Reference: the course "Python Web Crawler and Information Extraction" on China University MOOC
----- End -----
Author: du wangdan, Public Account: du wangdan, Internet product manager.