1. Get the Baidu homepage HTML source code:
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code    # check the status code: 200 means the access succeeded, anything else means it failed
200
>>> r.encoding = 'utf-8'    # change the encoding to utf-8
>>> r.text           # print the page content
>>> r.headers        # print the response headers
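r.headers returns the response headers as a dictionary-like object, and in Requests header lookup is case-insensitive. A minimal sketch of typical usage (the header values shown are illustrative, not what Baidu necessarily returns):

>>> r.headers['Content-Type']    # case-insensitive: 'content-type' would work too
'text/html'
>>> 'Server' in r.headers
True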
2. The main methods of the Requests library (a usage sketch follows the list):
requests.request(): constructs a request; the base method that underlies all of the methods below
requests.get(): the main method for fetching an HTML page, corresponding to HTTP GET
requests.head(): gets the header information of an HTML page, corresponding to HTTP HEAD
requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST
requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch(): submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete(): submits a delete request to an HTML page, corresponding to HTTP DELETE
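These methods all share the same calling pattern as get(). A minimal sketch exercising a few of them against the public httpbin.org echo service (chosen here purely as an illustrative target; the payload keys are made up):

import requests

# HEAD: fetch only the response headers, not the body
r = requests.head("http://httpbin.org/get")
print(r.headers)

# POST: submit form data; httpbin echoes it back under the "form" field
payload = {"key1": "value1", "key2": "value2"}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.json()["form"])

# PUT: overwrite the resource with the submitted data
r = requests.put("http://httpbin.org/put", data=payload)

# DELETE: ask the server to delete the resource
r = requests.delete("http://httpbin.org/delete")
print(r.status_code)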
1. requests.get()
r = requests.get("URL")    # get("URL") constructs a Request object that asks the server for resources
# r is a Response object containing the server's resources
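requests.get() also accepts optional keyword arguments, for example params to attach URL query parameters and timeout to bound the wait. A minimal sketch (httpbin.org and the parameter values are only illustrative):

import requests

# params are encoded into the query string: .../get?key1=value1
r = requests.get("http://httpbin.org/get",
                 params={"key1": "value1"},
                 timeout=30)
print(r.url)    # the full URL that was actually requested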
3. Common properties of the Response object (here, r):
r.status_code: the status returned by the HTTP request; 200 means the connection succeeded, 404 indicates failure, and so on. Anything other than 200 is a failure.
r.text: the string form of the HTTP response content, i.e. the page content at the URL
r.encoding: the encoding of the response content, guessed from the HTTP headers
r.apparent_encoding: the encoding of the response content, analyzed from the content itself (a fallback encoding)
r.content: the binary form of the HTTP response content
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.text               # garbled text
>>> r.encoding           # encoding found via the charset keyword in the HTML header; if no charset is present, ISO-8859-1 is assumed
'ISO-8859-1'
>>> r.apparent_encoding  # the correct encoding of the page, obtained by analyzing its content
'utf-8'
>>> r.encoding = 'utf-8'
>>> r.text               # now prints the content in the format we want
4. A universal code framework for crawling web pages
1. What is the universal code framework for crawling web pages?
It is a set of code that can crawl web pages accurately and reliably.
Exceptions in the Requests library:
r.raise_for_status(): if the status is not 200, raises a requests.HTTPError exception
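For example, requesting an endpoint that deliberately returns a 404 (httpbin.org is used here only for illustration):

>>> import requests
>>> r = requests.get("http://httpbin.org/status/404")
>>> r.status_code
404
>>> r.raise_for_status()    # raises requests.HTTPError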
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()    # if the status is not 200, raise an HTTPError exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
The getHTMLText() function above is the universal code framework for crawling web pages.
MOOC "Python web crawler and Information extraction" learning process note "Requests library" the first week of 1-3