Python Crawler---Requests library usage


Requests is a simple, easy-to-use HTTP library implemented in Python, and it is much simpler to use than urllib.

Because it is a third-party library, it needs to be installed from the command line before use:

pip install requests

Once the installation is complete, import it; if the import succeeds, you can start using it.
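
A quick way to confirm the installation is to import the library and print its version, for example:

import requests

# If this import raises ImportError, re-run "pip install requests"
print(requests.__version__)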

Basic usage:

requests.get() is used to request the target site and returns a Response object:

import requests

response = requests.get('http://www.baidu.com')
print(response.status_code)  # print the status code
print(response.url)          # print the request URL
print(response.headers)      # print the header information
print(response.cookies)      # print the cookie information
print(response.text)         # print the page source as text
print(response.content)      # print the page content as bytes

Output:

Status code: 200

URL: http://www.baidu.com/

Header information, cookies, and the page source follow.

Various request methods:

import requests

requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')

Basic GET Request

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

Results

GET request with parameters:

The first approach places the parameters directly in the URL:

import requests

response = requests.get('http://httpbin.org/get?name=gemey&age=22')
print(response.text)

Results

The second approach puts the parameters in a dict and passes it as the params argument when the request is made:

import requests

data = {
    'name': 'tom',
    'age': 22
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)

The result is the same as above.

Parsing JSON

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)
print(response.json())  # response.json() is equivalent to json.loads(response.text)
print(type(response.json()))

Results
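
If the response body is not valid JSON, response.json() raises an error, so it can be worth guarding the call; a minimal sketch:

import requests

response = requests.get('http://httpbin.org/get')
try:
    data = response.json()
    print(data['url'])
except ValueError:
    # the body was not valid JSON
    print('response is not JSON')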

Saving a binary file

The binary content of a response is in response.content:

import requests

response = requests.get('http://img.ivsky.com/img/tupian/pre/201708/30/kekeersitao-002.jpg')
b = response.content
with open('f://fengjing.jpg', 'wb') as f:
    f.write(b)

Add header information to your request

import requests

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 ' \
                        '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 ' \
                        '(KHTML, like Gecko) Version/5.1 Safari/534.50'
response = requests.get('http://www.baidu.com', headers=headers)

Using proxies

As with headers, the proxies parameter is also a dict.

The example below uses the requests library to crawl the IPs, ports, and proxy types listed on a free proxy site.

Because the proxies are free, the addresses tend to become invalid quickly.

import requests
import re

def get_html(url):
    proxy = {
        'http': '120.25.253.234:812',
        'https': '163.125.222.244:8123'
    }
    heads = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X METASR 1.0'
    }
    req = requests.get(url, headers=heads, proxies=proxy)
    html = req.text
    return html

def get_ipport(html):
    regex = r'<td data-title="IP">(.+)</td>'
    iplist = re.findall(regex, html)
    regex2 = r'<td data-title="PORT">(.+)</td>'
    portlist = re.findall(regex2, html)
    regex3 = r'<td data-title="type">(.+)</td>'
    typelist = re.findall(regex3, html)
    sumray = []
    # pair each IP with its port and type
    for i, p, t in zip(iplist, portlist, typelist):
        sumray.append(t + ',' + i + ':' + p)
    print('High-anonymity proxies')
    print(sumray)

if __name__ == '__main__':
    url = 'http://www.kuaidaili.com/free/'
    get_ipport(get_html(url))

Results:
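
If a proxy requires authentication, requests also accepts credentials embedded in the proxy URL; a minimal sketch with placeholder values:

import requests

# the username, password, host, and port below are placeholders, not working values
proxies = {
    'http': 'http://user:password@10.10.1.10:3128',
    'https': 'http://user:password@10.10.1.10:1080'
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)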

Basic POST Request:

import requests

data = {'name': 'tom', 'age': 22}
response = requests.post('http://httpbin.org/post', data=data)
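
Besides form data, a JSON body can be sent directly through the json parameter, which serializes the dict and sets the Content-Type header; a minimal sketch:

import requests

payload = {'name': 'tom', 'age': 22}
response = requests.post('http://httpbin.org/post', json=payload)  # sent as application/json
print(response.json()['json'])  # httpbin echoes the parsed JSON body back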

Get cookies

# get cookies
import requests

response = requests.get('http://www.baidu.com')
print(response.cookies)
print(type(response.cookies))
for k, v in response.cookies.items():
    print(k + ':' + v)

Results:
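
Cookies can also be sent with a request by passing a dict to the cookies parameter; a minimal sketch:

import requests

cookies = {'number': '12345'}
response = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(response.text)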

Session Maintenance

import requests

session = requests.Session()
session.get('http://httpbin.org/cookies/set/number/12345')
response = session.get('http://httpbin.org/cookies')
print(response.text)

Results:
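
A Session can also carry default headers that are sent with every request made through it; a minimal sketch:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'my-crawler/0.1'})  # 'my-crawler/0.1' is just an example value
response = session.get('http://httpbin.org/headers')
print(response.text)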

Certificate Validation Settings

import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # suppress the urllib3 warning
response = requests.get('https://www.12306.cn', verify=False)  # disable certificate verification
print(response.status_code)

Printed result: 200
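
Instead of disabling verification, verify can also point to a CA bundle file, and cert can supply a client certificate; a minimal sketch with placeholder paths:

import requests

# the certificate paths below are placeholders
response = requests.get('https://example.com',
                        verify='/path/to/ca_bundle.pem',
                        cert=('/path/to/client.cert', '/path/to/client.key'))
print(response.status_code)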

Timeout exception capture

import requests
from requests.exceptions import ReadTimeout

try:
    res = requests.get('http://httpbin.org', timeout=0.1)
    print(res.status_code)
except ReadTimeout:
    print('timeout')
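
The timeout can also be given as a (connect, read) tuple to limit the connection and read phases separately; a minimal sketch:

import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

try:
    # 3.05 seconds to connect, 10 seconds to read the response
    res = requests.get('http://httpbin.org/get', timeout=(3.05, 10))
    print(res.status_code)
except (ConnectTimeout, ReadTimeout):
    print('timeout')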

Exception handling

When you are not sure what errors may occur, use try...except to catch exceptions.

All requests exceptions:

import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://www.baidu.com', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except HTTPError:
    print('HTTPError')
except RequestException:
    print('ReqError')

