Python 2.7
IDE: PyCharm 5.0.3
Requests 2.10
It's time to sit down and take a careful look at requests.
Installation Method
Here I'll only cover how to add the requests package under PyCharm + Anaconda2; for how to set up Anaconda2 with PyCharm, see the answers from @zhusleep and @muziki.
The installation then looks roughly like this: simple and quick, no pip, no easy_install. Anaconda2 really is that convenient.
Use case
Let's get GitHub's public timeline; the URL is https://github.com/timeline.json
If you open this link in a browser, you should see something like this.
OK, how did we handle it before we met requests?
With urllib2 and urllib, of course.
import urllib2

url = 'https://github.com/timeline.json'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
print html
It throws an error.
Is it a problem with the code?
Let's try a different URL:
import urllib2

url = 'http://www.bing.com'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
print html
Everything works and the content comes back fine, so why did the first URL fail?
Is the page dynamic? Let me switch to selenium and try to fetch it.
from selenium import webdriver

url = 'https://github.com/timeline.json'  # GitHub timeline
# driver = webdriver.Firefox()
driver = webdriver.PhantomJS(executable_path='phantomjs.exe')
driver.get(url)
pre = driver.find_element_by_xpath('//body/pre')
print pre.text
The crawl succeeds. So what exactly was the problem? Is a dynamic page playing tricks? I'll add more once I figure it out.
Addendum 1
First, an introduction to 410 Gone.
410 Gone (permanently removed)
The requested resource is no longer available on the origin server and no forwarding address is known. This condition is expected to be permanent. Clients with link-editing capabilities should delete references to the request URI after user approval. If the server does not know, or cannot easily determine, whether the condition is permanent, the 404 (Not Found) status should be used instead. The response is cacheable unless indicated otherwise. The 410 response is intended primarily to assist web maintenance by telling the recipient that the resource is intentionally unavailable and that the server owner wants remote links to it removed. Such an event is common for limited-time promotional services and for resources belonging to people who no longer work at the server's site. It is not necessary to mark all permanently unavailable resources as "gone", or to keep the mark for any particular length of time; that is left to the server owner's judgment.
For url = 'https://github.com/timeline.json', urllib2 didn't really misbehave: the server returns status 410, and testing with requests shows the same status.
import requests, os, time

url = 'https://github.com/timeline.json'
start = time.clock()
html = requests.get(url, allow_redirects=True)
end = time.clock()
print html.status_code
print html.text
Results
410
{"message":"Hello there, wayfaring stranger. If you're reading this then you probably didn't see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://developer.github.com/v3/activity/events/#list-public-events"}
So requests can fetch a body where urllib2 raised an error, and I still don't know why; the URL does open in a browser, too. GitHub really is a magical place. I'll add more once I know.
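One likely explanation, sketched here from the two libraries' documented behaviour rather than from the original experiment: urllib2.urlopen raises urllib2.HTTPError for any 4xx/5xx status, so the 410 surfaces as an exception, while requests always hands back a Response object and only raises if you call raise_for_status() yourself.
import urllib2
import requests

url = 'https://github.com/timeline.json'

# urllib2 turns a 4xx/5xx status into an exception
try:
    urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print 'urllib2 raised HTTPError:', e.code   # e.g. 410
    print e.read()[:80]                         # the body is still readable from the exception

# requests just records the status on the Response object
r = requests.get(url)
print 'requests status:', r.status_code         # e.g. 410
# r.raise_for_status() would raise requests.exceptions.HTTPError here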
Supplement: static and dynamic web pages
Traditional crawlers use static downloading. The advantage of static downloading is that it is fast, but the page is just a dull chunk of HTML, so link analysis can only pick up the href attribute of <a> tags (or, if you are skilled, you can analyse the JS and form tags yourself to capture some extra links). In Python you can use the urllib2 module or the requests module for this. Dynamic crawlers have a particular advantage in the Web 2.0 era, where pages are processed with JavaScript and content is fetched asynchronously via Ajax, so a dynamic crawler has to analyse the page after the JavaScript has run and the Ajax content has arrived. The straightforward solution at the moment is to work directly with a WebKit-based module; PyQt4, splinter and selenium can all do the job. A crawler doesn't need a browser UI, so a headless browser is very cost-effective; HtmlUnit and PhantomJS are the available headless browsers.
Back to business: crawling with requests
import requests

url = 'https://github.com/timeline.json'
html = requests.get(url)
print html.text
The crawl succeeds and returns:
{"Message": "Hello there, Wayfaring stranger." If you are reading this then you probably didn ' t, we blog post a couple of years back announcing the this API would go Away:http://git.io/17arog fear not, your should is able to get what your need from the shiny new Events API instead. "," Doc Umentation_url ":" Https://developer.github.com/v3/activity/events/#list-public-events "}
requests automatically decodes content from the server, and most Unicode charsets are decoded seamlessly. After the request is made, requests makes an educated guess about the response's encoding based on the HTTP headers, and that guessed encoding is what gets used when you access r.text.
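A minimal sketch of inspecting and overriding that guess (these are standard requests attributes; the charset value is just illustrative):
import requests

r = requests.get('https://github.com/timeline.json')
print r.encoding          # the encoding requests guessed from the headers, e.g. 'utf-8'
r.encoding = 'utf-8'      # you may override the guess before touching r.text
print type(r.text)        # r.text is decoded with r.encoding -> <type 'unicode'>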
requests feels simple and readable, much like selenium: a single get(url) does the job, and then you just print what you caught with .text.
Binary response content (for non-text)
If you are dealing with text, .content returns a str, so there is essentially no difference.
import requests

url = 'https://github.com/timeline.json'
html_content = requests.get(url).content
print type(html_content)
print html_content
<type ' str ' >
{"message": "Hello there, Wayfaring stranger." If you are reading this then you probably didn ' t, we blog post a couple of years back announcing the this API would go Away:http://git.io/17arog fear not, your should is able to get what your need from the shiny new Events API instead. "," Doc Umentation_url ":" Https://developer.github.com/v3/activity/events/#list-public-events "}
If you are working with an image, read on.
We import Image from the PIL module and use StringIO to read the bytes. The test site is the kitten placeholder service http://placekitten.com/500/700; opening that page in a browser should look like this.
A quick aside on StringIO
Because file objects and StringIO share most of their methods (read, readline, readlines, write, writelines are all there), StringIO works very nicely as an "in-memory file object".
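A minimal sketch of StringIO acting like a file (Python 2's StringIO module, the same one used below):
from StringIO import StringIO

buf = StringIO('line one\nline two\n')   # build a "file" from a string
print buf.readline()                      # behaves like file.readline()
print buf.read()                          # read the rest of the "file"
buf.write('and it is writable too')       # writing works as well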
First, let's try crawling it with the traditional urllib2.
import urllib2
from PIL import Image
from StringIO import StringIO

url = 'http://placekitten.com/500/700'
req = urllib2.Request(url)
response = urllib2.urlopen(req)
html = response.read()
print type(html)
print html
i = Image.open(StringIO(html))
i.show()
<type ' str ' >
? JFIF??? ? ; Creator:gd-jpeg v1.0 (using IJG JPEG v62), quality =
C ... Omit N garbled
The result is shown below.
Now let's crawl it with requests and .content.
from PIL import Image
from StringIO import StringIO
import requests

url = 'http://placekitten.com/500/700'
html_content = requests.get(url).content  # <type 'str'>
html_text = requests.get(url).text        # <type 'unicode'>
print type(html_content)
print html_content
print type(html_text)
print html_text
i = Image.open(StringIO(html_content))
i.show()
<type ' str ' >
? JFIF??? ? ; Creator:gd-jpeg v1.0 (using IJG JPEG v62), quality =
C ... Omit n garbled
<type ' Unicode ' >
? JFIF??? ? ; Creator:gd-jpeg v1.0 (using IJG JPEG v62), quality =
C ... Omit N garbled
The result is the same as above and runs fine, but if you swap in the unicode text instead of the str bytes, like this:
i = Image.open(StringIO(html_text))
it will clearly raise an error; for details see @Green South's Small World – "requests content and text cause an lxml parsing problem".
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
req.text returns unicode data, while req.content returns bytes (a str in Python 2). In other words, req.content gives you the raw byte stream exactly as it arrived, before any decoding.
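A minimal sketch of the difference (the payload is identical, only the type changes; it assumes the server declares a charset, as this endpoint does):
import requests

r = requests.get('https://github.com/timeline.json')
print type(r.content)                           # <type 'str'>     - raw bytes from the wire
print type(r.text)                              # <type 'unicode'> - bytes decoded with r.encoding
print r.content.decode(r.encoding) == r.text    # usually True: decoding the bytes yourself reproduces r.text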
Responding to JSON content
requests also has a built-in JSON decoder that helps you process JSON data.
import requests

url = 'https://github.com/timeline.json'
html_json = requests.get(url).json()
print type(html_json)
print html_json
Returns a dict type
<type ' Dict ' >
{
u ' documentation_url ': U ' https://developer.github.com/v3/activity/events/# List-public-events ',
u ' message ': U ' Hello there, wayfaring stranger. If You\u2019re Reading This then for you probably didn\u2019t your blog post a couple of years back announcing the this AP I would go Away:http://git.io/17arog fear is not, you should being able to get what your need from the shiny new Events API Inst EAD. '}
If you don't know anything about JSON, here is what it is.
JSON (JavaScript Object Notation) is a lightweight data-interchange format based on a subset of ECMAScript. JSON uses a completely language-independent text format, but it also borrows conventions familiar from the C family of languages (C, C++, C#, Java, JavaScript, Perl, Python, and so on). These properties make JSON an ideal data-interchange language: easy for people to read and write, and easy for machines to parse and generate (it is commonly used to keep network payloads small).
OK, so there is no real difference between this and a Python dictionary; just treat it as a lightweight dictionary structure.
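A minimal sketch of that round trip with the standard json module (the sample dict is made up for illustration):
import json

record = {'name': 'inception', 'year': 2010}   # a plain Python dict (made-up data)
text = json.dumps(record)                      # dict -> JSON string
print text                                     # e.g. {"name": "inception", "year": 2010} (key order may vary)
print json.loads(text)['name']                 # JSON string -> dict, then an ordinary key lookup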
To see all the keys, values, and (key, value) pairs: dict.keys(), dict.values(), dict.items(); each returns a list.
import requests

url = 'https://github.com/timeline.json'
html_json = requests.get(url).json()
print type(html_json)
print html_json.keys()
for key in html_json:              # iterating over a dict yields its keys by default
    print key
for values in html_json.values():
    print values
<type ' Dict ' >
#--------------------------------#
[u ' Documentation_url ', u ' message ']
#-------- ------------------------#
Documentation_url
Message
#--------------------------------
# https://developer.github.com/v3/activity/events/#list-public-events
Hello there, Wayfaring stranger. If you are reading this then you probably didn ' t, we blog post a couple of years back announcing the this API would go Away:http://git.io/17arog fear not, your should is able to get what your need from the shiny new Events API instead.
Well, that was a bit of a digression...
Custom request headers
If you want to add HTTP headers to the request, simply pass a dict to the headers parameter.
Here you need to import json; anything involving dictionaries should make you think of JSON, otherwise why would I have written so much about it...
import requests, json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
html_json = requests.post(url, data=json.dumps(payload), headers=headers)
print html_json
print html_json.text
<response [404]>
{' message ': ' Not Found ', ' documentation_url ': ' Https://developer.github.com/v3 '}
So, WTF, why did this simulated experiment fail? As @Reverse Driving points out in "Python requests learning notes (4) – custom request headers and POST", a 404 error means the specified page does not exist. Ah, hello, of course it doesn't, so let's switch to another site: http://httpbin.org/post
import requests, json

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
html_json = requests.post(url, data=json.dumps(payload), headers=headers)
print html_json
print html_json.text
<response [200]>
{
"args": {},
"data": "{\" some\ ": \" Data\ "}",
"files": {},
"form": {},
"Headers": {"
Accept": "*/*",
"accept-encoding": "gzip, deflate",
"Content-length": ",
" Added Content-type
"Content-type": "Application/json",
"Host": "httpbin.org",
"user-agent": " python-requests/2.10.0 "
},
#增加了json
" json ": {
" some ":" Data "
},
" origin ":" 221.212.116.44 ",
url": "Http://httpbin.org/post"
}
Posting a form
The form tag is used to create an HTML form.
import requests, json

url = 'http://httpbin.org/post'
payload = {'key1': 'values1', 'key2': 'values2'}
html_json = requests.post(url, data=payload)
print html_json
print html_json.text
The result is
<response [200]>
{
"args": {},
"Data": "",
"files": {},
#增加了表单
"form": {"
Key1": " Values1 ",
" Key2 ":" Values2 "
},
" headers ": {"
Accept ":" */* ",
" accept-encoding ":" Gzip, Deflate ",
" content-length ":" ","
content-type ":" application/x-www-form-urlencoded ",
" Host ":" Httpbin.org ",
" user-agent ":" python-requests/2.10.0 "
},
" JSON ": null,
" origin ":" 221.212.116.44 ",
url": "Http://httpbin.org/post"
}
POST a multipart-encoded file
Here I upload a txt file containing a passage I had written down earlier.
import requests, json

url = 'http://httpbin.org/post'
files = {'file': open('inception.txt', 'rb')}
html_file = requests.post(url, files=files)
print html_file
print html_file.text
<response [200]>
{
"args": {},
"Data": "",
"files": {
"file": "\r\ n-------------------------------------\u6211\u662f\u5206\u5272\u7ebf-----------------------------------------\ R\ninception ........ Omit n "
},
" form ": {},"
headers ": {" Accept ":
" */* ",
" accept-encoding ":" gzip, deflate ",
" Content-length ":" 32796 ","
content-type ":" Multipart/form-data; boundary=a4ba16fec9054637b7cb6f264013988b ","
Host ":" httpbin.org ",
" user-agent ":" python-requests/ 2.10.0 "
},
" JSON ": null,
" origin ":" 221.212.116.44 ",
" url ":" Http://httpbin.org/post "
}
Now that I have just learned about .json(), let me parse the response as JSON and use a dictionary lookup to check whether the uploaded file matches the local one.
Add under the above procedure:
print html_file.json()
print html_file.json()['files']['file']   # dictionary lookup to pull out the value
The result is
{u'files': {u'file': u'\r\n-------------------------------------\u6211\u662f\u5206\u5272\u7ebf---------......... (N characters omitted)
-------------------------------------I am the divider line-----------------------------------------
Inception plot logic fully explained (come in if you didn't get it; don't come in if you haven't seen it) ... (N characters omitted)
So it turns out the file did get uploaded, but in the raw JSON the text is stored as Unicode escapes, and print turns that unicode back into readable utf-8, so the check succeeds. For eyeballing the escapes, an online transcoding tool comes in handy.
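A minimal sketch of what is happening: the JSON stores the characters as \u escapes, and printing the resulting unicode object renders them readably (the sample string is the divider line from the upload above; it assumes a console that can display Chinese):
# -*- coding: utf-8 -*-
s = u'\u6211\u662f\u5206\u5272\u7ebf'
print repr(s)   # u'\u6211\u662f\u5206\u5272\u7ebf' - the escaped form stored in the JSON
print s         # 我是分割线 - print encodes it for the console, so it becomes readable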
Another digression
When I wrote a new txt file (that test txt was typed on this machine), like the one below, uploaded it, and fetched it back, I found that it had been base64-encoded.
import requests, json

url = 'http://httpbin.org/post'
files = {'file': open('post_file.txt', 'rb')}
html_file = requests.post(url, files=files)
print html_file.text
print html_file.json()['files']['file']
The result is this.
{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,1eLKx9K7uPay4srUo6ENCnRoaXMgaXMgYSB0ZXN0o6E="
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "181",
    "Content-Type": "multipart/form-data; boundary=6cd3e994e14d428e9df61d7e1aade15e",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.10.0"
  },
  "json": null,
  "origin": "221.212.116.44",
  "url": "http://httpbin.org/post"
}
data:application/octet-stream;base64,1eLKx9K7uPay4srUo6ENCnRoaXMgaXMgYSB0ZXN0o6E=
Not what I wanted: instead of printing the contents of the txt file I uploaded, it still needs decoding. First let's check whether the content is actually what I expect, using an online Base64 encoder/decoder (UTF-8).
OK, it really is my content, just encoded, so let's try decoding it:
import base64
print base64.b64decode('1eLKx9K7uPay4srUo6ENCnRoaXMgaXMgYSB0ZXN0o6E=')
The result is
Һԣ
this is a test
So, WTF, what is going on now?
The base64 decodes fine... so why are the Chinese characters and the exclamation marks still garbled? Does it have something to do with the encoding I used when I saved the txt file? Heaven save me from character encodings...
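A quick sanity check supports that hypothesis. This is a sketch assuming the txt file was saved in the Windows ANSI code page for Chinese (GBK): decoding the base64 bytes explicitly as GBK, instead of letting the console guess, gives back readable text (again assuming the console can display Chinese):
import base64

raw = base64.b64decode('1eLKx9K7uPay4srUo6ENCnRoaXMgaXMgYSB0ZXN0o6E=')
print raw.decode('gbk')   # the bytes are GBK-encoded; decoding them as GBK recovers the original text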
Let me test again.
# -*- coding: utf-8 -*-
import base64

s = '1eLKx9K7uPay4srUo6ENCnRoaXMgaXMgYSB0ZXN0o6E='   # the string returned by httpbin earlier
h = '这是一个测试！this is a test!'
f = base64.b64encode(h)
print f
print base64.b64decode(f)
Output
6L+Z5piv5LiA5Liq5rWL6K+V77yBdGhpcyBpcyBhIHRlc3Qh
这是一个测试！this is a test!
That works.
Paste it into that online tool and take a look.
Okay, now I feel like the encoding is just mocking me. I'll sort it out next time.
Addendum 2
The garbling above happens because I saved the txt file with ANSI encoding.
Thanks to "[Python crawler] Chinese encoding problems: str/unicode/utf-8 conversion issues with raw_input, file reading, variable comparison and so on" for the hint. I realised what was going on, saved the file as utf-8, and the problem went away; there is no need to base64-decode anything either.
import requests

url = 'http://httpbin.org/post'
files = {