Python crawler (requests)
Most people who start writing crawlers in Python probably begin with urllib and urllib2, and only later come across the third-party library requests. requests covers just about every HTTP need and is really easy to use :D
The official documentation puts it this way:
"Requests is the only Non-GMO HTTP library for Python, safe for human consumption. Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, powered by urllib3, which is embedded within Requests."
----- From the official documentation (http://cn.python-requests.org/zh_CN/latest)
Run "pip install requests" to install it (assuming pip is already installed).
What are you waiting for? import requests and join the feast.
Let's look at several commonly used methods and attributes:
1. requests.Session() creates a session that persists across requests, so cookies are kept.
2. requests.get() fetches a web page. Use the params parameter to send query data along with the request:
d = {'key1': 'value1', 'key2': 'value2'}
requests.get('URL', params=d)
You can also customize the request headers of a GET with the headers parameter:
h = {'key1': 'value1', 'key2': 'value2'}
requests.get('URL', headers=h)
3. requests.post() sends a POST request. In the same way, you can send data (with the data parameter) and custom request headers (with the headers parameter).
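To see what these three calls put on the wire without touching the network, here is a minimal sketch that only prepares requests and inspects them. The URL (example.com), the parameter names, the User-Agent string and the cookie value are all placeholders assumed for illustration.

```python
import requests

# Placeholder URL and values, assumed purely for illustration.
URL = 'http://example.com/search'

# params= becomes the query string of a GET.
get_req = requests.Request(
    'GET', URL, params={'key1': 'value1', 'key2': 'value2'}
).prepare()
print(get_req.url)

# data= is form-encoded into the body of a POST.
post_req = requests.Request(
    'POST', URL, data={'user': 'alice', 'pwd': 'secret'}
).prepare()
print(post_req.body)
print(post_req.headers['Content-Type'])

# A Session merges its default headers and stored cookies
# into every request it prepares.
session = requests.Session()
session.headers.update({'User-Agent': 'my-crawler/0.1'})
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')
sess_req = session.prepare_request(requests.Request('GET', URL))
print(sess_req.headers['User-Agent'])
print(sess_req.headers.get('Cookie'))
```

Nothing is sent here; calling session.send(sess_req) would actually transmit the prepared request.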
Some common attributes:
eg = requests.get('URL')
eg.text  # the response body as text, for example the fetched web page
eg.encoding = 'utf-8'  # if the text comes back garbled, set the encoding that matches the page (utf-8, gb2312, ...) so that it displays correctly
eg.content  # the response body as bytes, for example the verification-code image at login or any other non-text resource
eg.cookies  # the cookies currently stored
eg.status_code  # the HTTP status code (such as 200 OK or 404 Not Found)
eg.url  # the URL of the current request
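Why changing the encoding fixes garbled text: the text attribute is the bytes in content decoded with whatever encoding is set. The sketch below reproduces that effect with plain bytes, with no request involved; the sample string is an assumption chosen for illustration.

```python
# Bytes as a GB2312 page might deliver them (sample text assumed for illustration).
raw = '教务处'.encode('gb2312')

# Decoding with the wrong codec gives replacement characters (garbled text).
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the codec the page really uses gives the readable text back.
right = raw.decode('gb2312')

print(repr(wrong))
print(right)
```

Setting eg.encoding = 'gb2312' before reading eg.text plays the role of the second decode.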
For more details, see the official documentation (http://cn.python-requests.org/zh_CN/latest)
Well, that little bit is all you need to know to write a simple crawler.
An interesting phenomenon: when students learn to write crawlers, they all head for the same website, their school's Academic Affairs Office, haha. The example here is no different: it logs in to the Academic Affairs Office site of a school (Chengdu University of Information Technology).
First, open the Academic Affairs Office site in a browser, press F12 to open the developer tools, log in normally, and analyze the data sent during login.
1. The login page of the Academic Affairs Office is http://210.41.224.117/Login/xLogin/Login.asp.
2. In the Network tab of the developer tools, the address the POST data is sent to after logging in is also http://210.41.224.117/Login/xLogin/Login.asp.
3. At the same time, the post data includes the following:
Parameter list
Form name | Example | Description
WinW | 1366 | Screen resolution, width
WinH | 728 | Screen resolution, height
TxtId | 2013215042 | Student ID
TxtMM | 123456 | Password
Verifycode | 123a | Verification code
CodeKey | 597564 | Dynamic login code, visible in the HTML source
Login | Check | Login type (fixed)
IbtnEnter.x | 10 | Login button click position (x)
IbtnEnter.y | 10 | Login button click position (y)
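The parameter list maps directly onto a dict that can be passed to post() through the data parameter. A minimal sketch, assuming a hypothetical helper named build_login_form and the sample values from the list; urlencode from the standard library is used here only to preview what a form-encoded body looks like, since requests does the encoding itself.

```python
from urllib.parse import urlencode


def build_login_form(username, password, yzm, code_key):
    """Hypothetical helper: assemble the login form fields from the parameter list."""
    return {
        'WinW': '1366',        # screen width
        'WinH': '728',         # screen height
        'TxtId': username,     # student ID
        'TxtMM': password,     # password
        'Verifycode': yzm,     # verification code typed by the user
        'CodeKey': code_key,   # dynamic login code taken from the HTML
        'Login': 'Check',      # login type (fixed)
        'IbtnEnter.x': 10,     # click position on the login button
        'IbtnEnter.y': 10,
    }


# Sample values taken from the parameter list.
form = build_login_form('2013215042', '123456', '123a', '597564')
body = urlencode(form)
print(body)
```

Only the verification code and CodeKey change between logins, which is why the crawler has to fetch both before it can post the form.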
These are the login POST form fields as shown in the developer tools. The login code:
# coding=utf-8
import io
import random
import re
import time

import requests
from PIL import Image


def login(username, password):
    # Request headers, used when refreshing the verification code and when sending the POST
    headers = {
        'Host': '210.41.224.117',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://210.41.224.117/Login/xLogin/Login.asp',
        'Connection': 'keep-alive',
    }
    session = requests.Session()
    # Request the student home page twice so that it redirects to the login page
    step1 = session.get('http://jxgl.cuit.edu.cn/JXGL/xs/MainMenu.asp')
    step1 = session.get('http://jxgl.cuit.edu.cn/Jxgl/Xs/MainMenu.asp')
    # Extract the redirect URL, which carries the OSid
    get_osid_url = re.compile(r'content="0;\s*URL=(.*?)">')
    osid_url = get_osid_url.findall(step1.text)
    step2 = session.get(osid_url[0])  # follow the redirect (key point 1)
    # Extract codeKey (parameter k)
    get_codeKey = re.compile(r'var codeKey = \'(.*?)\';')
    codeKey = get_codeKey.findall(step2.text)
    # Build parameter t: 10-digit timestamp plus three random digits
    timeKey = str(time.time())[:10] + str(random.randint(100, 999))
    payload = {'k': codeKey[0], 't': timeKey}
    yzm_url = 'http://210.41.224.117/Login/xLogin/yzmDvCode.asp'
    # Refresh the verification code (key point 2)
    yzmdata = session.get(yzm_url, params=payload, headers=headers)
    tempIm = io.BytesIO(yzmdata.content)
    im = Image.open(tempIm)
    im.show()
    yzm = input('Please enter yzm: ')  # type in the code shown in the image
    # Field names as listed in the parameter list
    post_data = {
        'WinW': '1366',
        'WinH': '728',
        'TxtId': username,
        'TxtMM': password,
        'Verifycode': yzm,
        'CodeKey': codeKey[0],
        'Login': 'Check',
        'IbtnEnter.x': 10,
        'IbtnEnter.y': 10,
    }
    post_url = 'http://210.41.224.117/Login/xLogin/Login.asp'
    step3 = session.post(post_url, data=post_data, headers=headers)
    return session


cuitJWC = login('username', 'password')
con = cuitJWC.get('http://jxgl.cuit.edu.cn/JXGL/xs/MainMenu.asp')
con.encoding = 'gb2312'
print(con.text)
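The two regular-expression steps and the t parameter in the login code can be tried out without the site. This is a minimal sketch: the helper names extract_first and make_time_key and both HTML snippets are hypothetical, written only to resemble the pages the patterns expect, and the OSid value is made up.

```python
import random
import re
import time

# Patterns for the meta-refresh redirect and the codeKey variable.
REDIRECT_RE = re.compile(r'content="0;\s*URL=(.*?)">')
CODEKEY_RE = re.compile(r"var codeKey = '(.*?)';")


def extract_first(pattern, text):
    """Return the first captured group, or None when the page does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


def make_time_key():
    """10-digit timestamp followed by three random digits (parameter t)."""
    return str(time.time())[:10] + str(random.randint(100, 999))


# Hypothetical page fragments, assumed for illustration only.
sample_redirect = ('<meta http-equiv="refresh" '
                   'content="0;URL=http://210.41.224.117/Login/xLogin/Login.asp?OSid=abc">')
sample_login = "<script>var codeKey = '597564';</script>"

print(extract_first(REDIRECT_RE, sample_redirect))
print(extract_first(CODEKEY_RE, sample_login))
print(make_time_key())
```

Returning None instead of indexing the findall result with [0] makes it easy to notice when the site has changed its pages and a pattern no longer matches.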
If you reprint this article, please credit the source: http://www.cnblogs.com/lucky-pin/p/5806394.html