The core idea of a crawler: simulate a normal browser visit to the server. In general, anything a browser can access can be crawled; if anti-crawling measures block the request, repeatedly test adding request-header fields until the page can be crawled.
Known anti-crawling techniques include: User-Agent checks, Cookies, Referer checks, access-rate limits, CAPTCHAs, required user login, and front-end JS validation. This example runs into four of them: JS validation (the sign parameter), User-Agent, Referer, and Cookie.
The key part is constructing the headers and data parameters: the headers require repeated testing, and the variables inside data require working out how each one is generated.
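The trickiest variable in data is sign, which can be reproduced in isolation. A minimal sketch (Python 3 syntax here for convenience; the secret constant comes from the page's front-end JS and changes whenever Youdao updates it, so treat it as a snapshot in time):

```python
import hashlib
import time

# Secret constant taken from the page's front-end JS at the time of writing;
# it changes whenever Youdao updates its JS.
SECRET = "ebSeFb%=XZ%T[KZ)c(sy!"

def make_salt():
    # Millisecond timestamp, matching the page's JS: "" + (new Date).getTime()
    return str(int(time.time() * 1000))

def make_sign(keyword, salt):
    # sign = md5("fanyideskweb" + keyword + salt + SECRET), hex-encoded
    raw = "fanyideskweb" + keyword + salt + SECRET
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    salt = make_salt()
    print(salt, make_sign("hello", salt))
```

Because salt is a timestamp, the server only needs to recompute the same MD5 to verify that the request came from its own front end rather than a bare script.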
Resources:
Using Python to crack Youdao Translate's anti-crawler mechanism (75294947)
Cracking NetEase's anti-crawler mechanism with Python (79522067)
Some common anti-crawling mechanisms (79841901)
On the Youdao Translate page, typing a string into the left box makes the translated result appear automatically on the right.
After typing a few test characters, we noticed that the page never refreshed, which suggested Ajax; capturing the traffic confirmed that the data is indeed transmitted via an Ajax POST request.
The code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import time
import hashlib

url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
keyword = raw_input('Please enter a string to translate: ')

# headers mimics a browser's request headers
headers = {
    # "Accept": "application/json, text/javascript, */*; q=0.01",
    # "Connection": "keep-alive",
    # "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "your browser Cookie value",
    "Referer": "http://fanyi.youdao.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36",
    # "X-Requested-With": "XMLHttpRequest",
}

# salt is a millisecond timestamp; sign is the MD5 of a fixed client name,
# the input, the salt, and a secret constant taken from the page's JS
salt = str(int(time.time() * 1000))
m = hashlib.md5()
sign_str = "fanyideskweb" + keyword + salt + "ebSeFb%=XZ%T[KZ)c(sy!"
m.update(sign_str)
sign = m.hexdigest()
print(sign)

# data is the POST form submitted to the server
data = {
    "i": keyword,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": salt,
    "sign": sign,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false",
}

# URL-encode the POST data
data = urllib.urlencode(data)

# urllib can only encode URLs; it cannot build a Request instance or set
# headers. urllib2 can build a Request but cannot encode, so the two are
# often used together. urllib2.urlopen(url) cannot construct a complex
# request, so use urllib2.Request(url, data=data, headers=headers):
# passing data makes this a POST, and headers (a dict) mimics a browser's
# request headers so the server sees what looks like normal browser traffic.
request = urllib2.Request(url, data=data, headers=headers)
response = urllib2.urlopen(request)
print(response.read())
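The script above is Python 2 only (urllib2, raw_input). As a hedged sketch, the same logic ports to Python 3 with urllib.request and urllib.parse; this assumes the endpoint and form fields still behave as described above, which may no longer hold since Youdao updates its API:

```python
import hashlib
import time
import urllib.parse
import urllib.request

URL = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
# Secret constant from the page's front-end JS at the time of writing
SECRET = "ebSeFb%=XZ%T[KZ)c(sy!"

def build_form(keyword):
    """Build the POST form the same way the Python 2 script does."""
    salt = str(int(time.time() * 1000))
    sign = hashlib.md5(
        ("fanyideskweb" + keyword + salt + SECRET).encode("utf-8")
    ).hexdigest()
    return {
        "i": keyword,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": salt,
        "sign": sign,
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_REALTIME",
        "typoResult": "false",
    }

def translate(keyword, cookie="your browser Cookie value"):
    # Same four anti-crawl headers as the Python 2 version
    headers = {
        "Cookie": cookie,
        "Referer": "http://fanyi.youdao.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/67.0.3396.87 Safari/537.36",
    }
    # In Python 3 the encoded form must also be converted to bytes
    data = urllib.parse.urlencode(build_form(keyword)).encode("utf-8")
    req = urllib.request.Request(URL, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    print(translate(input("Please enter a string to translate: ")))
```

Separating build_form from translate keeps the sign logic testable without touching the network.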
A test run of the code produces the translation result.
Crawling Youdao Translate with Python 2