Agricultural University Library: the anti-crawler mechanism on the news announcement page, and a crawler for it
1. Address: http://lib.henau.edu.cn/Default/go?SortID=109
2. Anti-crawler mechanism: the site checks a cookie value. The first request to this address is checked for the cookie. If the expected cookie is missing, the server returns a page whose JS sets the cookie value, and the page then has to be requested again.
This is the document returned by the first request to this page. The cookie-setting code is visible in the JS, whose '|'-separated keyword list reads:
document|href|location|cookie|ant_stream_58b3fe214a7d4|path|3252469838|1496243372
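As a rough illustration of how that keyword list maps to a cookie, here is a minimal Python 3 sketch. The function name `cookie_from_keywords` is hypothetical, and the layout it assumes (the cookie name right after `cookie`, then `path`, then two numbers joined in reverse order with `/`) is inferred only from the sample response above, so it may not hold if the site changes its script.

```python
def cookie_from_keywords(keywords):
    """Build a 'name=value' cookie string from the '|'-separated keyword list.

    Assumed layout (taken from the sample response only): the four fields
    after 'cookie' are <name>, 'path', <b>, <a>, and the cookie is name=a/b.
    """
    fields = keywords.split('|')
    start = fields.index('cookie')  # raises ValueError if 'cookie' is absent
    # Unpacking also raises ValueError if fewer than four fields follow.
    name, _path, b, a = fields[start + 1:start + 5]
    return '%s=%s/%s' % (name, a, b)


sample = ('document|href|location|cookie|ant_stream_58b3fe214a7d4'
          '|path|3252469838|1496243372')
print(cookie_from_keywords(sample))
# prints: ant_stream_58b3fe214a7d4=1496243372/3252469838
```

The resulting string is what would be sent back in the `Cookie` request header on the second request.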
3. Some corresponding Python code

    import urllib2
    import bs4

    # The address from item 1
    url = 'http://lib.henau.edu.cn/Default/go?SortID=109'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.25 Safari/537.36'}
    opener = urllib2.build_opener()
    # First request: obtain the script that sets the cookie
    request = urllib2.Request(url, headers=headers)
    html = opener.open(request)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    scriptCookie = str(soup.find('script'))
    # Cut out the keyword list, from 'cookie' up to the "'.split(" call
    start = scriptCookie.index('cookie')
    end = scriptCookie.index("'.split(")
    strs = scriptCookie[start:end].split('|')
    # Send the cookie with every later request made through this opener
    opener.addheaders.append(('Cookie', '%s=%s/%s' % (strs[1], strs[4], strs[3])))
    # Second request: this time the real page is returned
    html = opener.open(request)
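The code above targets Python 2 (`urllib2`). For readers on Python 3, the same two-request flow might look roughly like the sketch below. The names `js_cookie` and `fetch`, the `open_url` parameter (there so the network call can be swapped out), and the UTF-8 decoding are all assumptions for illustration, not part of the original crawler. Unlike the code above, this sketch only retries when the response actually looks like the cookie-setting page.

```python
import urllib.request


def js_cookie(page_text):
    """Return the cookie the challenge script would set, or None if absent."""
    start = page_text.find('cookie|')
    end = page_text.find("'.split(", start) if start != -1 else -1
    if end == -1:
        return None
    strs = page_text[start:end].split('|')
    if len(strs) < 5:
        return None
    # Same field order as the original code: name = last / second-to-last
    return '%s=%s/%s' % (strs[1], strs[4], strs[3])


def fetch(url, user_agent, open_url=urllib.request.urlopen):
    """Fetch url, answering the cookie challenge at most once."""
    headers = {'User-Agent': user_agent}
    with open_url(urllib.request.Request(url, headers=headers)) as resp:
        # Assumption: the page decodes as UTF-8; undecodable bytes are replaced
        text = resp.read().decode('utf-8', errors='replace')
    cookie = js_cookie(text)
    if cookie is None:
        return text  # no challenge script found: treat this as the real page
    headers['Cookie'] = cookie
    with open_url(urllib.request.Request(url, headers=headers)) as resp:
        return resp.read().decode('utf-8', errors='replace')
```

A call would then be something like `fetch('http://lib.henau.edu.cn/Default/go?SortID=109', 'Mozilla/5.0 ...')`, with the full User-Agent string from the code above.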