How to disguise a Python web crawler and escape anti-crawler programs
Sometimes, crawler code that we have written and that has been running well suddenly reports an error.
The error message is as follows:
HTTP 800 Internal internet error
This happens because the target website has deployed an anti-crawler program: requests sent by the unmodified crawler code are rejected.
The previous normal crawler code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(scrapeUrl)
bsObj = BeautifulSoup(html.read(), "html.parser")
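When the target site rejects such a bare request, urlopen raises urllib.error.HTTPError, which is one way the failure shows up. A minimal sketch of observing the rejection (scrapeUrl is a placeholder here, not a URL from the original article):

import urllib.error
from urllib.request import urlopen

scrapeUrl = "https://example.com/page"  # placeholder target URL

try:
    html = urlopen(scrapeUrl)
except urllib.error.HTTPError as e:
    # A site that blocks crawlers typically answers with an HTTP error status
    print("Request rejected:", e.code, e.reason)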
At this time, we need to disguise our crawler code by adding a header to it, so that the request looks like it came from a browser.
The modified code is as follows:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request(scrapeUrl)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = urllib.request.urlopen(req)
html = response.read()
bsObj = BeautifulSoup(html, "html.parser")
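A single hard-coded User-Agent can still be blocked if you send many requests from it. A common extension, not part of the original article, is to rotate among several browser strings; here is a minimal sketch, where the User-Agent values are only illustrative examples:

import random
import urllib.request

# A few example browser User-Agent strings to rotate through (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
]

def fetch(url):
    # Pick a different disguise for each request
    req = urllib.request.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
    return urllib.request.urlopen(req).read()

Passing a headers dict to Request is equivalent to calling add_header for each entry, so either style works.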
OK, everything is done, and you can continue crawling.
The above is all the content of this article. I hope it will be helpful for your learning.