I recently crawled Amazon and found that the crawler I had written earlier is rather "thin": it copes with static sites and AJAX sites,
but against smarter services such as Amazon it usually starts failing after a little more than 100 consecutive requests.
I looked into the failures and found that Amazon checks the requesting IP: once one address has sent too many requests, it redirects to a page that tests whether a program is doing the browsing,
that is, a page asking for a verification code (CAPTCHA). Enter the correct code and you can carry on browsing. Getting past the verification code automatically is a lot of trouble, and I decided that fighting it head-on is just making life hard for yourself ...
So I Googled around and tidied up a couple of better solutions.
1. Redial ADSL
As we all know, redialing an ADSL connection normally gets you a new IP address. So you can write a script that redials ADSL on a timer, or have the crawler watch for the redirect to the verification code page and call the ADSL redial script as soon as it sees one.
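The second variant (redial only when the verification page shows up) can be sketched roughly as below. This is a minimal sketch, not the code from my crawler: the marker strings in CAPTCHA_MARKERS, the commands in REDIAL_COMMANDS and the function names are all assumptions, so swap in whatever the blocked page really contains and whatever actually redials your own connection.

```python
import subprocess
import time

# Assumption: text that identifies the verification page; adjust to what
# the blocked response really contains.
CAPTCHA_MARKERS = ("captcha", "Enter the characters you see")

# Assumption: placeholder commands that drop and re-establish the ADSL
# link; replace them with whatever redials your own connection.
REDIAL_COMMANDS = (["pppoe-stop"], ["pppoe-start"])


def looks_like_captcha(html):
    """Return True if the page contains any known verification-page marker."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in CAPTCHA_MARKERS)


def redial_adsl(commands=REDIAL_COMMANDS, wait_seconds=5):
    """Run the redial commands in order, then wait for the link to settle."""
    for command in commands:
        subprocess.run(command, check=True)
    time.sleep(wait_seconds)


def fetch_with_redial(fetch, url, redial=redial_adsl, max_redials=3):
    """Fetch url; when a verification page comes back, redial and retry."""
    for attempt in range(max_redials + 1):
        html = fetch(url)
        if not looks_like_captcha(html):
            return html
        if attempt < max_redials:
            redial()
    raise RuntimeError("still blocked after %d redials" % max_redials)
```

Here fetch is whatever function your crawler already uses to download a page and return its HTML as a string. Passing it in (and the redial function too) keeps the retry logic separate from the network code, so you can try it out with fake pages first.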
2. Crawl proxy server addresses
Proxy servers are another good way around a blocked IP. I'm sure you all have a favorite proxy server site already ~
The domain names of proxy server sites change often, so I won't list any here. Google them yourself, and Baidu may turn up a surprise too ~
Write a regular expression to crawl the proxy server addresses, and once the crawl is finished be sure to check whether each one actually works!
For the code, you can refer to the checkproxy() function in the following blog post:
https://blog.linuxeye.com/410.html
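If you just want the rough shape of it, here is a minimal sketch using only the standard library. It is not the checkproxy() function from the blog above: the names extract_proxies and is_proxy_alive, the test URL and the timeout are my own assumptions, and the regular expression assumes the proxy list page prints addresses as plain ip:port text.

```python
import http.client
import re
import urllib.request

# Assumption: the proxy list page shows addresses as plain "ip:port" text.
PROXY_PATTERN = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{2,5})\b")


def extract_proxies(html):
    """Return unique 'ip:port' strings found in html, in order of appearance."""
    found = []
    for ip, port in PROXY_PATTERN.findall(html):
        candidate = "%s:%s" % (ip, port)
        if candidate not in found:
            found.append(candidate)
    return found


def is_proxy_alive(proxy, test_url="http://example.com/", timeout=5):
    """Return True if test_url can be fetched through proxy within timeout."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed reply and so on.
        return False
```

Typical use would be something like good = [p for p in extract_proxies(page_html) if is_proxy_alive(p)], where page_html is the proxy list page you downloaded. Free proxies die quickly, so it is probably worth re-checking the list every so often rather than only once.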
Want to see my Amazon crawler?
The code is on GitHub, in the spider-comments project, under Amazon-spider-comments:
https://github.com/fankcoder/spider-comments