Python prevents IP from being blocked when crawling large amounts of data

Source: Internet
Author: User
Tags: xpath

Continuing the old routine: over the last couple of days I crawled some data from the ZhuBaJie site (zbj.com). The URLs look like http://task.zbj.com/t-ppsj/p1s5.html. Probably because I crawled a bit too much data, my IP got blocked and I had to pass a manual verification, which obviously stopped me from crawling any more.

Here is the crawler code I originally wrote, the version that got my IP blocked:

# coding=utf-8
import requests
from lxml import etree


def getUrl():
    for i in range(10):  # placeholder page count; crawl as many listing pages as you need
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i + 1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    htmlText = requests.get(url).text
    selector = etree.HTML(htmlText)
    tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
    try:
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            price = price[0] if len(price) > 0 else ''  # Python ternary: value-if-true if condition else value-if-false
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except:
        print 'ERROR'


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'ERROR'


if __name__ == '__main__':
    getUrl()
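A side note on that script: the repeated pattern of taking element zero when the xpath result list is non-empty, and an empty string otherwise, can be folded into one small helper. This is a sketch of my own, not part of the original code, and first_or_empty is just a name I picked:

def first_or_empty(nodes):
    # lxml's xpath() returns a list; take the first item or fall back to an empty string
    return nodes[0] if nodes else ''

# usage inside spiderPage, where td is one row of the results table:
# price = first_or_empty(td.xpath('./td/p/em/text()'))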

I found that after the code ran, the last few pages of data had not been crawled, and I could no longer open the ZhuBaJie site at all; only after waiting a while could I visit it again. Quite embarrassing. I had to find a way to keep my IP from being blocked.

So how do you keep your IP from being blocked while crawling data? Here are a few routines I picked up.

1. Modify the request header

The crawler code above did not send any request headers. Here I add a User-Agent header so the request looks like it comes from a real browser visiting the site:

        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers).text
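Going a step further, instead of always sending the same User-Agent you can rotate through several of them. A minimal sketch of my own, assuming you maintain a small list of UA strings (the example strings below are variations on the ones already used in this post):

import random
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
]


def fetch(url):
    # pick a random User-Agent for each request so the traffic looks less uniform
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=3).text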
2. Using Proxy IP

When your own IP has been blocked by the site, the only way to keep crawling is through proxy IPs: use a proxy for every request, and when one proxy gets blocked, switch to another.

Here I borrow a piece of code from this blog post to generate proxy IP addresses: http://blog.csdn.net/lammonpeter/article/details/52917264

It generates proxy IPs from a free proxy list, and you can use the code directly:

# coding=utf-8
# IP addresses come from the domestic high-anonymity proxy site: http://www.xicidaili.com/nn/
# Crawling just the first page of IPs is enough for normal use
from bs4 import BeautifulSoup
import requests
import random


def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list


def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies


if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    ip_list = get_ip_list(url, headers=headers)
    proxies = get_random_ip(ip_list)
    print(proxies)

OK, with the code above I can generate a batch of proxy IP addresses (some of them may be invalid, but as long as my own IP doesn't get blocked, I'm happy). Then I can use those addresses in my requests.

Add a proxy IP to our request

        proxies = {
            'http': 'http://124.72.109.183:8118',
            # 'http': 'http://49.85.1.79:31666',  # a dict can hold only one 'http' key, so keep a single proxy per scheme
        }
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers, timeout=3, proxies=proxies).text
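Tying the two pieces together: each request can pull a fresh proxy from the get_ip_list / get_random_ip functions above and fall back to another proxy when one fails. This is a minimal sketch under that assumption; fetch_with_proxy and max_retries are names I made up for illustration:

import requests


def fetch_with_proxy(url, ip_list, headers, max_retries=3):
    # try a few random proxies; a dead or blocked proxy just raises and we move on to the next one
    for _ in range(max_retries):
        proxies = get_random_ip(ip_list)  # defined in the proxy-list script above
        try:
            return requests.get(url, headers=headers, timeout=3, proxies=proxies).text
        except requests.RequestException:
            continue
    return None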

Putting the request header and the proxy IP together, the final complete code looks like this:

# coding=utf-8
import requests
import time
from lxml import etree


def getUrl():
    for i in range(10):  # placeholder page count; crawl as many listing pages as you need
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i + 1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    try:
        proxies = {
            'http': 'http://221.202.248.52:80',
        }
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers, proxies=proxies).text
        selector = etree.HTML(htmlText)
        tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            price = price[0] if len(price) > 0 else ''  # Python ternary: value-if-true if condition else value-if-false
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except Exception, e:
        print 'error', e.message


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'ERROR'


if __name__ == '__main__':
    getUrl()

Finally, the program runs through to the end, and I no longer see my IP getting blocked. Of course, there is more to preventing IP bans than just this; it needs further exploration!
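One more routine worth sketching, beyond headers and proxies: slow down. Firing requests as fast as possible is exactly what gets an IP noticed, so a short randomized pause between listing pages helps. This is my own minimal sketch, not part of the script above:

import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    # sleep a random amount between pages so requests do not arrive at a perfectly regular rate
    time.sleep(random.uniform(min_seconds, max_seconds))

# usage: call polite_pause() after each spiderPage(url) call inside getUrl()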

Finally

Of course, I have crawled the data, but the workflow is still not complete: I should write the results to an Excel file or a database so they are easier to use later. So next I am going to look at
working with Excel, MySQL, and MongoDB from Python.
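As a preview of that, saving each record as it is scraped only takes a few lines. Here is a minimal sketch of my own using a plain CSV file as a stand-in (output.csv is just a name I picked); Excel, MySQL, or MongoDB would follow the same pattern with their own libraries:

import csv


def save_row(row, path='output.csv'):
    # append one scraped record (price, title, href, subTitle, deadline) to a CSV file
    with open(path, 'a') as f:
        csv.writer(f).writerow(row)

# usage inside spiderPage:
# save_row([price, title, href, subTitle, deadline])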
