Continuing the usual routine: over the past two days I have been crawling some data from ZBJ.com (猪八戒网). The URLs look like this:
http://task.zbj.com/t-ppsj/p1s5.html
Probably because I crawled a bit too much data, my IP got blocked and the site started demanding manual verification, which obviously kept me from crawling any more data.
Here is the crawler code that got my IP blocked by ZBJ:
# coding=utf-8
import requests
from lxml import etree


def getUrl():
    for i in range(10):  # how many listing pages to crawl; adjust as needed
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i + 1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    htmlText = requests.get(url).text
    selector = etree.HTML(htmlText)
    tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
    try:
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            # Python ternary expression: take the first match if the list is non-empty, otherwise ''
            price = price[0] if len(price) > 0 else ''
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except:
        print 'ERROR'


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'ERROR'


if __name__ == '__main__':
    getUrl()
After the code ran I found that the last few pages of data were never crawled, and I could no longer visit the ZBJ site at all; only after waiting a while could I open their website again. Quite embarrassing. I had to find a way to keep my IP from being blocked.
When a site has blocked your IP, how do you keep crawling its data? There are a few tricks for this, and here is what I picked up.
1. Modify the request header
The earlier crawler code did not set any request headers. Here I add a User-Agent header so the request pretends to be a browser visiting the site:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
headers = {'User-Agent': user_agent}
htmlText = requests.get(url, headers=headers).text
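Going one step further, rather than hard-coding a single User-Agent you can rotate through a small pool of browser strings so that consecutive requests look less uniform. This is only a minimal sketch of that idea; the get_html helper and the strings in the pool are my own illustrative choices, not something from the original crawler:

# coding=utf-8
import random
import requests

# A small pool of browser User-Agent strings (illustrative examples only)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
]

def get_html(url):
    # Pick a different User-Agent for every request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10).text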
2. Use proxy IPs
Once your own IP has been blocked by the site, the only way to keep crawling is through proxy IPs. So use a proxy IP for every request; if one proxy gets blocked, there are always more proxies to switch to.
Here I borrow a piece of code from this blog post to generate proxy IP addresses: http://blog.csdn.net/lammonpeter/article/details/52917264
You can take this code as-is to generate a list of proxy IPs:
# coding=utf-8
# Proxy IPs are scraped from the domestic high-anonymity proxy site: http://www.xicidaili.com/nn/
# Crawling just the first page of IPs is enough for ordinary use
from bs4 import BeautifulSoup
import requests
import random


def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list


def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies


if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    ip_list = get_ip_list(url, headers=headers)
    proxies = get_random_ip(ip_list)
    print(proxies)
Okay, with the code above I can generate a batch of proxy IPs (some of them may be invalid, but as long as my own IP does not get blocked, I am happy). Then I can pass those proxies along with my requests.
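Since some of the generated proxies will inevitably be dead, it can also help to run a quick liveness check and keep only the ones that can actually fetch a test page. This is only a rough sketch of such a filter; the filter_alive_proxies helper, the test URL and the timeout are my own illustrative choices:

# coding=utf-8
import requests

def filter_alive_proxies(ip_list, test_url='http://www.baidu.com', timeout=3):
    # Keep only the 'ip:port' entries that can actually fetch a test page
    alive = []
    for ip in ip_list:
        proxies = {'http': 'http://' + ip}
        try:
            requests.get(test_url, proxies=proxies, timeout=timeout)
            alive.append(ip)
        except requests.exceptions.RequestException:
            pass  # dead or too slow, skip it
    return alive

Passing the list returned by get_ip_list through this filter before calling get_random_ip keeps the crawler from stalling on proxies that no longer work.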
Add a proxy IP to our request
proxies = {
    'http': 'http://124.72.109.183:8118',
    'http': 'http://49.85.1.79:31666',  # note: a dict cannot hold two 'http' keys, so only the last proxy actually takes effect
}
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
headers = {'User-Agent': user_agent}
htmlText = requests.get(url, headers=headers, timeout=3, proxies=proxies).text
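Because any single proxy can stop working (or get blocked itself) at any moment, it is also worth wrapping the request in a small retry loop that draws a fresh proxy on every failure. This is a rough sketch that builds on the get_random_ip function from the snippet above; the fetch_with_proxies helper and the retry count of 3 are my own illustrative choices:

# coding=utf-8
import requests

def fetch_with_proxies(url, headers, ip_list, retries=3):
    # Try up to `retries` different proxies before giving up
    for _ in range(retries):
        proxies = get_random_ip(ip_list)  # draw a random proxy from the pool
        try:
            return requests.get(url, headers=headers, proxies=proxies, timeout=3).text
        except requests.exceptions.RequestException:
            continue  # this proxy failed, try the next one
    return None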
Putting what we now know together, the final complete code is as follows:
# coding=utf-8
import requests
import time
from lxml import etree


def getUrl():
    for i in range(10):  # how many listing pages to crawl; adjust as needed
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i + 1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    try:
        proxies = {
            'http': 'http://221.202.248.52:80',
        }
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers, proxies=proxies).text
        selector = etree.HTML(htmlText)
        tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            # Python ternary expression: take the first match if the list is non-empty, otherwise ''
            price = price[0] if len(price) > 0 else ''
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except Exception, e:
        print 'Error', e.message


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'ERROR'


if __name__ == '__main__':
    getUrl()
In the end the program ran smoothly and my IP was not blocked again. Of course, there is more to preventing IP bans than just this; it needs further exploration!
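One obvious direction for that further exploration is simply slowing the crawler down: even with headers and proxies, firing requests as fast as possible is what gets an IP flagged in the first place. Here is a minimal sketch of adding a random pause between page requests; the polite_sleep helper and the 1 to 3 second range are just my own guess, not something the code above uses:

# coding=utf-8
import time
import random

def polite_sleep(min_seconds=1, max_seconds=3):
    # Pause for a random interval so the traffic looks less like a
    # machine hammering the site at full speed
    time.sleep(random.uniform(min_seconds, max_seconds))

Calling polite_sleep() after each spiderPage(url) call in getUrl() would spread the requests out over time.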
Finally
Of course I did crawl the data, but the way I handle it is far from perfect: I should be writing it into an Excel file or a database so it is actually convenient to use. So next I am going to look at
working with Excel, MySQL and MongoDB from Python.
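For example, a minimal way to persist the crawled fields would be to append each record to a CSV file, which Excel can open directly. This is only a sketch of the idea, not the storage code I will end up writing; the save_row helper and the file name are my own:

# coding=utf-8
import csv

def save_row(row, filename='zbj_tasks.csv'):
    # Append one crawled record (price, title, href, subTitle, deadline) to a CSV
    # file that Excel can open directly. Fields are encoded as UTF-8 because
    # Python 2's csv module does not handle unicode on its own.
    with open(filename, 'ab') as f:
        writer = csv.writer(f)
        writer.writerow([field.encode('utf-8') for field in row])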