Scrapy Crawls Beauty Pictures, Part 3: Proxy IPs (Part 1) (original)


First of all, sorry to have kept you waiting. I originally meant to post this update on 520, but then I thought that only a single dog like me would still be doing research that day, and you probably wouldn't be in the mood to read a new article, so I put it off until today. Over the day and a half of 521 and 522 I added the database and fixed some bugs (now someone will definitely say that really proves I'm a single dog).

Enough rambling, let's get to today's topic. In the previous two "Scrapy crawls beauty pictures" articles, we explained how to use Scrapy. Recently, though, some enthusiastic readers told me that the earlier program can no longer crawl any pictures; my guess is that Jandan (the "fried egg" site) has added an anti-crawler mechanism. So today's topic is the way to break through that anti-crawler mechanism: proxy IPs.

One common anti-crawler technique on many websites (there are other checks too, of course) is to detect repeated requests from the same IP and use that to decide whether the visitor is a crawler or a human. Using proxy IPs breaks this kind of blockade. As a poor student, I have no money to buy a VPN or an IP pool, so we will use free proxy IPs from the web, which are basically enough for personal use. Next we'll talk about crawling free proxy IPs and verifying that they actually work.

There are many proxy IP sites online; this time I chose http://www.xicidaili.com/nn/. You can try other sites as well; ideally we want to build up a large proxy IP pool.

Did you notice the words "high anonymity"? A high-anonymity proxy means the target server does not know you are using a proxy and does not know your real IP either, so the concealment is very high.

Seriously.

Following our earlier crawler lessons, we use Firebug's Inspect Element to see how the HTML should be parsed.

It is really just a table; we parse each row, which is very simple: BeautifulSoup handles it easily.

Note also that the page number of the IP list corresponds to a parameter in the URL; for example, the first page is http://www.xicidaili.com/nn/1. This saves us the trouble of simulating page turns.
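For example, here is a minimal sketch (illustrative, not the author's code) of turning a crawl range such as pages 1 to 4 into the list of URLs to fetch:

def page_urls(start, end, base="http://www.xicidaili.com/nn/"):
    # pages 1..4 simply become .../nn/1 through .../nn/4
    return [base + str(page) for page in range(start, end + 1)]

print(page_urls(1, 4))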

Here is the structure of the program:

db_helper in the DB package: implements MongoDB CRUD (insert, delete, update, query) operations; a minimal sketch of such a helper follows this list

detect_proxy in the Detect package: verifies the availability of the proxy IPs

proxy_info in the Entity package: the entity object holding a proxy's information

Spider package:

spiderman: implements the crawler logic

html_downloader: implements the crawler's HTML downloader

html_parser: implements the crawler's HTML parser

Test package: test samples; not involved when the program runs

main.py: defines the command-line parameters
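Since the article does not show db_helper itself, here is a minimal sketch, assuming pymongo, of what such a MongoDB helper might look like. The class name and the "proxy"/"proxys" database and collection names are my own guesses; only the insert()/delete() methods and the proxys attribute mirror how the rest of the code calls the helper.

from pymongo import MongoClient

class ProxyDbHelper(object):
    # hypothetical helper; mirrors the insert()/delete() calls and the
    # .proxys collection used by the parser and detector shown below
    def __init__(self, host="localhost", port=27017):
        client = MongoClient(host, port)
        self.proxys = client["proxy"]["proxys"]

    def insert(self, query, doc):
        # upsert so the same ip:port pair is not stored twice
        self.proxys.update_one(query, {"$set": doc}, upsert=True)

    def delete(self, query):
        self.proxys.delete_many(query)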

A word about detection as well: I use http://ip.chinaz.com/getip.aspx as the detection URL. As long as the request through the proxy does not time out and the response code is 200, we consider the proxy good.
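Here is a minimal sketch of that rule (illustrative, not the author's exact code; the 5-second timeout is my assumption, the article does not give a value), written in the same Python 2 style as the rest of the code:

import socket
import urllib

socket.setdefaulttimeout(5)  # assumed timeout; a proxy that exceeds it raises and is treated as bad

def is_good_proxy(ip, port, test_url="http://ip.chinaz.com/getip.aspx"):
    proxy_host = "http://%s:%s" % (ip, port)
    try:
        # route the request through the proxy and check the status code
        response = urllib.urlopen(test_url, proxies={"http": proxy_host})
        return response.getcode() == 200
    except Exception:
        return False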

Next, run the program to see the effect.

On Windows, switch to the project directory and run python main.py -h; you will see the usage instructions and the parameters I defined.

Then run python main.py -c 1 4 (meaning: crawl proxy IPs from pages 1 through 4):

Finally, if you want to verify that the IPs actually work, run python main.py -t db (a sketch of how these options might be defined is given below).
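The article does not print main.py, but here is a minimal sketch, assuming argparse, of how options like these might be defined; the flag names and help text are inferred from the commands above.

import argparse

def build_arg_parser():
    parser = argparse.ArgumentParser(description="free proxy IP spider")
    parser.add_argument("-c", "--crawl", nargs=2, type=int, metavar=("START", "END"),
                        help="crawl proxy IPs from page START to page END")
    parser.add_argument("-t", "--test", choices=["db"],
                        help="verify the proxies already stored in the database")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    # python main.py -c 1 4  ->  args.crawl == [1, 4]
    # python main.py -t db   ->  args.test == "db"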

It seems the number of usable IPs is fairly small, but it is enough for personal use.

Take a look at the MongoDB database:

We can use these IPs the next time we crawl the pictures.
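Looking ahead to that, here is a minimal sketch (not the author's code; the class, database, and setting names are illustrative) of plugging the pool into Scrapy as a downloader middleware that attaches a random proxy to each request:

import random
from pymongo import MongoClient

class RandomProxyMiddleware(object):
    def __init__(self):
        collection = MongoClient("localhost", 27017)["proxy"]["proxys"]
        self.proxies = ["http://%s:%s" % (p["ip"], p["port"]) for p in collection.find()]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)

It would be enabled in settings.py through DOWNLOADER_MIDDLEWARES, e.g. {"myproject.middlewares.RandomProxyMiddleware": 543}.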

Below is the core code for parsing and verification. First, the parsing part:

def parser(self, html_cont):
    '''
    Parse the proxy table out of a downloaded list page.
    :param html_cont: raw HTML of a xicidaili list page
    :return:
    '''
    if html_cont is None:
        return
    # Parse the HTML with BeautifulSoup (from bs4 import BeautifulSoup)
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    tr_nodes = soup.find_all('tr', class_=True)
    for tr_node in tr_nodes:
        proxy = proxy_infor()
        i = 0
        for th in tr_node.children:
            # keep only non-empty cells; the first two are ip and port
            if th.string is not None and len(th.string.strip()) > 0:
                proxy.proxy[proxy.proxyName[i]] = th.string.strip()
                print 'proxy', th.string.strip()
                i += 1
                if i > 1:
                    break
        self.db_helper.insert({proxy.proxyName[0]: proxy.proxy[proxy.proxyName[0]],
                               proxy.proxyName[1]: proxy.proxy[proxy.proxyName[1]]},
                              proxy.proxy)

And the core of the verification part:

def detect(self):
    '''
    Use http://ip.chinaz.com/getip.aspx as the detection target.
    :return:
    '''
    proxys = self.db_helper.proxys.find()
    badNum = 0
    goodNum = 0
    for proxy in proxys:
        ip = proxy['ip']
        port = proxy['port']
        try:
            proxy_host = "http://" + ip + ':' + port
            # fetch the detection URL through the proxy (Python 2 urllib)
            response = urllib.urlopen(self.url, proxies={"http": proxy_host})
            if response.getcode() != 200:
                # wrong status code: drop the proxy from MongoDB
                self.db_helper.delete({'ip': ip, 'port': port})
                badNum += 1
                print proxy_host, 'bad proxy'
            else:
                goodNum += 1
                print proxy_host, 'success proxy'
        except Exception, e:
            # timeout or connection error: drop the proxy as well
            print proxy_host, 'bad proxy'
            self.db_helper.delete({'ip': ip, 'port': port})
            badNum += 1
            continue
    print 'success proxy num : ', goodNum
    print 'bad proxy num : ', badNum

That's it for today's share; if you found it useful, remember to support me. The code has been uploaded to GitHub: https://github.com/qiyeboy/proxySpider_normal

You are also welcome to support me by following my WeChat public account.

This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source when reprinting: Seven Night's Story, http://www.cnblogs.com/qiyeboy/

