Scrapy Crawls Beauty Pictures, Part 3: Proxy IPs (Part 1) (original)


First of all, sorry to have kept you waiting. I originally meant to post this update on 520, but then I thought that only a single dog like me would still be doing research that day, and you probably wouldn't be in the mood to read a new article, so I put it off until today. Over the day and a half of 521 and 522 I added the database and fixed some bugs (now someone will definitely say that really proves I'm a single dog).

Enough rambling, let's get to today's topic. In the previous two "Scrapy crawls beauty pictures" articles, we explained how to use Scrapy. Recently, though, some enthusiastic readers told me that the earlier program can no longer crawl any pictures; my guess is that Jandan (the "fried egg" site) has added an anti-crawler mechanism. So today's topic is the way to break through that anti-crawler mechanism: proxy IPs.

One common anti-crawler technique on many websites (there are other checks too, of course) is to detect repeated requests from the same IP and use that to decide whether the visitor is a crawler or a human. Using proxy IPs breaks this kind of blockade. As a poor student, I have no money to buy a VPN or an IP pool, so we will use free proxy IPs from the web, which are basically enough for personal use. Next we'll talk about crawling free proxy IPs and verifying that they actually work.

There are many proxy IP sites online; this time I chose http://www.xicidaili.com/nn/. You can try other sites as well; ideally we want to build up a large proxy IP pool.

Did you notice the words "high anonymity"? A high-anonymity proxy means the target server does not know you are using a proxy and does not know your real IP either, so the concealment is very high.

Seriously.

Following our earlier crawler lessons, we use Firebug's Inspect Element to see how the HTML should be parsed.

It is really just a table; we parse each row, which is very simple: BeautifulSoup handles it easily.

Note also that the page number of the IP list corresponds to a parameter in the URL; for example, the first page is http://www.xicidaili.com/nn/1. This saves us the trouble of simulating page turns.
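For example, here is a minimal sketch (illustrative, not the author's code) of turning a crawl range such as pages 1 to 4 into the list of URLs to fetch:

def page_urls(start, end, base="http://www.xicidaili.com/nn/"):
    # pages 1..4 simply become .../nn/1 through .../nn/4
    return [base + str(page) for page in range(start, end + 1)]

print(page_urls(1, 4))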

Here is the structure of the program:

db_helper in the DB package: implements MongoDB CRUD (insert, delete, update, query) operations; a minimal sketch of such a helper follows this list

detect_proxy in the Detect package: verifies the availability of the proxy IPs

proxy_info in the Entity package: the entity object holding a proxy's information

Spider package:

spiderman: implements the crawler logic

html_downloader: implements the crawler's HTML downloader

html_parser: implements the crawler's HTML parser

Test package: test samples; not involved when the program runs

main.py: defines the command-line parameters
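Since the article does not show db_helper itself, here is a minimal sketch, assuming pymongo, of what such a MongoDB helper might look like. The class name and the "proxy"/"proxys" database and collection names are my own guesses; only the insert()/delete() methods and the proxys attribute mirror how the rest of the code calls the helper.

from pymongo import MongoClient

class ProxyDbHelper(object):
    # hypothetical helper; mirrors the insert()/delete() calls and the
    # .proxys collection used by the parser and detector shown below
    def __init__(self, host="localhost", port=27017):
        client = MongoClient(host, port)
        self.proxys = client["proxy"]["proxys"]

    def insert(self, query, doc):
        # upsert so the same ip:port pair is not stored twice
        self.proxys.update_one(query, {"$set": doc}, upsert=True)

    def delete(self, query):
        self.proxys.delete_many(query)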

A word about detection as well: I use http://ip.chinaz.com/getip.aspx as the detection URL. As long as the request through the proxy does not time out and the response code is 200, we consider the proxy good.
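Here is a minimal sketch of that rule (illustrative, not the author's exact code; the 5-second timeout is my assumption, the article does not give a value), written in the same Python 2 style as the rest of the code:

import socket
import urllib

socket.setdefaulttimeout(5)  # assumed timeout; a proxy that exceeds it raises and is treated as bad

def is_good_proxy(ip, port, test_url="http://ip.chinaz.com/getip.aspx"):
    proxy_host = "http://%s:%s" % (ip, port)
    try:
        # route the request through the proxy and check the status code
        response = urllib.urlopen(test_url, proxies={"http": proxy_host})
        return response.getcode() == 200
    except Exception:
        return False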

Next, run the program to see the effect.

On Windows, switch to the project directory and run python main.py -h; you will see the usage instructions and the parameters I defined.

Then run python main.py -c 1 4 (meaning: crawl proxy IPs from pages 1 through 4):

Finally, if you want to verify that the IPs actually work, run python main.py -t db (a sketch of how these options might be defined is given below).
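The article does not print main.py, but here is a minimal sketch, assuming argparse, of how options like these might be defined; the flag names and help text are inferred from the commands above.

import argparse

def build_arg_parser():
    parser = argparse.ArgumentParser(description="free proxy IP spider")
    parser.add_argument("-c", "--crawl", nargs=2, type=int, metavar=("START", "END"),
                        help="crawl proxy IPs from page START to page END")
    parser.add_argument("-t", "--test", choices=["db"],
                        help="verify the proxies already stored in the database")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    # python main.py -c 1 4  ->  args.crawl == [1, 4]
    # python main.py -t db   ->  args.test == "db"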

It seems the number of usable IPs is fairly small, but it is enough for personal use.

Take a look at the MongoDB database:

We can use these IPs the next time we crawl the pictures.
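Looking ahead to that, here is a minimal sketch (not the author's code; the class, database, and setting names are illustrative) of plugging the pool into Scrapy as a downloader middleware that attaches a random proxy to each request:

import random
from pymongo import MongoClient

class RandomProxyMiddleware(object):
    def __init__(self):
        collection = MongoClient("localhost", 27017)["proxy"]["proxys"]
        self.proxies = ["http://%s:%s" % (p["ip"], p["port"]) for p in collection.find()]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)

It would be enabled in settings.py through DOWNLOADER_MIDDLEWARES, e.g. {"myproject.middlewares.RandomProxyMiddleware": 543}.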

Below is the core code for parsing and verification. First, the parsing part:

def parser(self, html_cont):
    '''
    Parse the proxy table out of a downloaded list page.
    :param html_cont: raw HTML of a xicidaili list page
    :return:
    '''
    if html_cont is None:
        return
    # Parse the HTML with BeautifulSoup (from bs4 import BeautifulSoup)
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    tr_nodes = soup.find_all('tr', class_=True)
    for tr_node in tr_nodes:
        proxy = proxy_infor()
        i = 0
        for th in tr_node.children:
            # keep only non-empty cells; the first two are ip and port
            if th.string is not None and len(th.string.strip()) > 0:
                proxy.proxy[proxy.proxyName[i]] = th.string.strip()
                print 'proxy', th.string.strip()
                i += 1
                if i > 1:
                    break
        self.db_helper.insert({proxy.proxyName[0]: proxy.proxy[proxy.proxyName[0]],
                               proxy.proxyName[1]: proxy.proxy[proxy.proxyName[1]]},
                              proxy.proxy)

And the core of the verification part:

def detect(self):
    '''
    Use http://ip.chinaz.com/getip.aspx as the detection target.
    :return:
    '''
    proxys = self.db_helper.proxys.find()
    badNum = 0
    goodNum = 0
    for proxy in proxys:
        ip = proxy['ip']
        port = proxy['port']
        try:
            proxy_host = "http://" + ip + ':' + port
            # fetch the detection URL through the proxy (Python 2 urllib)
            response = urllib.urlopen(self.url, proxies={"http": proxy_host})
            if response.getcode() != 200:
                # wrong status code: drop the proxy from MongoDB
                self.db_helper.delete({'ip': ip, 'port': port})
                badNum += 1
                print proxy_host, 'bad proxy'
            else:
                goodNum += 1
                print proxy_host, 'success proxy'
        except Exception, e:
            # timeout or connection error: drop the proxy as well
            print proxy_host, 'bad proxy'
            self.db_helper.delete({'ip': ip, 'port': port})
            badNum += 1
            continue
    print 'success proxy num : ', goodNum
    print 'bad proxy num : ', badNum

That's it for today's share; if you found it useful, remember to support me. The code has been uploaded to GitHub: https://github.com/qiyeboy/proxySpider_normal

You are also welcome to support me by following my WeChat public account.

This article is an original work; everyone is welcome to reprint and share it. Please respect the original and credit the source when reprinting: Seven Night's Story, http://www.cnblogs.com/qiyeboy/

