Scrapy Crawls Beauty Pictures, Part 3: Proxy IPs (Part 1) (Original)


First of all, sorry for keeping you waiting. I originally planned to post this update on May 20th (520), but on second thought, only a single dog like me would still be doing research that day, and you probably wouldn't be in the mood to read a new article, so I pushed it back to today. I haven't been idle, though: over the day and a half of the 21st and 22nd I added the database and fixed some bugs (now someone will say that really proves I'm a single dog).

Enough chit-chat; on to today's topic. The previous two articles on crawling beauty pictures with Scrapy explained how to use Scrapy itself. Recently, however, some enthusiastic readers told me that the old program can no longer crawl the pictures. My guess is that Jiandan (the "fried egg" site) has added an anti-crawler mechanism, so today's topic is breaking that anti-crawler mechanism with proxy IPs.

One common anti-crawler practice on many websites (there are others, of course) is to detect repetitive requests from a single IP and use that to decide whether the visitor is a crawler or a human. Using proxy IPs can break this blockade. As a poor student I have no money to buy a VPN or an IP pool, so we will use free proxy IPs from the web, which are basically enough for personal use. Next, we will cover crawling free IPs and verifying that the proxies actually work.
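
To make the idea concrete, here is a minimal sketch of routing a request through a proxy with Python 2's urllib (matching the code later in this post); the proxy address is a made-up example, not a real working proxy:

import urllib

# the address below is only an illustration; substitute a free proxy you crawled
proxy_host = 'http://123.56.78.90:8080'
response = urllib.urlopen('http://ip.chinaz.com/getip.aspx', proxies={'http': proxy_host})
print response.getcode()  # the target site sees the proxy's IP, not yours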

There are plenty of proxy IP sites online. This time I chose http://www.xicidaili.com/nn/; you can try other sites as well, and we should try to build up a large proxy IP pool.

Did you notice the words "high anonymity"? A high-anonymity proxy means the target server does not know you are using a proxy, let alone your real IP, so the concealment is very high.

Seriously.

Following our earlier crawler lessons, use Firebug to inspect the elements and see how to parse the HTML.

  

  

The page is really just a table; parsing each row of it is very simple, and BeautifulSoup handles it easily.

Note also that the page number of the IP table corresponds to a parameter in the URL. For example, page 1 is http://www.xicidaili.com/nn/1. That saves us the trouble of writing page-turning logic.
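
For example, crawling pages 1 through 4 just means generating four URLs up front (a trivial sketch, not the repository's exact code):

start_page, end_page = 1, 4
urls = ['http://www.xicidaili.com/nn/%d' % page
        for page in range(start_page, end_page + 1)]
# -> .../nn/1, .../nn/2, .../nn/3, .../nn/4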

 Here is the structure of the program:

  

db package, db_helper: implements the MongoDB create/read/update/delete operations (a rough pymongo sketch follows this list).

detect package, detect_proxy: verifies the availability of the proxy IPs.

entity package, proxy_info: the object that holds a proxy's information.

spider package:

spiderman implements the crawler's logic.

html_downloader implements the crawler's HTML downloader.

html_parser implements the crawler's HTML parser.

test package: sample tests, not involved in running the program.

main.py: implements the command-line argument definitions.
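
The full db_helper is in the GitHub repository; as a rough idea of what the MongoDB insert/delete calls look like with pymongo, here is a sketch (the database name, collection name, and method bodies are my assumptions, not the repository's exact code):

from pymongo import MongoClient

class DbHelper(object):
    def __init__(self):
        client = MongoClient('localhost', 27017)
        # assumed database/collection names
        self.proxys = client['proxy_db']['proxys']

    def insert(self, query, doc):
        # upsert so the same IP/Port pair is stored only once
        self.proxys.update(query, doc, upsert=True)

    def delete(self, query):
        self.proxys.remove(query)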

A word about detection: I use http://ip.chinaz.com/getip.aspx as the detection URL. As long as access through the proxy does not time out and the response code is 200, we consider the proxy usable.

Next run the program to see the effect:

Switch to the project directory under Windows and run python main.py -h to see the usage instructions and the parameters I defined.
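
Roughly, the command-line handling in main.py could look like the sketch below (the -c and -t flags come from the commands shown in this post; everything else, including the hypothetical calls into the spider and detect packages, is my assumption):

import argparse

parser = argparse.ArgumentParser(description='crawl and verify free proxy IPs')
parser.add_argument('-c', '--crawl', nargs=2, type=int, metavar=('START', 'END'),
                    help='crawl proxy pages START through END')
parser.add_argument('-t', '--test', choices=['db'],
                    help='verify the proxies stored in MongoDB')
args = parser.parse_args()

if args.crawl:
    start_page, end_page = args.crawl
    # spiderman.crawl(start_page, end_page)   # hypothetical call into the spider package
elif args.test == 'db':
    pass  # detect_proxy.detect()             # hypothetical call into the detect package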

Then run python main.py -c 1 4 (meaning: crawl the IP addresses on pages 1 to 4):

If you then want to verify the correctness of the stored IPs, run python main.py -t db:

It seems the number of usable IPs is relatively small, but it is enough for personal use.

Take a look at the MongoDB database:

We can use these IPs the next time we crawl the pictures.
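
As a preview of how they might be wired in: Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy'], so a small downloader middleware can pick a random stored proxy for each request. The middleware below is only my sketch, not the code from this series; the database and collection names are assumptions, and the field names follow the detect code further down:

import random
from pymongo import MongoClient

class RandomProxyMiddleware(object):
    # downloader middleware that picks a random proxy from MongoDB

    def __init__(self):
        collection = MongoClient('localhost', 27017)['proxy_db']['proxys']
        self.proxies = ['http://%s:%s' % (p['IP'], p['Port'])
                        for p in collection.find()]

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's HttpProxyMiddleware reads this key and routes the request
            request.meta['proxy'] = random.choice(self.proxies)

Enable it in settings.py under DOWNLOADER_MIDDLEWARES and each request will go out through a randomly chosen proxy.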

The core code for parsing and verification is posted below. First, the parsing part:

def parser(self, html_cont):
    """
    :param html_cont:
    :return:
    """
    if html_cont is None:
        return
    # parse the HTML using the BeautifulSoup module
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    tr_nodes = soup.find_all('tr', class_=True)
    for tr_node in tr_nodes:
        proxy = proxy_infor()
        i = 0
        for th in tr_node.children:
            if th.string is not None and len(th.string.strip()) > 0:
                # the first two non-empty cells are the IP and the port
                proxy.proxy[proxy.proxyname[i]] = th.string.strip()
                print 'proxy', th.string.strip()
                i += 1
                if i > 1:
                    break
        self.db_helper.insert(
            {proxy.proxyname[0]: proxy.proxy[proxy.proxyname[0]],
             proxy.proxyname[1]: proxy.proxy[proxy.proxyname[1]]},
            proxy.proxy)

  

And the core code of the verification part:

def detect(self):
    """
    Use http://ip.chinaz.com/getip.aspx as the detection target
    :return:
    """
    proxys = self.db_helper.proxys.find()
    badnum = 0
    goodnum = 0
    for proxy in proxys:
        ip = proxy['IP']
        port = proxy['Port']
        try:
            proxy_host = 'http://' + ip + ':' + port
            response = urllib.urlopen(self.url, proxies={'http': proxy_host})
            if response.getcode() != 200:
                self.db_helper.delete({'IP': ip, 'Port': port})
                badnum += 1
                print proxy_host, 'bad proxy'
            else:
                goodnum += 1
                print proxy_host, 'success proxy'
        except Exception, e:
            print proxy_host, 'bad proxy'
            self.db_helper.delete({'IP': ip, 'Port': port})
            badnum += 1
            continue
    print 'success proxy num: ', goodnum
    print 'bad proxy num: ', badnum

  

That is it for today's share. If you found it helpful, remember to support me. The code has been uploaded to GitHub: https://github.com/qiyeboy/proxySpider_normal

You are also welcome to follow my public account:

This article is an original work; everyone is welcome to repost and share it. Please respect the original and credit the source when reposting: Seven Nights' Story, http://www.cnblogs.com/qiyeboy/

