Introduction to Python Crawlers (ii)--IP proxy usage

Source: Internet
Author: User

In the previous section, I gave a rough outline of how to write a Python crawler. Starting with this section, the focus is on how to get past the restrictions you run into while crawling, such as IP limits, JavaScript, and verification codes (CAPTCHAs). This section is mainly about using IP proxies to get past IP limits.

1. About proxies

Simply put, a proxy changes your identity, and on the network one form of identity is your IP address. For example, from behind the firewall, direct access to Google, YouTube, Facebook, and similar sites gives a 404, but switching to another IP, such as a foreign one, gets around the block. That is a simple proxy.

When crawling, you will find that some sites, to guard against crawlers or DDoS attacks, record the number of visits from each IP. For example, a site may allow a single IP only 10 requests per second (or some other limit). In that case we need to send requests from different IPs (the exact strategy is up to you).
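One possible strategy is simple round-robin rotation: each request uses the next proxy in the list, and the list wraps around when it runs out. The sketch below is only an illustration in Python 3. The function name make_rotation and the sample addresses (taken from the reserved documentation range 192.0.2.0/24) are assumptions, not part of the original code.

```python
import itertools


def make_rotation(proxies):
    # itertools.cycle yields the items in order and starts over
    # from the first one when the list is exhausted.
    return itertools.cycle(proxies)


# Hypothetical proxy addresses, for illustration only.
sample_proxies = ["192.0.2.1:8080", "192.0.2.2:3128"]
rotation = make_rotation(sample_proxies)

# Pick a proxy for each of three consecutive requests.
picked = [next(rotation) for _ in range(3)]
print(picked)
```

Other strategies, such as picking a proxy at random or dropping a proxy after it fails, would fit the same place in a crawler.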

So the question is, where do these proxies come from? A company can simply buy proxy IPs, but for an individual that may be a waste of money. So what do we do? There are many free proxy IP sites on the Internet, but switching proxies by hand wastes time, and many of the free IPs do not work. So we can collect the IPs with a crawler, which the code from the previous section is fully capable of. Here we test with http://www.xicidaili.com/nn/1. Disclaimer: this is for learning and discussion only; do not use it for commercial purposes.

2. The code for obtaining the proxy IPs is as follows:

  

    #encoding=utf8
    # Python 2, using the BeautifulSoup 3 package
    import urllib2
    import BeautifulSoup

    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
    header = {}
    header['User-Agent'] = user_agent

    url = 'http://www.xicidaili.com/nn/1'
    req = urllib2.Request(url, headers=header)
    res = urllib2.urlopen(req).read()

    soup = BeautifulSoup.BeautifulSoup(res)
    ips = soup.findAll('tr')
    f = open("../src/proxy", "w")

    # Start at 1, so the first row is skipped
    for x in range(1, len(ips)):
        ip = ips[x]
        tds = ip.findAll("td")
        # Columns 2 and 3 are written out as "IP<tab>port"
        ip_temp = tds[2].contents[0] + "\t" + tds[3].contents[0] + "\n"
        # print tds[2].contents[0] + "\t" + tds[3].contents[0]
        f.write(ip_temp)

    f.close()

Code Description:

a). Here we use the urllib2 module because this request is a bit special: the server checks the headers of the request (if in doubt, consult the HTTP documentation).

b). The difference between urllib2 and urllib is that urllib2 lets you attach parameters, such as headers, when sending a request (this is the only difference I use for now).

c). open() opens a file. The first parameter is the path to the file. It can be an absolute path, such as E:\\proxy (the backslash "\" is an escape character in strings, so "\\" is needed to represent a literal "\"). It can also be a relative path, such as "../src/proxy", which is the location of the file relative to the code. The second parameter, "w", is the mode the file is opened in: "w" means write and "r" means read. This convention is common on many systems, such as Linux.

d). The for loop. If you have learned Java or another language before, you may not be used to it, because those languages write for (;;). A for loop in Python assigns x, in order, each value of the sequence after in.

Special note: do not forget the colon (":") at the end of the for statement.

e). The range function generates a series of numbers. For example, range(0, 6, 1) starts at 0, ends at 6 (excluding 6), and increases by 1 each time (that is, the step is 1), so the result is the list [0, 1, 2, 3, 4, 5].

f). f.write() writes data to the file. If the file was not opened in "w" mode, you cannot write to it.
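The notes above can be tried out without touching the network. The following is a minimal Python 3 sketch, not the author's code: urllib.request stands in for Python 2's urllib2, and the names build_request, write_proxy_file, and sample_rows are hypothetical. The sample rows use addresses from the reserved documentation range and take the place of the parsed table rows.

```python
import os
import tempfile
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) "
              "Gecko/20100101 Firefox/43.0")


def build_request(url):
    # Python 3 counterpart of urllib2.Request(url, headers=header).
    # Nothing is sent until the request is passed to urlopen().
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def write_proxy_file(rows, path):
    # rows[0] is assumed to be the table header, so the loop starts
    # at 1, like range(1, len(ips)) in the listing.
    with open(path, "w") as f:  # "w" opens the file for writing
        for x in range(1, len(rows)):
            ip, port = rows[x]
            f.write(ip + "\t" + port + "\n")


# Hypothetical rows standing in for the parsed table.
sample_rows = [("IP", "Port"), ("192.0.2.1", "8080"), ("192.0.2.2", "3128")]
proxy_path = os.path.join(tempfile.mkdtemp(), "proxy")
write_proxy_file(sample_rows, proxy_path)

with open(proxy_path) as f:  # the default mode is "r", read-only
    written = f.read()

req = build_request("http://example.com/")

# In Python 3, range() is lazy, so list() is needed to see the values.
numbers = list(range(0, 6, 1))
```

The with statement closes the file automatically, which is why this sketch has no explicit close() call.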


3. Not all proxies are usable, for a number of reasons: our network may be unable to reach the proxy, or the proxy may be unable to reach our target URL. So we have to verify them. Taking http://ip.chinaz.com/getip.aspx as the target URL (a URL for testing the IP address), the code is as follows:

  

    #encoding=utf8
    # Python 2
    import urllib
    import socket

    socket.setdefaulttimeout(3)

    f = open("../src/proxy")
    lines = f.readlines()
    f.close()

    # Build one proxy dict per "IP<tab>port" line
    proxys = []
    for i in range(0, len(lines)):
        ip = lines[i].strip("\n").split("\t")
        proxy_host = "http://" + ip[0] + ":" + ip[1]
        proxy_temp = {"http": proxy_host}
        proxys.append(proxy_temp)

    url = "http://ip.chinaz.com/getip.aspx"
    for proxy in proxys:
        try:
            res = urllib.urlopen(url, proxies=proxy).read()
            print res
        except Exception, e:
            # Report the failing proxy and move on to the next one
            print proxy
            print e
            continue

Code Description:

a). ip = lines[i].strip("\n").split("\t") removes the line break at the end of each line (that is, "\n"), then splits the string on the tab character (that is, "\t") into a list of strings.

b). proxy_temp = {"http": proxy_host}, where "http" is the type of proxy. Besides HTTP there are HTTPS, SOCKS, and so on; HTTP is used here as the example.

c). urllib.urlopen(url, proxies=proxy), where proxies is the proxy to use. The target URL is accessed through the proxy.

d). socket.setdefaulttimeout(3) sets the global timeout to 3 s. That is, if a request gets no response within 3 s, the access is abandoned and a timeout error is raised.
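For readers on Python 3, where urllib.urlopen(url, proxies=...) is no longer available, the same check can be sketched with urllib.request.ProxyHandler. This is an assumed equivalent rather than the author's code; load_proxies and check_proxy are hypothetical names, and the timeout is passed per request instead of being set globally.

```python
import urllib.request


def load_proxies(lines):
    # Turn "IP<tab>port" lines into proxy dicts; lines that do not
    # have exactly two fields are skipped.
    proxies = []
    for line in lines:
        parts = line.strip("\n").split("\t")
        if len(parts) != 2:
            continue
        host, port = parts
        proxies.append({"http": "http://" + host + ":" + port})
    return proxies


def check_proxy(proxy, url, timeout=3):
    # Send one request through the proxy. Returns the response body,
    # or None if the request fails or times out.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    try:
        with opener.open(url, timeout=timeout) as res:
            return res.read()
    except Exception as e:
        # Like the listing, report the failing proxy and carry on.
        print(proxy, e)
        return None


# Hypothetical file contents; the second line is deliberately malformed.
sample_lines = ["192.0.2.1\t8080\n", "bad line\n"]
parsed = load_proxies(sample_lines)
print(parsed)
```

Calling check_proxy(parsed[0], url) for each entry would then mirror the loop in the listing; it is left out here because it needs network access.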

The run results show that not many of the proxies are usable, but that is enough for personal use.

That concludes this introduction to using IP proxies.

Note:

1. The code is for learning and discussion only and should not be used for commercial purposes.

2. If you find a problem with the code, please point it out.

3. Please credit the source when reprinting.
