Python Crawler (2): Using IP Proxies
The previous section described how to write a basic Python crawler. Starting from this section, we look at how to get past the restrictions commonly met while crawling, such as IP limits, JavaScript, and CAPTCHAs. This section focuses on using an IP proxy to get around IP-based restrictions.
1. About Proxies
Simply put, a proxy gives you a new identity, and on the network one such identity is your IP address. For example, if we want to access Google, YouTube, or Facebook from inside the Great Firewall, direct access fails, so we switch to an IP address that is not blocked, such as one outside China. That is a proxy in its simplest form.
In crawling, some websites record the number of visits from each IP address to fend off crawlers or DDoS attacks. For example, a site may allow a single IP address only 10 requests per second (or per some other window).
So the question is: where do the proxies come from? A company can simply buy proxy IPs, but for an individual that may be money wasted. What then? There are many free proxy IP websites on the Internet, but copying addresses by hand wastes time, and many of the free IPs are unusable anyway. So we can crawl such IPs with a crawler, reusing the code from the previous section. Here we test against http://www.xicidaili.com/nn/1. Disclaimer: this is for learning and communication only; do not use it for commercial purposes.
2. Obtaining Proxy IP Addresses. The code is as follows:
#encoding=utf8
import urllib2
import BeautifulSoup

User_Agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
header = {}
header['User-Agent'] = User_Agent

url = 'http://www.xicidaili.com/nn/1'
req = urllib2.Request(url, headers=header)
res = urllib2.urlopen(req).read()

soup = BeautifulSoup.BeautifulSoup(res)
ips = soup.findAll('tr')           # each proxy sits in a table row
f = open("../src/proxy", "w")
for x in range(1, len(ips)):       # start at 1 to skip the table header row
    ip = ips[x]
    tds = ip.findAll("td")
    ip_temp = tds[2].contents[0] + "\t" + tds[3].contents[0] + "\n"  # IP, tab, port
    f.write(ip_temp)
f.close()
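After a run, each line of ../src/proxy holds one candidate proxy as an IP address and a port separated by a tab character; the verification script in section 3 assumes exactly this format.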
Code Description:
A). We use the urllib2 module here because this request is a little special: the server checks the headers (in particular the User-Agent) of the request. If that is unfamiliar, look up how HTTP request headers work.
B). The difference between urllib2 and urllib is that urllib2 can attach parameters such as headers when sending a request (that is the only difference I use here; see the first sketch after this list).
C). open() opens a file. The first parameter is the file path, either absolute, such as E:\proxy (in code, "\\" is written for a literal "\", since "\" is an escape character), or relative, such as "../src/proxy", i.e. the location of the file relative to the code. The second parameter is the mode: "w" means write permission and "r" means read permission, a convention common to many systems, such as Linux.
D). The for loop. If you learned Java or another such language first, this may feel unfamiliar. In Python, for x in ... assigns to x, in order, each value produced by whatever follows in.
Note: do not forget the colon (":") at the end of the for statement.
E). The range function generates a sequence of numbers. range(0, 6, 1) runs from 0 up to (but not including) 6, adding 1 (the step) each time, producing the list [0, 1, 2, 3, 4, 5].
F). f.write() writes data to the file; it fails if the file was not opened with write permission ("w"). The second sketch after this list shows points C through F together.
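To make point B concrete, here is a minimal sketch contrasting the two modules; whether the bare urllib request succeeds depends on the site, which is exactly why the crawler above goes through urllib2 with a User-Agent:

#encoding=utf8
import urllib
import urllib2

url = 'http://www.xicidaili.com/nn/1'

# urllib.urlopen() takes no headers argument, so the request goes out
# with Python's default User-Agent, which many sites reject
res1 = urllib.urlopen(url).read()

# urllib2 lets us wrap the URL in a Request object that carries headers
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res2 = urllib2.urlopen(req).read()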
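And a throwaway sketch of points C through F (demo.txt is just a hypothetical file name):

#encoding=utf8
f = open("demo.txt", "w")     # "w" = write permission; "r" would be read-only
for x in range(0, 6, 1):      # range(0, 6, 1) yields [0, 1, 2, 3, 4, 5]
    f.write(str(x) + "\n")    # write() works because the file was opened with "w"
f.close()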
Page: (screenshot omitted)
Running result: (screenshot omitted)
3. Not every proxy is usable, for many reasons: our network may not be able to reach the proxy, or the proxy may not be able to reach the target website. So we need to verify them. Here we use http://ip.chinaz.com/getip.aspx as the target (a URL that reports the IP address you are visiting from). The code is as follows:
#encoding=utf8
import urllib
import socket

socket.setdefaulttimeout(3)   # give up on any request after 3 seconds

# read the proxies crawled in section 2 and build urllib-style proxy dicts
f = open("../src/proxy")
lines = f.readlines()
f.close()
proxys = []
for i in range(0, len(lines)):
    ip = lines[i].strip("\n").split("\t")    # -> ["ip", "port"]
    proxy_host = "http://" + ip[0] + ":" + ip[1]
    proxy_temp = {"http": proxy_host}
    proxys.append(proxy_temp)

# try each proxy against a page that echoes the visitor's IP
url = "http://ip.chinaz.com/getip.aspx"
for proxy in proxys:
    try:
        res = urllib.urlopen(url, proxies=proxy).read()
        print res
    except Exception, e:
        print proxy
        print e
        continue
Code Description:
A). ip = lines[i].strip("\n").split("\t") removes the newline ("\n") at the end of each line, then splits the string into an array on the tab character ("\t"); see the sketch after this list.
B). proxy_temp = {"http": proxy_host}: the key "http" is the proxy type. Besides http there are also https and SOCKS proxies; here we use http as the example.
C). urllib.urlopen(url, proxies=proxy): the proxies parameter names the proxy to use, so the target URL is fetched through it.
D). socket.setdefaulttimeout(3) sets the global timeout to 3 seconds. That is, if a request has had no response after 3 seconds, the attempt is aborted and a timeout error is raised.
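A minimal sketch of points A and B, parsing one line of the proxy file into the dict that urllib expects (the address is made up for illustration):

#encoding=utf8
line = "111.161.126.100\t8080\n"      # one line as written by section 2
ip = line.strip("\n").split("\t")     # -> ["111.161.126.100", "8080"]
proxy = {"http": "http://" + ip[0] + ":" + ip[1]}
print proxy                           # {'http': 'http://111.161.126.100:8080'}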
Running result: (screenshot omitted)
Not many of the proxies turn out to be usable, but it is enough for personal use.
This concludes the use of IP proxies.
Note:
1. The code is for learning and communication only. Do not use it for commercial purposes.
2. If there are problems with the code, please point them out.
3. Please indicate the source when reprinting.