The path to Python crawler growth (2): crawling proxy IP addresses and multi-thread verification

Last Update:2016-10-24 Source: Internet

Author: User

As mentioned in the previous article, one way to get around anti-crawler restrictions is to use several proxy IP addresses. The prerequisite is that we have valid proxy IP addresses. This article walks through capturing proxy IP addresses and quickly verifying their validity with multiple threads.

1. Capture the proxy IP address

There are quite a lot of websites that provide free proxy IP addresses. After I crawled the Xici proxy site (xicidaili.com) too aggressively, my IP address was blocked, so I switched to "IP bus" (ip84.com) and slowed down the crawling speed. The capture code is as follows:

    import urllib.request
    import re
    import time
    import random

    # Capture proxy IPs
    ip_totle = []  # content list of all pages
    for page in range(2, 6):  # pages 2-5, matching the complete code at the end
        url = 'http://ip84.com/dlgn/' + str(page)
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
        request = urllib.request.Request(url=url, headers=headers)
        response = urllib.request.urlopen(request)
        content = response.read().decode('utf-8')
        print('get page', page)
        pattern = re.compile(r'<td>(\d.*?)</td>')  # intercept the content between <td> and </td>
        ip_page = re.findall(pattern, str(content))
        ip_totle.extend(ip_page)
        time.sleep(random.choice(range(1, 3)))  # pause 1 or 2 seconds between pages

    # Print the captured content
    print('proxy IP address', '\t', 'port', '\t', 'speed', '\t', 'verification time')
    for i in range(0, len(ip_totle), 4):
        print(ip_totle[i], ' ', '\t', ip_totle[i + 1], '\t', ip_totle[i + 2], '\t', ip_totle[i + 3])

The above code captures the mainland China high-speed proxy IP addresses listed on IP bus. You can modify the URL yourself to capture other regions or types. Possibly because the site content is updated in real time, capturing from the first page is not very stable, so I start from the second page and print the results.
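To see what the regular expression extracts without touching the network, here is a small offline sketch. The HTML fragment is invented for illustration (it is not real data from the site), and it assumes each table row holds four digit-leading cells in the order address, port, speed, verification time, which is what the printing loop in the capture code expects.

```python
import re

# Hypothetical HTML fragment, invented for illustration only.
sample_html = """
<tr><td>10.0.0.1</td><td>8080</td><td>0.5s</td><td>2016-10-24 10:00</td></tr>
<tr><td>10.0.0.2</td><td>3128</td><td>1.2s</td><td>2016-10-24 10:05</td></tr>
"""

# Same idea as the article's pattern: capture every cell whose text starts with a digit.
pattern = re.compile(r'<td>(\d.*?)</td>')
cells = pattern.findall(sample_html)

# Group the flat list of cells into rows of four fields.
rows = [cells[i:i + 4] for i in range(0, len(cells), 4)]
for address, port, speed, checked in rows:
    print(address, port, speed, checked, sep='\t')
```

If the real page contains other cells that start with a digit, the grouping by four would drift, so it is worth checking the row count against the page before trusting the result.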

2. Verify the validity of the proxy IP address

Our network may be unable to reach the proxy, or the proxy may be unable to reach the target URL, so a captured proxy may be invalid. It is therefore necessary to verify each captured proxy IP address. The urllib.request package lets you set a proxy for accessing a web page. The code is as follows:

    import urllib.request

    url = "http://quote.stockstar.com/stock"  # web page to open
    proxy_ip = {'http': '27.17.32.142:80'}  # proxy IP to test
    proxy_support = urllib.request.ProxyHandler(proxy_ip)
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
    urllib.request.install_opener(opener)
    print(urllib.request.urlopen(url).read())

If the proxy IP address is valid, the source code of the web page is printed; otherwise, an error is raised. So we can use the above code to verify the captured proxy IPs one by one.
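For checking many proxies it can help to wrap this logic in a function that returns True or False instead of raising. The following is a minimal sketch, not the article's own code: the function name check_proxy and its timeout parameter are assumptions introduced here, and it uses a private opener (opener.open) rather than install_opener so that the process-wide default is left untouched.

```python
import http.client
import urllib.request


def check_proxy(proxy, url="http://quote.stockstar.com/stock", timeout=5):
    """Return True if `url` can be fetched through `proxy`, else False.

    `proxy` is a dict such as {'http': '27.17.32.142:80'}.
    """
    handler = urllib.request.ProxyHandler(proxy)
    # A private opener avoids changing the process-wide default opener.
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and refused connections are all OSError subclasses.
        return False
```

A call such as check_proxy({'http': '27.17.32.142:80'}) would then report whether that proxy currently works; the per-call timeout keeps a dead proxy from stalling the check indefinitely.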

3. Fast multi-thread verification

Verifying proxy IPs one by one in sequence is slow. In Python, multithreading is similar to running several different programs at the same time. With multiple threads, you can move long-running tasks into the background, which is especially useful for tasks that spend most of their time waiting, such as network requests.
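The effect on waiting-dominated work can be illustrated with a small timing sketch. Here time.sleep stands in for a slow network request, and the names slow_task, run_sequential and run_threaded are made up for this example.

```python
import threading
import time


def slow_task(seconds):
    # Stand-in for a network request: the thread just waits.
    time.sleep(seconds)


def run_sequential(n, seconds):
    start = time.perf_counter()
    for _ in range(n):
        slow_task(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    start = time.perf_counter()
    threads = [threading.Thread(target=slow_task, args=(seconds,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential_time = run_sequential(5, 0.2)
threaded_time = run_threaded(5, 0.2)
print('sequential: %.2fs, threaded: %.2fs' % (sequential_time, threaded_time))
```

The sequential run has to wait for each task in turn, while the threaded run lets all five waits overlap, so its total time is close to that of a single task.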

Python supports threads through two standard library modules: thread (renamed _thread in Python 3) and threading. The thread module provides low-level, primitive threads and a simple lock. threading is the commonly used, higher-level multithreading module with more features. It provides a Thread class, and each object instantiated from it represents one thread. The following describes how to use the classes in the threading module.

First, the thread lock. If multiple threads operate on the same object at the same time and the object is not properly protected, the results may be unpredictable. For example, a thread may be paused halfway through printing a line while another thread runs, so the output appears interleaved and messy. This is called being "thread unsafe". The threading module provides the threading.Lock class for this situation. We create one object of this class; a thread acquires the lock before working on the shared object and releases it afterwards. This ensures that only one thread holds the lock at a time, so operations on the shared object no longer cause thread-safety problems. Concretely, we create a threading.Lock object named lock and call lock.acquire() to obtain the lock. Other threads then cannot obtain it; they block at their own lock.acquire() call until the thread holding the lock releases it with lock.release().
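A minimal sketch of the lock in use, with a shared counter standing in for the shared object. The names counter and add_many are invented for this example, and it uses the `with lock:` form, which is shorthand for calling lock.acquire() and lock.release() around the block.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # `with lock:` calls lock.acquire() on entry and lock.release() on exit,
        # even if the body raises an exception.
        with lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Because every increment happens while the lock is held, no update can be lost, and the final value is always 4 x 10000.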

Next, the Thread class in the threading module. Its signature is class threading.Thread(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None). The constructor should always be called with keyword arguments, which are described below:
group: reserved for future extension; it should be None.
target: the callable object invoked by the run() method. The default is None, which means nothing is called.
name: the name of the thread. By default, a unique name of the form "Thread-N" is constructed, where N is a small decimal number.
args: the argument tuple passed to target.
kwargs: a dictionary of keyword arguments passed to target.
daemon: sets explicitly whether the thread is a daemon thread. If it is None (the default), the daemon property is inherited from the current thread.
If a subclass overrides this constructor, it must make sure to call the base class constructor Thread.__init__() before doing anything else to the thread. The methods used in this article include:
start(self)
Starts the thread. It can be called at most once per thread object. It arranges for the object's run() method to be invoked in a separate thread of control, so the object must have a run() method. When you instantiate Thread directly, it already has a run() method (which calls target), so you do not need to provide one. When you create a subclass of Thread, however, you generally override run().
join(self, timeout=None)
Blocks the calling thread (here, the main thread) until the thread whose join() method is called finishes or the timeout expires. timeout is a number of seconds and may be an integer or a floating-point value, so fractions of a second are allowed. join() always returns None, so after a join with a timeout you must call is_alive() (spelled isAlive() in older Python versions) to find out whether the thread has ended. If the thread is still alive, the join timed out, and you can call join() again or do other processing. When timeout is omitted or None, the call blocks until the thread ends. A thread can be joined multiple times.
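The constructor arguments and the start(), join() and is_alive() methods described above can be tried together in a short sketch. The worker function, the 'job-1' label and the 'checker-1' thread name are made up for illustration; a threading.Event is used only to keep the worker waiting long enough for the timed join to expire.

```python
import threading

finish = threading.Event()


def worker(label):
    # Wait until the main thread signals that the work may finish.
    finish.wait()
    print(label, 'done in', threading.current_thread().name)


t = threading.Thread(target=worker, args=('job-1',), name='checker-1')
t.start()

t.join(timeout=0.1)           # returns None after about 0.1 s; the worker is still waiting
still_running = t.is_alive()  # True: the join timed out

finish.set()                  # let the worker finish
t.join()                      # no timeout: block until the thread ends
finished = not t.is_alive()   # True
print(still_running, finished)
```

Since join() returns None in both cases, is_alive() is the only way to tell a timed-out join from a completed one.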

The main program for multi-thread verification is as follows:

    # Multi-thread verification
    threads = []
    for i in range(len(proxys)):
        thread = threading.Thread(target=test, args=[i])
        threads.append(thread)
        thread.start()
    # Block the main process and wait for all sub-threads to end
    for thread in threads:
        thread.join()

At first, when I passed the argument as args=(i), I got the error 'test() argument after * must be an iterable, not int'. It happened to work when I changed the parentheses to brackets. For the moment I don't know why, and I hope someone can tell me the cause. When the program runs, it prints each proxy followed by 'is OK' if it passed verification, or by the error message otherwise.
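On the args=(i) error described above: the message is consistent with how Python reads parentheses. Parentheses alone only group an expression, so (i) is just the integer i; a one-element tuple needs a trailing comma, as in (i,). A list such as [i] is also iterable, which would explain why brackets worked. The sketch below demonstrates this; the function verify is a stand-in for the article's test() and is not part of the original code.

```python
import threading

print(type((1)))    # <class 'int'>: the parentheses only group the expression
print(type((1,)))   # <class 'tuple'>: the trailing comma makes a tuple
print(type([1]))    # <class 'list'>: also iterable, so it is accepted as args

results = []


def verify(i):
    # Stand-in for the article's test(i).
    results.append(i)


# Both forms pass exactly one positional argument to verify().
for args in ((1,), [2]):
    thread = threading.Thread(target=verify, args=args)
    thread.start()
    thread.join()

print(results)  # [1, 2]
```

The tuple form args=(i,) is the more conventional spelling, since the Thread documentation describes args as an argument tuple.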

Multi-thread verification is several times faster than single-thread verification. When you need to crawl a large number of web pages, you can use this program to collect some valid proxy IP addresses first, which helps avoid having your own IP address blocked. The complete Python 3 code for capturing proxy IP addresses and quickly verifying them with multiple threads is as follows:

    import urllib.request
    import re
    import time
    import random
    import socket
    import threading

    # Capture proxy IPs
    ip_totle = []
    for page in range(2, 6):
        url = 'http://ip84.com/dlgn/' + str(page)  # IP bus
        # url = 'http://www.xicidaili.com/nn/' + str(page)  # Xici proxy
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
        request = urllib.request.Request(url=url, headers=headers)
        response = urllib.request.urlopen(request)
        content = response.read().decode('utf-8')
        print('get page', page)
        pattern = re.compile(r'<td>(\d.*?)</td>')  # intercept the content between <td> and </td>
        ip_page = re.findall(pattern, str(content))
        ip_totle.extend(ip_page)
        time.sleep(random.choice(range(1, 3)))

    # Print the captured content
    print('proxy IP address', '\t', 'port', '\t', 'speed', '\t', 'verification time')
    for i in range(0, len(ip_totle), 4):
        print(ip_totle[i], ' ', '\t', ip_totle[i + 1], '\t', ip_totle[i + 2], '\t', ip_totle[i + 3])

    # Sort out the proxy IP format
    proxys = []
    for i in range(0, len(ip_totle), 4):
        proxy_host = ip_totle[i] + ':' + ip_totle[i + 1]
        proxy_temp = {"http": proxy_host}
        proxys.append(proxy_temp)

    proxy_ip = open('proxy_ip.txt', 'w')  # create a file for storing valid IP addresses
    lock = threading.Lock()  # create a lock


    # Verify the validity of one proxy IP
    def test(i):
        socket.setdefaulttimeout(5)  # set the global timeout
        url = "http://quote.stockstar.com/stock"  # web page to open
        try:
            proxy_support = urllib.request.ProxyHandler(proxys[i])
            opener = urllib.request.build_opener(proxy_support)
            opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
            # Use this thread's own opener: install_opener() sets a process-wide
            # opener that concurrent threads would overwrite.
            res = opener.open(url).read()
            lock.acquire()  # obtain the lock
            print(proxys[i], 'is OK')
            proxy_ip.write('%s\n' % str(proxys[i]))  # write this proxy IP address
            lock.release()  # release the lock
        except Exception as e:
            lock.acquire()
            print(proxys[i], e)
            lock.release()


    # Single-thread verification
    '''
    for i in range(len(proxys)):
        test(i)
    '''

    # Multi-thread verification
    threads = []
    for i in range(len(proxys)):
        thread = threading.Thread(target=test, args=[i])
        threads.append(thread)
        thread.start()
    # Block the main process and wait for all sub-threads to end
    for thread in threads:
        thread.join()

    proxy_ip.close()  # close the file
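Once proxy_ip.txt exists, a later crawl can read the verified proxies back and pick one per request. The sketch below assumes each line of the file is the text of a proxy dict, such as {'http': '1.2.3.4:80'}, which is what writing str(proxys[i]) produces; the helper names load_proxies and pick_proxy are made up for this example.

```python
import ast
import random


def load_proxies(path='proxy_ip.txt'):
    """Read one proxy dict per line, as written by str(proxys[i])."""
    proxies = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval parses the dict literal without executing code.
                proxies.append(ast.literal_eval(line))
    return proxies


def pick_proxy(proxies):
    """Choose a random proxy so requests are spread across addresses."""
    return random.choice(proxies)
```

The chosen dict can be passed straight to urllib.request.ProxyHandler, in the same way as proxys[i] in the verification function. Free proxies tend to stop working quickly, so it is probably worth re-running the verification shortly before a large crawl.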


This article is an English version of an article originally published in Chinese on aliyun.com.