Python crawler growth path: Crawling proxy IPs and multithreading verification

Source: Internet
Author: User
Tags: thread, class

Last time we said that one way to get around anti-crawler limits is to rotate through several proxy IPs, but that presumes we have valid proxy IPs in hand. Below we walk through crawling proxy IPs and then quickly verifying their validity with multithreading.

One, Crawling proxy IPs

Quite a few sites provide free proxy IPs. I hammered 'West Thorn Proxy' (xicidaili.com) so hard that it blocked my IP, so I had no choice but to switch to 'IP bus' (ip84.com) and obediently slow down the crawl. The grab code is below.

import urllib.request
import re
import time
import random

# Crawl proxy IPs
ip_totle = []  # holds every table cell scraped from all pages
for page in range(2, 6):
    url = 'http://ip84.com/dlgn/' + str(page)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print('Get page', page)
    pattern = re.compile(r'<td>(\d.*?)</td>')  # capture cells between <td> and </td> that start with a digit
    ip_page = re.findall(pattern, str(content))
    ip_totle.extend(ip_page)
    time.sleep(random.choice(range(1, 3)))

# Print the crawled content
print('Proxy IP address', '\t', 'Port', '\t', 'Speed', '\t', 'Validation time')
for i in range(0, len(ip_totle), 4):
    print(ip_totle[i], '\t', ip_totle[i+1], '\t', ip_totle[i+2], '\t', ip_totle[i+3])

Copy the code above to crawl mainland high-anonymity proxy IPs from IP bus; for other regions or proxy types, change the URL yourself. Perhaps because the site's content updates in real time, crawling from the first page was unstable, so I started from the second page. Part of the printed output looks like this:

Two, Verifying the validity of proxy IPs

Because our network may not be able to reach a given proxy, or the proxy may not be able to reach the destination URL, some of the proxies we crawled may be invalid, so we need to verify each fetched proxy IP. The ProxyHandler class in the urllib.request package lets us access a web page through a proxy, with the following code:

import urllib.request

url = "http://quote.stockstar.com/stock"   # the page we intend to crawl
proxy_ip = {'http': '27.17.32.142:80'}     # the proxy IP to verify
proxy_support = urllib.request.ProxyHandler(proxy_ip)
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
urllib.request.install_opener(opener)
print(urllib.request.urlopen(url).read())

If the proxy IP is valid, the page source is printed; otherwise an error is raised. So we can verify the fetched proxy IPs one by one with the code above.
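As a sketch of that one-by-one check, the helper below wraps the proxied request in try/except and returns a boolean instead of crashing. Note that check_proxy and its injectable fetch parameter are my own illustrative names, not part of the original code; the fetch hook exists only so the success/failure logic can be exercised without touching the network.

```python
import urllib.request

def check_proxy(proxy, url, fetch=None):
    # Hypothetical helper: return True if `url` is reachable through `proxy`.
    # `fetch` is injectable so the logic can be tested without a network.
    if fetch is None:
        def fetch(u):
            support = urllib.request.ProxyHandler(proxy)
            opener = urllib.request.build_opener(support)
            return opener.open(u, timeout=5).read()
    try:
        fetch(url)
        return True
    except Exception:
        return False

# Exercise both branches with stub fetchers (no network needed):
print(check_proxy({'http': '1.2.3.4:80'}, 'http://example.com',
                  fetch=lambda u: b'ok'))      # True: the fetch succeeded
def refused(u):
    raise OSError('connection refused')
print(check_proxy({'http': '1.2.3.4:80'}, 'http://example.com',
                  fetch=refused))              # False: the fetch raised
```

In real use you would simply call check_proxy(proxy, url) and let the default urllib fetch run against the live proxy.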

Three, Fast verification with multithreading

Verifying the proxy IPs one at a time is slow, but Python has a multithreading module. Multithreading is akin to running several different programs concurrently; it lets tasks that occupy long stretches of a program's time be handled in the background, which is especially useful when the program spends most of its time waiting, as it does here.

Python supports threads through two standard library modules: the low-level _thread module (named thread in Python 2), which provides raw threads and a simple lock, and threading, the most commonly used multithreading module, which is more full-featured. The threading module provides a Thread class that can be instantiated, each instance representing one thread. Below we introduce the classes from the threading module used in this article.

First, the thread lock. If multiple threads operate on one object at the same time and the object is not protected, the program's results can differ from what we expect. For example, one thread's print statement may have output only half of its characters when the thread is paused and another thread runs, so the output we see is jumbled; this behavior is called "thread-unsafe". For this, the threading module provides the threading.Lock class. We create a Lock object, "grab" the lock before the critical part of the thread function executes, and "release" it when that part completes; this guarantees that only one thread holds the lock at any moment, so operations on the shared object are no longer thread-unsafe. Concretely, we first create a threading.Lock object lock and call lock.acquire() to obtain the lock. Other threads can then no longer acquire it; they will block at their own lock.acquire() calls until the lock is freed by lock.release().
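The acquire/release pattern described above can be sketched as a minimal counter example. Four threads increment a shared counter; the `with lock:` statement is shorthand for calling lock.acquire() on entry and lock.release() on exit, so the increment is protected (the names bump, counter, and the thread count are illustrative, not from the original code):

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:        # acquire() on entry, release() on exit
            counter += 1  # protected: only one thread updates at a time

threads = [threading.Thread(target=bump, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 40000; without the lock, updates could be lost
```

Without the lock, the read-modify-write of counter += 1 can interleave between threads and some increments are silently lost.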

Next, the Thread class in the threading module. Its constructor is threading.Thread(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None), and it is usually called with keyword arguments:
group: reserved for future extension; ignore it for now.
target: a callable object that will be invoked by the run() method. Defaults to None, meaning nothing is called.
name: the thread name. By default, a unique name of the form "Thread-N" is constructed, where N is a small decimal number.
args: a tuple of positional arguments for the target invocation.
kwargs: a dictionary of keyword arguments for the target invocation.
daemon: whether the thread is a daemon; if the daemon property is not set explicitly, it is inherited from the current thread.
If a subclass overrides this constructor, it must call the base class constructor Thread.__init__() before doing anything else. The Thread methods used in this article are:
start(self)
Starts the thread's activity; it may be called at most once per thread object. It arranges for the object's run() method to be invoked in a separate thread of control. In other words, the object must have a run() method; when instantiating Thread directly this can be ignored, because Thread already provides run(), but when subclassing Thread we typically override run() in the subclass.
join(self, timeout=None)
Blocks the calling thread until the thread whose join() method was called terminates or the timeout expires. timeout should be a floating-point number specifying the timeout in seconds (fractions of a second are fine). join() always returns None, so after a join with a timeout you must call is_alive() to determine whether the thread has ended; if it is still alive, the join timed out, and you can call join() again or do other processing. When timeout is absent or None, the call blocks until the joined thread terminates. A thread's join() may be called many times.
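The points above (call the base constructor first, override run() in a subclass, then start() and join()) can be sketched together in one small example; the Worker class and the squaring task are illustrative, not from the original code:

```python
import threading

class Worker(threading.Thread):
    def __init__(self, n, results):
        super().__init__()   # call the base constructor before anything else
        self.n = n
        self.results = results

    def run(self):           # start() arranges for this to run in the new thread
        self.results[self.n] = self.n * self.n

results = {}
workers = [Worker(n, results) for n in range(5)]
for w in workers:
    w.start()                # begin each thread's activity
for w in workers:
    w.join()                 # block until each worker has finished
print(sorted(results.items()))  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```

Because each worker writes a distinct key and the main thread joins every worker before reading, no lock is needed here.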

The main multithreaded-verification routine is as follows:

# Multithreaded verification
threads = []
for i in range(len(proxys)):
    thread = threading.Thread(target=test, args=[i])
    threads.append(thread)
    thread.start()
# Block the main thread until all child threads finish
for thread in threads:
    thread.join()

At first I passed the tuple parameter as args=(i), which produced the error 'test() argument after * must be an iterable, not int'; changing the parentheses to square brackets made it work. At the time I was puzzled by the reason and hoped a knowledgeable reader could explain. Part of the program's output is as follows:
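For the record, the likely explanation is a quirk of Python syntax rather than anything in threading: (i) is not a one-element tuple, it is just i wrapped in grouping parentheses, so args=(i) passes a bare int where Thread expects an iterable. A one-element tuple needs a trailing comma, args=(i,); a one-element list args=[i] is also iterable, which is why the square brackets worked:

```python
i = 7
print(type((i)))    # <class 'int'>: parentheses alone are only grouping
print(type((i,)))   # <class 'tuple'>: the trailing comma makes the tuple
print(type([i]))    # <class 'list'>: also iterable, hence args=[i] works
```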

In my tests, multithreaded verification is several times faster than single-threaded verification, so later, when you need to crawl a large volume of pages, you can use this program to collect some working proxy IPs and thereby solve the problem of your IP being blocked. The complete Python 3 code to crawl proxy IPs and quickly verify them with multithreading is as follows:

import urllib.request
import re
import time
import random
import socket
import threading

# Crawl proxy IPs
ip_totle = []
for page in range(2, 6):
    url = 'http://ip84.com/dlgn/' + str(page)
    # url = 'http://www.xicidaili.com/nn/' + str(page)  # West Thorn proxy
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print('Get page', page)
    pattern = re.compile(r'<td>(\d.*?)</td>')  # capture cells between <td> and </td> that start with a digit
    ip_page = re.findall(pattern, str(content))
    ip_totle.extend(ip_page)
    time.sleep(random.choice(range(1, 3)))

# Print the crawled content
print('Proxy IP address', '\t', 'Port', '\t', 'Speed', '\t', 'Validation time')
for i in range(0, len(ip_totle), 4):
    print(ip_totle[i], '\t', ip_totle[i+1], '\t', ip_totle[i+2], '\t', ip_totle[i+3])

# Assemble the proxies in the format ProxyHandler expects
proxys = []
for i in range(0, len(ip_totle), 4):
    proxy_host = ip_totle[i] + ':' + ip_totle[i+1]
    proxy_temp = {"http": proxy_host}
    proxys.append(proxy_temp)

proxy_ip = open('proxy_ip.txt', 'w')  # new file that stores the valid IPs
lock = threading.Lock()               # create a lock

# Method to verify the validity of a proxy IP
def test(i):
    socket.setdefaulttimeout(5)  # set the global timeout
    url = "http://quote.stockstar.com/stock"  # the page we intend to crawl
    try:
        proxy_support = urllib.request.ProxyHandler(proxys[i])
        opener = urllib.request.build_opener(proxy_support)
        opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64)")]
        urllib.request.install_opener(opener)
        res = urllib.request.urlopen(url).read()
        lock.acquire()  # acquire the lock
        print(proxys[i], 'is OK')
        proxy_ip.write('%s\n' % str(proxys[i]))  # record the working proxy
        lock.release()  # release the lock
    except Exception as e:
        lock.acquire()
        print(proxys[i], e)
        lock.release()

# Single-threaded verification
'''
for i in range(len(proxys)):
    test(i)
'''

# Multithreaded verification
threads = []
for i in range(len(proxys)):
    thread = threading.Thread(target=test, args=[i])
    threads.append(thread)
    thread.start()
# Block the main thread until all child threads finish
for thread in threads:
    thread.join()

proxy_ip.close()  # close the file
