Original address: http://www.cnblogs.com/likeli/p/5719230.html
Objective
This article does not provide anything like a "ladder" (a firewall-circumvention tool). I only discuss the IP blocking that crawlers run into and how to use Tor to fight such bans, purely as a technical research discussion.
Scenario
When we write web crawlers to collect data across the whole web, some sites will always deliberately protect their content against crawling. The common approach is human-machine recognition through some form of verification, which means adding or reinforcing defenses at the login (or query) entry point. What do these defenses look like? The ones I have seen so far are: various CAPTCHAs, parameter encryption, traps dug in the front-end JS, access frequency limits (IP blacklists), and so on.
In fact, in some situations there are ways around the first few of these. Let me go through them one by one:
1, encrypted parameters. Veterans know that encrypting parameters on the client is useless. A crawler can simply load the front-end JS into a browser engine and execute it there, so no matter how you encrypt, it does not help: this is no different from running in a real browser, so it cannot tell humans from machines.
2, traps dug in the front-end JS. This one is somewhat clever. After all, the site being crawled is the rule-maker of this game; it can set its own rules, and as long as there are no loopholes, the crawler can only clear the hurdles the other side has laid out, one by one.
In this case, the site developers hide a small detection script as a landmine inside a large heap of JS. In general, a crawler requests the page directly, gets a piece of HTML text as the response, and does not execute the JS. That is what makes the distinction possible: after the page loads, the site runs a piece of JS that executes silently, without contacting the server on its own. If this JS executes, the visitor is probably a person; if it does not, the visitor is probably a crawler. The script adds a specific marker to the cookie that accompanies later requests, which tells the server whether a request was initiated by a person. When the server concludes that it was not, it flags the IP but still lets the request through (to hide the basis of the judgment). If the same IP comes back a few more times, it is pulled into the blacklist.
3, CAPTCHAs. This is the main means of defense and there is not much to say here; I also have a blog post on the subject. However, as long as the attacker's technical ability is sufficient, a CAPTCHA will be broken. Case in point: the 12306 CAPTCHA defense has proved useless as well.
4, IP blacklists. This is a back-end defense strategy that builds on the ones above. In some cases, though, it is truly effective and has no direct solution.
For example: a query-type site can block or throttle crawlers completely by limiting the number and frequency of requests per IP, because the whole point of a crawler is to collect data automatically and efficiently.
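To make the defender's side of items 2 and 4 concrete, here is a minimal sketch of how such a back end might work. It is only my own illustration, not any real site's implementation: the class name CrawlerGuard, its thresholds, and the has_js_cookie flag are all hypothetical. A request counts as suspicious when the marker set by the silent JS is missing or when the IP exceeds a frequency limit; suspicious requests are still served, and the IP is only blacklisted after a few strikes.

```python
import time
from collections import defaultdict, deque


class CrawlerGuard:
    """Hypothetical server-side guard: JS-cookie check + per-IP rate limit."""

    def __init__(self, max_requests=5, window_seconds=10.0, max_strikes=3):
        self.max_requests = max_requests      # allowed hits per window (assumed value)
        self.window_seconds = window_seconds  # sliding window length (assumed value)
        self.max_strikes = max_strikes        # strikes before blacklisting (assumed value)
        self._hits = defaultdict(deque)       # ip -> timestamps of recent requests
        self._strikes = defaultdict(int)      # ip -> number of suspicious requests
        self.blacklist = set()

    def allow(self, ip, has_js_cookie, now=None):
        """Return True if the request should be served."""
        if ip in self.blacklist:
            return False
        now = time.monotonic() if now is None else now

        # Sliding window: record this hit and drop hits older than the window.
        hits = self._hits[ip]
        hits.append(now)
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()

        suspicious = (not has_js_cookie) or len(hits) > self.max_requests
        if suspicious:
            self._strikes[ip] += 1
            if self._strikes[ip] >= self.max_strikes:
                self.blacklist.add(ip)
        # The current request is still served, which hides the basis of the judgment.
        return True
```

With the default thresholds, a client that never sends the JS marker is served three times and refused from the fourth request on, which is exactly the behaviour a crawler author sees as "suddenly blocked".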
Ways to break through an IP blacklist
Against a site that uses a blacklist, the strategy available to us is proxies: we obtain a large number of proxy IPs by whatever means we can, then send requests through them; when one IP gets blocked, we move on to the next.
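The "blocked, next" strategy can be sketched roughly as follows. This is a minimal sketch under my own assumptions: the function names fetch_via_proxy and fetch_with_rotation and the exception AllProxiesBlocked are hypothetical, and a proxy is treated as blocked whenever the request raises an OSError (which covers urllib's URLError and HTTPError as well as timeouts).

```python
import urllib.request


class AllProxiesBlocked(Exception):
    """Raised when every proxy in the list failed (hypothetical helper)."""


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through one HTTP proxy given as 'host:port'."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_rotation(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in turn; when one is blocked, move on to the next.

    Returns (proxy_used, body). The fetch function is injectable so the
    rotation logic can be exercised without touching the network.
    """
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except OSError:
            continue  # blocked, refused, or timed out: try the next proxy
    raise AllProxiesBlocked(url)
```

A real crawler would probably also remember which proxies failed and retire them, but the loop above is the core of the idea.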
The Tor network, our topic, is also used in this way.
First, some background:
About the Tor network
Official website: https://www.torproject.org/
What Tor is
Tor is one of the most powerful tools on the Internet for protecting your privacy, but even today many people think of Tor as an end-to-end encryption tool. In fact, Tor is used to browse the Web and send mail anonymously (it does not encrypt the mail content itself). Today we're going to talk about how Tor works, what it does, what it doesn't do, and how to use it correctly.
How Tor works
When you send a message through Tor, Tor uses an encryption technique called "onion routing" to deliver it along a randomly generated path through the network. It's a bit like sealing a letter inside a stack of nested envelopes. Each node in the network decrypts the message (opens the outermost envelope) and then sends the still-encrypted content inside (the sealed inner envelope) on to the next address. As a result, no single node can see the entire contents of the letter, and the path the message took is difficult to trace.
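The nested-envelope idea can be shown with a toy model. To be clear, this is not how Tor is actually implemented: the XOR "cipher" below is deliberately simplistic and insecure, and the names wrap, peel, and the three node keys are made up for illustration. It only demonstrates that each node can remove exactly one layer and learns nothing but the next hop.

```python
import hashlib
import json


def _keystream_xor(key, data):
    """Toy cipher: XOR data with a SHA-256 based keystream. NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))


def wrap(message, destination, path):
    """Seal message in one envelope per node.

    path is a list of (node_name, key) pairs ordered from entry to exit.
    The innermost envelope is built first, so the result is the blob that
    is handed to the entry node.
    """
    next_hop, payload = destination, message.encode()
    for name, key in reversed(path):
        envelope = json.dumps({"next": next_hop, "data": payload.hex()}).encode()
        payload = _keystream_xor(key, envelope)
        next_hop = name
    return payload


def peel(key, blob):
    """Open one envelope: return (next_hop, inner_blob)."""
    envelope = json.loads(_keystream_xor(key, blob))
    return envelope["next"], bytes.fromhex(envelope["data"])


# Usage: three hypothetical nodes, each holding only its own key.
path = [("entry", b"key-1"), ("middle", b"key-2"), ("exit", b"key-3")]
blob = wrap("hello", "example.com:80", path)
for _, node_key in path:
    hop, blob = peel(node_key, blob)
    # entry learns only "middle", middle learns only "exit",
    # exit learns the destination and the innermost payload.
```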
Using Tor on Windows
Installing Tor on Windows is simple: just download and install the Tor Browser from Tor's website.
Of course, we can also install only the Tor core, without any other dependencies, and then use a Tor controller to operate Tor.
I have used two Tor controllers: Vidalia on Windows and arm (anonymizing relay monitor) on OS X. arm is developed in Python and can do most of what Vidalia does.
Vidalia on Windows:
Arm on Mac:
At the moment I'm using arm on a Mac, so arm is what I will talk about here. I could not get Tor under Windows to be controlled from C# or Python, so I had to switch to an OS X/Linux environment to control Tor.
Installing Tor and arm
First we have to download and install them. The good news is that most package managers carry both arm and Tor, so we can download them directly. When installed through a package manager, they are set up automatically and the initial configuration is done for us.
For example, this is how I installed them on a Mac:
brew install tor
brew install arm
We also need to install Privoxy, which is used to convert the SOCKS5 proxy into an HTTP proxy.
brew install privoxy
Finally, we need an upstream (front) proxy, because the Tor network cannot be reached directly from inside the country. So we need an upstream proxy located abroad; I had already set up a VPN in Canada, which can be used directly.
Configuring Tor and arm
We need to do some configuration. First, here is a picture of my configuration:
This is the finished configuration as seen from arm. The text in green is what I added to the torrc configuration file.
The configuration file to modify is at /Users/likeli/.arm/torrc (the path on a Mac). Make the modifications shown above there.
A description of the important parameters:
Parameter | Description
ControlPort | The port the control program connects to (important)
Socks5Proxy | Address and port of the upstream (front) proxy
SocksPort | The port external programs use to access Tor
MaxCircuitDirtiness | Time interval for automatically switching identity
Besides these parameters there are many other optional ones; for details, please consult the Tor help documentation, which is also where I found the configuration above.
man tor
OK, the configuration is complete. Now exit and start arm to finish the initialization.
I ran arm in the terminal. The Mac's built-in tools apparently cannot draw circles or ticks on a screenshot, so here are a few borrowed pictures instead; just choose the options as they show.
ARM Configuration Source Address: https://program-think.blogspot.com/2015/03/Tor-Arm.html?utm_source=tuicool&utm_medium=referral
With the configuration ready, start it up. After a successful startup it looks like this:
The boot log at the bottom shows that the startup progress is 100%.
At this point the proxy is in fact already working, so let's test it.
OK, done. The Tor SOCKS5 proxy is now connected, and we can connect to it directly at 127.0.0.1:9000. The port here is the one set in the configuration above.
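If you prefer to check the SOCKS5 port from a script instead of a GUI, a minimal sketch could look like this. It only performs the SOCKS5 greeting defined in RFC 1928 (version 5, one method, "no authentication") and checks that the server accepts it; it does not send any traffic through Tor. The function name socks5_no_auth_ok is my own, and 127.0.0.1:9000 is the port from the configuration in this article.

```python
import socket


def socks5_no_auth_ok(host="127.0.0.1", port=9000, timeout=3.0):
    """Return True if host:port answers a SOCKS5 greeting with 'no auth accepted'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # 0x05 = SOCKS version 5, 0x01 = one method offered, 0x00 = no authentication
            sock.sendall(b"\x05\x01\x00")
            reply = sock.recv(2)
    except OSError:
        # Connection refused, timed out, or reset: the proxy is not usable.
        return False
    # Expected reply: version 5, selected method 0x00
    return reply == b"\x05\x00"


# Usage (requires Tor to be running with the SOCKS port from this article):
# print(socks5_no_auth_ok("127.0.0.1", 9000))
```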
Finishing up
The proxy works, but there is still a problem: in general we use HTTP proxies. So we need to convert the SOCKS5 proxy into an HTTP proxy to make it easier for our applications to use.
The tool used here is Privoxy (already installed through the package manager in the steps above).
Its configuration needs one small modification.
After installing Privoxy, open its configuration file:
Once it is open, search for 127.0.0.1:9050.
After you find it, add the following on a separate line (the trailing dot is part of the directive): forward-socks5 / 127.0.0.1:9000 .
When the configuration is complete, save and close the file, then try connecting to local port 8118, that is, 127.0.0.1:8118.
If you cannot connect, restart the service or restart your computer.
Port 8118 can also be changed. To do so, search the Privoxy configuration file for 127.0.0.1:8118 and change the port number there.
With that, all the configuration is complete. Any program that sets 127.0.0.1:8118 as its proxy can evade a site's IP restrictions.
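For a Python 3 program, pointing requests at the Privoxy port might look like the sketch below. It uses only the standard library; the helper names build_proxy_opener and current_ip are my own, and the IP-echo URL is simply the one used in the sample code at the end of this article, so swap in whatever service you trust.

```python
import urllib.request


def build_proxy_opener(proxy="127.0.0.1:8118"):
    """Build an opener that sends HTTP and HTTPS requests through Privoxy."""
    handler = urllib.request.ProxyHandler(
        {"http": "http://" + proxy, "https": "http://" + proxy}
    )
    return urllib.request.build_opener(handler)


def current_ip(opener, url="http://www.ifconfig.me/ip", timeout=30):
    """Ask an IP-echo service which address the request appears to come from."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode().strip()


# Usage (requires Tor and Privoxy to be running as configured above):
# opener = build_proxy_opener()
# print(current_ip(opener))
```

If everything is wired up correctly, the address printed should be that of a Tor exit node rather than your own.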
Finally
To test the proxy above, you can install the SwitchySharp extension in Chrome and configure it accordingly.
OK, feel free to enjoy the thrill of unlimited IPs. From now on, IP blacklists (IP blocking) are just for show ~
Finally, here is sample code in which Python controls Tor to switch IPs. Source: https://stackoverflow.com/questions/9887505/how-to-change-tor-identity-in-python
# Python 2 (urllib2 and TorCtl)
import urllib2
from TorCtl import TorCtl

# Send HTTP requests through Privoxy, which forwards them to Tor
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

def newId():
    # Ask Tor for a new identity through the control port
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="your_password")
    conn.send_signal("NEWNYM")

for i in range(0, 10):
    print "case " + str(i + 1)
    newId()
    # Print the IP address the outside world sees for this request
    print urllib2.urlopen("http://www.ifconfig.me/ip").read()
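The sample above is Python 2 and depends on TorCtl. For Python 3, the same "new identity" request can be sketched with nothing but a socket, because the Tor control protocol is line-based: AUTHENTICATE with the control password, then SIGNAL NEWNYM, each answered with a line starting with 250 on success. The function name request_new_identity is my own, and port 9051 and the password are the placeholders from the sample above; this is a sketch, not a replacement for a full controller library.

```python
import socket


def request_new_identity(password, host="127.0.0.1", port=9051, timeout=10.0):
    """Send AUTHENTICATE and SIGNAL NEWNYM to Tor's control port.

    Returns True if Tor answered both commands with a 250 status line.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock, \
            sock.makefile("rb") as reader:

        def command(line):
            # Control-protocol commands and replies are CRLF-terminated lines.
            sock.sendall(line.encode("ascii") + b"\r\n")
            return reader.readline().decode("ascii").strip()

        # Escape backslashes and quotes inside the quoted password string.
        escaped = password.replace("\\", "\\\\").replace('"', '\\"')
        if not command('AUTHENTICATE "%s"' % escaped).startswith("250"):
            return False
        ok = command("SIGNAL NEWNYM").startswith("250")
        command("QUIT")
        return ok


# Usage (requires Tor running with ControlPort 9051 and a control password):
# if request_new_identity("your_password"):
#     print("asked Tor for a new identity")
```

Note that Tor may rate-limit NEWNYM, so calling this in a tight loop does not guarantee a different exit IP on every request.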
Using the Tor network to break through IP blocking: a good partner for crawlers (getting-started manual)