6 Common Methods for Python Spiders to Break Through Bans

Source: Internet
Author: User
Tags: anonymous, HTML form, time interval

Automated data collection from the Internet (crawling) has existed almost as long as the Internet itself. Today the practice is more commonly called "web data acquisition", and the programs that do it are sometimes called web robots (bots). The most common approach is to write an automated program that requests data from a web server (usually HTML pages or other web files) and then parses the data to extract the required information.

This article assumes the reader already knows how to fetch a remote URL with code, how forms are submitted, and how JavaScript runs in the browser. To learn more about the basics of web data collection, see the references at the end of the article.

When collecting from websites, few things are more frustrating than seeing data displayed in the browser that your program cannot extract. Perhaps a form you submitted, which you thought was filled out perfectly, is rejected by the server; perhaps your IP address is blocked outright by the site for some reason and you can no longer visit it.

The cause may be one of those subtle bugs that are hard to track down, or it may be entirely unexpected (a program that works on one website fails on another that looks exactly the same). The most likely scenario, however, is that the other side deliberately refuses to let crawlers collect its information: the site has identified you as a web robot and rejected you outright, and you cannot figure out why.

Next, we will introduce some of the "black magic" of web scraping (HTTP headers, CSS, HTML forms, and so on) used to overcome websites' measures against automated collection. But first, let's talk about ethics.

Morality and etiquette of web crawlers

To tell the truth, from a moral standpoint writing what follows was not easy. My own sites have been harassed many times by web robots, spammers, web crawlers, and other unwelcome virtual visitors, and yours may have been too. In that case, why introduce even more capable web robots? There are several very important reasons.

White-hat work. There are perfectly ethical and legal reasons to collect from some sites that do not want to be collected. For example, in a previous job I wrote web crawlers: I built an automated information collector that gathered clients' names, addresses, phone numbers, and other personal information from websites that published it without permission, and then submitted the collected information to those websites with requests to delete the customer data. To avoid competition, those sites defended themselves strictly against crawlers. But my job was to protect the anonymity of the company's clients (people who were victims of domestic violence, or who had other reasons to keep a low profile), which made a perfectly reasonable case for web data collection, and I was glad to be able to do it.
Although it is unlikely that anyone can build a completely "crawler-proof" site (at the very least, legitimate users still need easy access), I hope the material below will also help people protect their own sites from malicious attacks. Each of the data collection techniques described here comes with a note on its drawbacks, which you can use to defend your own site. In fact, most web robots start out capable of nothing more than broad information gathering and vulnerability scanning, and a few simple countermeasures will block 99% of them. However, they evolve very quickly, so it is best to stay prepared for new kinds of attacks at all times.
Like most programmers, I have never believed that banning the spread of a certain kind of information makes the world more harmonious.

Before you read on, keep in mind that many of the programs and techniques presented here should not be used against websites indiscriminately.

Crawler black magic: ways to make a web robot look like a human user

The premise of any anti-scraping measure is being able to tell human visitors and web robots apart. Although a site can use many identification techniques (such as CAPTCHAs) to keep crawlers out, there are also some very simple measures that make your web robot look much more like a human visitor.

1. Construct a reasonable HTTP request header

Besides handling web forms, the requests module is a powerful tool for setting request headers. HTTP request headers are a set of attributes and configuration values passed every time you send a request to a web server. HTTP defines more than a dozen request header types, but most of them are rarely used. Only the following seven fields are used by most browsers when initiating any network request (the values in the table are from my own browser).

When the standard urllib library is used, a classic Python crawler sends request headers like the following:
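As a rough sketch (the exact values depend on your Python version), you can inspect urllib's defaults yourself:

```python
# A minimal sketch: inspect the default headers a plain urllib opener attaches
# to every request (the exact version number varies with your Python install).
import urllib.request

opener = urllib.request.build_opener()
print(opener.addheaders)
# Typically something like: [('User-agent', 'Python-urllib/3.x')]
# Compared with a real browser, headers such as Accept and Accept-Language
# are missing entirely, which makes the request easy to flag as a bot.
```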

If you were a webmaster trying to keep crawlers out, which of those request headers would you let visit your site?

You can find download links and installation instructions for requests on the module's website (http://docs.python-requests.org/en/latest/user/install/), or install it with any third-party Python package installer.

Request headers can be customized with the requests module. The https://www.whatismybrowser.com/ website is a great place to let a server test your browser's reported properties. We use the following program to collect information from that site and verify our browser's request-header settings:
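A minimal sketch along these lines might look like the following (not the author's original listing; the User-Agent string and the exact test URL on whatismybrowser.com are assumptions for illustration):

```python
# Sketch: send browser-like request headers with the requests module and
# check what actually went out on the wire. URL and UA string are examples.
import requests

session = requests.Session()
headers = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/39.0.2171.95 Safari/537.36"),
    "Accept": ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/webp,*/*;q=0.8"),
}
# Hypothetical header-echo page on whatismybrowser.com.
url = "https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending"
req = session.get(url, headers=headers)

# Headers actually sent by requests; the page body should echo the same values.
print(req.request.headers)
```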

The request header in the output result of the program should be the same as the headers set in the program.

Although websites may check every attribute of the HTTP request header for "human-ness", I find that the parameter that really matters is User-Agent. Whatever project you are working on, remember to set the User-Agent attribute to something that will not arouse suspicion, and never leave it as Python-urllib/3.4. Also, if you are dealing with a highly vigilant website, pay attention to headers that are often sent but rarely checked, such as Accept-Language; it may be exactly the attribute that convinces the site you are a human visitor.

Request headers can change the way you see the web.
Suppose you want to write a machine translator for a machine-learning research project but don't have much translated text to test it with. Many large websites offer different language translations of the same content and respond with a particular language version based on the request headers. So simply changing Accept-Language: en-US to Accept-Language: fr in the request header can get "Bonjour" (French for hello) out of the website, and such parallel data can be used to improve the translation model (large multinational companies are usually good collection targets).
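As a small sketch of the idea (the URL is just a placeholder for a multilingual site):

```python
# Sketch: ask for the French version of a page simply by changing the
# Accept-Language request header. The URL is a placeholder.
import requests

url = "http://www.example.com/"
english = requests.get(url, headers={"Accept-Language": "en-US"})
french = requests.get(url, headers={"Accept-Language": "fr"})
# On sites that honour Accept-Language, the two responses will differ.
print(english.headers.get("Content-Language"), french.headers.get("Content-Language"))
```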
Request headers can also make a site change the layout of its content. For example, when you browse a website on a mobile device, you usually see a simplified version without ads, Flash, and other distractions. So if you change your request's User-Agent to the one below, you may find the site much easier to collect:
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257

2. Learn to set cookies

Although cookies are a double-edged sword, handling them correctly can spare you many collection problems. Websites use cookies to track your visits, and if anomalous crawler behaviour is detected, such as filling out forms extremely quickly or browsing a huge number of pages, your visit will be interrupted. Although such behaviour can be disguised by disconnecting and reconnecting, or by changing your IP address, none of that effort helps if the cookie gives your identity away.

Cookies are essential when collecting some sites. To stay logged in on a website, you need to carry a cookie across multiple pages. Some sites do not even require a fresh cookie on every login; holding on to an old "logged in" cookie is enough to visit them.
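A minimal sketch of carrying a "logged in" cookie across pages with requests.Session (the login URL and form field names are placeholders, not a real site):

```python
# Sketch: a Session stores the cookies set at login and sends them
# automatically with every later request. URLs and fields are placeholders.
import requests

session = requests.Session()
session.post("http://www.example.com/login",
             data={"username": "user", "password": "password"})
profile = session.get("http://www.example.com/profile")
print(session.cookies.get_dict())
```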

If you are collecting one or a few specific target sites, it is worth examining the cookies those sites generate and thinking about which of them your crawler needs to handle. Several browser plug-ins can show you how cookies are set as you visit and move around a site. EditThisCookie (http://www.editthiscookie.com/) is one of my favourite Chrome extensions.

Because the requests module cannot execute JavaScript, it cannot handle cookies produced by many modern tracking tools, such as Google Analytics, which set cookies only when client-side scripts run (or generate them from page events, such as button clicks, while the user browses the page). To handle these, you need the Selenium and PhantomJS packages.

Selenium and PhantomJS
Selenium (http://www.seleniumhq.org/) is a powerful web data collection tool originally developed for automated website testing. In recent years it has also been widely used to take accurate snapshots of websites, because it runs directly in a browser. Selenium can make a browser automatically load pages, fetch the data you need, take screenshots, or verify that certain actions happen on the site.
Selenium does not include a browser of its own; it has to be used together with a third-party browser. If you run Selenium with Firefox, for example, you can watch a Firefox window open, navigate to the site, and carry out the actions you set in your code. Although this is easier to observe, I prefer to let my programs run in the background, so I use PhantomJS (http://phantomjs.org/download.html) instead of a real browser.
PhantomJS is a "headless" browser. It loads a website into memory and executes the JavaScript on the page, but never shows the user a graphical interface. Combining Selenium and PhantomJS gives you a very powerful web crawler that can handle cookies, JavaScript, headers, and anything else you need.
You can download the Selenium library from the PyPI website (https://pypi.python.org/simple/selenium/), or install it from the command line with a third-party package manager (such as pip).

You can call the webdriver's get_cookies() method on any website (http://pythonscraping.com in this example) to view its cookies:
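A minimal sketch of such a call, assuming an older Selenium release that still ships the PhantomJS driver (the executable path is a placeholder; a headless Chrome or Firefox driver works the same way):

```python
# Sketch: print the cookies a site sets, using Selenium with PhantomJS.
# The executable path is a placeholder -- adjust it to your installation.
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path="/path/to/phantomjs")
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print(driver.get_cookies())
driver.quit()
```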

This allows you to get a very typical list of Google Analytics cookies:

You can also call the delete_cookie(), add_cookie(), and delete_all_cookies() methods to manipulate cookies. You can even save cookies for other web crawlers to use. The following example shows how to combine these functions:
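One way such an example might look (not the author's original listing; it again assumes an older Selenium with the PhantomJS driver at a placeholder path):

```python
# Sketch: save the cookies from one browser session and replay them in a
# second one, so the two sessions look identical to the server.
from selenium import webdriver

phantom_path = "/path/to/phantomjs"  # placeholder

driver = webdriver.PhantomJS(executable_path=phantom_path)
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
saved_cookies = driver.get_cookies()
print(saved_cookies)

driver2 = webdriver.PhantomJS(executable_path=phantom_path)
driver2.get("http://pythonscraping.com")  # load the site so the cookies have a domain
driver2.delete_all_cookies()
for cookie in saved_cookies:
    driver2.add_cookie(cookie)

driver2.get("http://pythonscraping.com")
driver2.implicitly_wait(1)
print(driver2.get_cookies())  # should match the first set
driver.quit()
driver2.quit()
```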

In this example, the first webdriver fetches a website, prints its cookies, and saves them in the variable saved_cookies. The second webdriver loads the same website (technical note: the website must be loaded first so that Selenium knows which site the cookies belong to, even though loading it does nothing useful for us), deletes all of its cookies, and replaces them with the cookies obtained by the first webdriver. When the page is loaded again, the timestamps, sources, and other details of the two sets of cookies should be exactly the same. From Google Analytics' point of view, the second webdriver is now indistinguishable from the first.

3. Keep a normal pace of access

There are a number of well-protected websites that may prevent you from submitting forms or interacting with the site too quickly. Even without such safeguards, downloading a large amount of information from a site much faster than an ordinary person could is asking to be blocked.

So, although a multithreaded program might be a great way to load pages quickly, processing data in one thread while loading pages in another, it is a terrible strategy for a well-behaved crawler. You should instead try to keep page loads and data requests to a minimum. If conditions allow, add a small time interval between page accesses, even if it is only one line of code:

time.sleep(3)

(Editor's note: wouldn't 3 plus a random number be even better?)
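A tiny sketch of the randomized variant the editor suggests:

```python
# Sketch: pause between 3 and 5 seconds instead of a fixed 3 seconds.
import random
import time

time.sleep(3 + random.random() * 2)
```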

Controlling your speed sensibly is a rule you should not break. Excessive consumption of someone else's server resources can put you in an illegal position and, more seriously, may drag a small website down or take it offline entirely. Dragging a site down is unethical and simply wrong, so please keep your collection speed under control.

Decrypting common form-based anti-crawler security measures

Litmus-style tests have been used for many years, and are still used today, to distinguish web crawlers from human visitors using browsers, with varying degrees of success. While it is no big deal for a web robot to download a few public articles and blog posts, it is a serious problem if web robots create thousands of accounts on your site and start spamming all of your users. Web forms, especially those used for account creation and login, pose a serious threat to a site's security and traffic costs if robots can abuse them at will, so restricting access to them is in the best interests of many site owners (or at least they think so).

These anti-robot security measures, concentrated around forms and login pages, are indeed a serious challenge for web crawlers.

4. Watch out for hidden input field values

In an HTML form, a "hidden" field keeps a value visible to the browser but invisible to the user (unless the user reads the page's source code). As more and more websites have moved to storing state variables in cookies to manage user state, hidden fields have found another primary use: preventing crawlers from submitting forms automatically.

The figure below shows an example of hidden fields on the Facebook login page. Although the form has only three visible fields (username, password, and a confirm button), in the source code the form sends a great deal of additional information to the server.

Hidden fields on the Facebook login page

Hidden fields are used to block web data collection in two main ways. The first is that a field on the form page may be filled with a random value generated by the server. If that value is not present when the form reaches the processing page, the server can reasonably conclude that the submission did not come from the original form page but was posted directly to the processing page by a web robot. The best way around this is to first collect the random value from the page where the form lives, and then submit it along with everything else to the processing page.
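A hedged sketch of that workaround (the URLs and the token field name are placeholders for illustration): fetch the form page first, read the server-generated hidden value, then include it in the POST.

```python
# Sketch: read a server-generated hidden form value before submitting.
# URLs and the "csrf_token" field name are placeholders, not a real site.
import requests
from bs4 import BeautifulSoup

session = requests.Session()
form_page = session.get("http://www.example.com/form")
soup = BeautifulSoup(form_page.text, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]

response = session.post("http://www.example.com/process",
                        data={"name": "example", "csrf_token": token})
print(response.status_code)
```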

The second way is the "honeypot". If a form contains a hidden field with an innocuous-looking name (that is, a honeypot trap), such as "username" or "email address", a carelessly written web robot will often fill in the field and submit it regardless of whether it is visible to the user, walking straight into the server's honeypot. The server ignores the actual values of all hidden fields (or checks whether they differ from the defaults on the form page), and a visitor who fills in a hidden field is likely to be blocked by the site.

In short, it is sometimes necessary to inspect the page the form lives on and check whether the server has planted any hidden fields (honeypot traps) that you might miss or fill in by mistake. If you see hidden fields containing large random string values, the web server is very likely to check them when the form is submitted. There may also be further checks to make sure the form variable has been used only once or was generated recently (which prevents it from simply being stored in a program and reused).

5. How crawlers usually avoid honeypots

Although it is easy to tell useful information from useless information with CSS attributes during data collection (for example, by reading id and class tags), this can occasionally cause problems. If a field on a web form is made invisible to the user with CSS, it is reasonable for the site to assume that an ordinary visitor cannot fill it in, since it never appears in the browser. If the field is filled in anyway, the submitter is probably a robot and the submission will fail.

This approach works not only on a site's forms but also on links, images, files, and any other content that a robot can read but an ordinary user cannot see in a browser. If a visitor touches a piece of "hidden" content on the site, a server-side script may be triggered to block the visitor's IP address, log the visitor out of the site, or take other measures to deny further access. In fact, many business models are built on exactly this.

The following example uses the page at http://pythonscraping.com/pages/itsatrap.html. The page contains two links, one hidden with CSS and one visible. It also contains two hidden form fields:

These three elements are hidden from the user in three different ways:

  • the first link is hidden with a simple CSS display:none setting;
  • the phone number field, name="phone", is a hidden input field;
  • the email address field, name="email", is hidden by moving the element 50,000 pixels to the right (presumably beyond the edge of any monitor) and hiding the scroll bar.

Because Selenium actually renders the pages it visits, it can distinguish between visible and hidden elements on a page. You can determine whether an element is visible on the screen with is_displayed().

For example, the following code retrieves the page above and looks for the hidden links and hidden input fields:
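A sketch of what such code might look like (not the author's original listing; again assuming an older Selenium with the PhantomJS driver at a placeholder path):

```python
# Sketch: use is_displayed() to tell visible elements from hidden ones on the
# demo page. Hidden links and fields are the traps to avoid.
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path="/path/to/phantomjs")
driver.get("http://pythonscraping.com/pages/itsatrap.html")

for link in driver.find_elements_by_tag_name("a"):
    if not link.is_displayed():
        print("The link {} is a trap".format(link.get_attribute("href")))

for field in driver.find_elements_by_tag_name("input"):
    if not field.is_displayed():
        print("Do not fill in the field {}".format(field.get_attribute("name")))

driver.quit()
```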

Selenium picks out each hidden link and field, with output like the following:

Although you are unlikely to actually visit the hidden links you find, remember before submitting a form to confirm the values of any hidden fields that are already in it and about to be sent (or let Selenium submit the form for you automatically).

Use a remote server to avoid IP blocking

People who turn to remote platforms usually have two goals: the need for greater computing power and flexibility, and the need for changeable IP addresses.

6. Use a variable remote IP address

The first rule of building web crawlers is: everything can be forged. You can send mail from a non-personal address, automate mouse movements from the command line, or frighten a site administrator by making your web traffic look as though it comes from IE 5.0.

But there is one thing that cannot be faked: your IP address. Anyone can write to you at "The President, 1600 Pennsylvania Avenue Northwest, Washington, D.C., 20500." But if that letter is postmarked in Albuquerque, New Mexico, you can be fairly sure it was not the President of the United States who wrote it.

Technically, an IP address can be disguised in outgoing packets, which is the basis of distributed denial-of-service (DDoS) attacks: the attacker does not care about receiving a response, so the request can be sent from a fake IP address. Web data collection, however, is an activity that must care about the server's response, so for our purposes IP addresses cannot be faked.

The main focus of preventing a site from being scraped is distinguishing human behaviour from robot behaviour. Banning IP addresses is overkill, rather like a farmer who, instead of spraying pesticide to kill the insects in a field, simply burns the whole field down. It is a last resort, but a very effective one, as long as you are willing to ignore every packet sent from the blocked addresses. However, this approach has several problems. First, IP address block lists are difficult to maintain. Although most large websites use their own programs to manage the lists automatically (robots killing robots), somebody at least needs to check the list occasionally, or monitor how the problem is growing.
Second, because the server must check every incoming packet against the IP block list, every packet costs extra processing time. Many blocked IP addresses multiplied by a large volume of packets makes that checking time grow dramatically. To reduce processing time and complexity, administrators usually group IP addresses and write rules such as "if there are a few dangerous actors in this range, block all 256 addresses in the interval." Which leads to the next problem.
Third, blocking an IP address can have unintended consequences. For example, when I was an undergraduate at Olin College of Engineering in Massachusetts, a classmate wrote a program that could vote for popular content on http://digg.com/ (everyone used Digg before Reddit took off). The server the program ran on was blocked by Digg, so the whole school lost access to the site. The classmate simply moved the program to another server, while Digg itself lost many of its core target users.

Despite these drawbacks, blocking IP addresses is still a very common tool that server administrators use to keep suspicious web crawlers off their servers.

Tor proxy servers

The Onion Router network, commonly abbreviated Tor, is a way of anonymizing IP addresses. Built from a network of volunteer servers, it is organized in multiple layers (like an onion) of different servers that wrap around the client at the innermost point. Data is encrypted before it enters the network, so no single server can steal the traffic. In addition, although each server's inbound and outbound traffic can be monitored, uncovering the true start and end points of a communication would require knowing the inbound and outbound traffic details of every server along the link, which is practically impossible.

> Limitations of Tor anonymity


While the purpose of using Tor in this article is only to change IP addresses, not to achieve complete anonymity, it is worth pausing on the strengths and weaknesses of Tor as an anonymizing approach.
Although the Tor network lets you visit a website from an IP address that cannot be traced back to yours, any information you leave on the site's server will give away your identity. For example, you log
