Why can't many websites be crawled? 6 common ways to break through anti-crawler blocks (reproduced)

Source: Internet
Author: User
Tags html form 403 forbidden error block ip address cpanel email account

Portal: http://www.cnblogs.com/junrong624/p/5533655.html

Automated data collection (crawling) has existed for almost as long as the internet itself. Today the practice is more commonly called "web data collection" or "web scraping", and the programs that do it are sometimes called web robots (bots). The most common approach is to write an automated program that requests data from a web server (usually an HTML page or other web file) and then parses that data to extract the information it needs.

This article assumes that the reader already knows how to crawl a remote URL with code, how form submission works, and how JavaScript runs in the browser. For more background on web data collection, see the references at the end of the article.

When collecting data from a website, few things are more frustrating than seeing data displayed in the browser that your crawler cannot reach. Perhaps a carefully prepared form submission is rejected by the server, or perhaps your own IP address is blocked outright for no apparent reason, so you can no longer access the site at all.

The cause may be a subtle bug, or it may be something unexpected (a program that works on one site fails on another that looks identical). The most likely scenario, however, is that the other side deliberately refuses to let crawlers collect its information: the site has identified you as a web robot and rejected you outright, and you cannot figure out why.

What follows introduces some of the "black magic" of web data collection (HTTP headers, CSS, HTML forms, and so on) used to get around sites that block automated collection. But first, a word about ethics.

The morals and etiquette of web crawlers

To be honest, it is not easy to write what follows from a moral standpoint. My own websites have been harassed many times by web bots, spam generators, web crawlers, and other unwelcome virtual visitors, and yours may have been too. Why, then, introduce even more powerful web robots? There are several important reasons.

    • White-hat work. There are perfectly ethical and legal reasons to collect data from websites that do not want to be collected. In a previous job as a web crawler developer, for example, I built an automated information collector that gathered clients' names, addresses, phone numbers, and other personal information from websites that had published it without permission, then submitted that information back to those sites with requests to delete it. To fend off competitors, the sites defended themselves against crawlers. Yet my job of keeping the company's clients anonymous (some were victims of domestic violence, others simply wanted to keep a low profile) was an entirely reasonable use of web data collection, and I was glad to be able to do it.

    • Although it is probably impossible to build a site that is completely crawler-proof (at least not while keeping it easy for legitimate users to reach), I hope the following material helps people protect their own sites from malicious attacks. Each web data collection technique below comes with notes on its weaknesses, which you can use to defend your own site. In practice, most web robots start out with nothing more than broad information gathering and vulnerability scanning, and a few simple techniques can stop 99% of them. They evolve quickly, however, so it is best to stay prepared for new attacks.

    • Like most programmers, I have never believed that banning the spread of a certain kind of information makes the world a more harmonious place.

Before reading on, keep in mind that many of the programs and techniques presented here should not be unleashed on every site you find.

Crawler black magic: making a web robot look like a human user

The premise of anti-collection measures is telling human visitors apart from web robots. Although websites can deploy many identification techniques (such as CAPTCHAs) to stop crawlers, there are also some fairly simple ways to make your web robot look more like a human user.

1. Construct a reasonable HTTP request header

Besides handling website forms, the requests module is a powerful tool for setting request headers. HTTP request headers are a set of attributes and configuration information passed every time you send a request to a web server. HTTP defines dozens of obscure header types, but most of them are rarely used. Only the following seven fields are used by most browsers to initialize every network request (the values in the table come from my own browser).
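The table itself appears only as an image in the original post; as an illustration (the values below come from a typical desktop Chrome browser at the time and are only an example, not a definitive list), the seven fields look something like this:

Host: www.google.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8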

When the urllib standard library is used, a classic Python crawler sends request headers that look like this:
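The corresponding code block is an image in the original; the default urllib headers boil down to essentially these two lines, which is why they stand out so badly:

Accept-Encoding: identity
User-Agent: Python-urllib/3.4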

If you were a webmaster guarding against crawlers, which of those requests would you allow to visit your site?

Installing requests

You can find the download link (http://docs.python-requests.org/en/latest/user/install/) and installation instructions on the module's website, or install it with any third-party Python module installer (such as pip).

Request headers can be customized with the requests module. The site https://www.whatismybrowser.com/ is a great way to let a server report the properties of your browser. We will use the following program to collect data from that site and verify our browser's cookie settings:
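The program itself is shown only as an image in the original post. A minimal sketch of the idea with requests and BeautifulSoup looks like the following; the exact test-page URL and the CSS class of the results table are assumptions based on the site at the time and may have changed:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}

# Page that echoes back the headers the server received (URL may have changed).
url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"
req = session.get(url, headers=headers)

# Print the table of headers as the server saw them.
soup = BeautifulSoup(req.text, "html.parser")
print(soup.find("table", {"class": "table-striped"}).get_text())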

The request headers in the program's output should match the headers set in the program.

Although websites may run "humanity" checks on every attribute of the HTTP request headers, I find that the parameter that usually really matters is User-Agent. Whatever you are doing, set the User-Agent to something that will not raise suspicion, and never leave it as python-urllib/3.4. Also, if you are dealing with a very vigilant site, pay attention to headers that are frequently sent but rarely checked, such as Accept-Language; it may be exactly what the site uses to decide that you are a human visitor.

Request headers change the way you see the online world

Suppose you want to build a language translator for a machine-learning project but lack large amounts of translated text to test it with. Many large sites serve the same content in different languages, returning a different language version depending on the request headers. Simply changing Accept-Language: en-US to Accept-Language: fr can get you "Bonjour" (French for "hello") from the site, and that text can be used to improve the translator (large multinational corporations are usually good collection targets).

Request headers can also make a site change the layout of its content. For example, when browsing a site on a mobile device you typically see a simplified version without ads, Flash, or other distractions. So change your request's User-Agent to the following and you may get a site that is much easier to collect!

User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53

2. Learn to set cookies

Although cookies are a double-edged sword, handling them correctly can spare you many collection headaches. Websites use cookies to track your visits, and if they detect crawler-like anomalies, such as filling in forms unusually quickly or browsing a large number of pages, your access may be interrupted. Such behavior can be disguised by disconnecting and reconnecting or by changing IP addresses, but if a cookie gives you away, those efforts are wasted.

Cookies are essential when collecting some websites. To stay logged in across a site, you need to carry a cookie from page to page. Some sites do not even require a fresh login every time; keeping an old "logged in" cookie around is enough to stay signed in.
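As a small illustration (the login URL and field names below are hypothetical), a requests.Session object carries the "logged in" cookie across requests for you:

import requests

session = requests.Session()

# Hypothetical login form; the URL and field names are placeholders for illustration.
session.post("http://example.com/login",
             data={"username": "myname", "password": "mypassword"})

# The session automatically re-sends the login cookie on later requests.
profile = session.get("http://example.com/profile")
print(session.cookies.get_dict())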

If you are collecting one or a few specific target sites, it is worth inspecting the cookies those sites generate and thinking about which ones your crawler needs to handle. Some browser plugins show you how cookies are set as you visit a site and move around it. EditThisCookie (http://www.editthiscookie.com/) is one of my favorite Chrome extensions.

Because the requests module cannot execute JavaScript, it cannot handle many of the cookies produced by modern tracking software such as Google Analytics, which sets its cookies only after client-side scripts run (or when the user triggers page events such as clicking a button). To handle those, you need the Selenium and PhantomJS packages.

Selenium and PhantomJS

Selenium (http://www.seleniumhq.org/) is a powerful web data collection tool originally developed for automated website testing. In recent years it has also been widely used to take accurate snapshots of websites, because it runs directly in a browser. Selenium lets the browser automatically load pages, fetch the data you need, take screenshots, or verify that certain actions happen on the site.

Selenium does not include a browser of its own; it must be used together with a third-party browser. If you run Selenium with Firefox, for example, you can literally watch a Firefox window open, navigate to the website, and perform the actions you wrote in code. Although that is easier to observe, I prefer to let my programs run in the background, so I use PhantomJS (http://phantomjs.org/download.html) instead of a real browser.

PhantomJS is a "headless" browser. It loads a website into memory and executes the JavaScript on the page, but never shows the user a graphical interface. Combining Selenium and PhantomJS gives you a very powerful web crawler that can handle cookies, JavaScript, headers, and anything else you need.

You can download the Selenium library from the PyPI website (https://pypi.python.org/simple/selenium/) or install it from the command line with a third-party package manager such as pip.

You can call the webdriver's get_cookies() method on any website (http://pythonscraping.com in this example) to view its cookies:

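The code shown as an image in the original is essentially the following sketch (replace '<path to phantomjs>' with the location of your PhantomJS binary; note that newer Selenium releases have dropped PhantomJS support):

from selenium import webdriver

# Load the page headlessly and print every cookie the site sets.
driver = webdriver.PhantomJS(executable_path='<path to phantomjs>')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print(driver.get_cookies())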

The result is a fairly typical list of cookies set by Google Analytics.


You can also call the delete_cookie(), add_cookie(), and delete_all_cookies() methods to manipulate cookies. In addition, cookies can be saved for other web crawlers to use. The following example shows how these functions work together:

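The combined example, again shown only as an image in the original, looks roughly like this (same PhantomJS path placeholder as above):

from selenium import webdriver

# First webdriver: load the site and save its cookies.
driver = webdriver.PhantomJS(executable_path='<path to phantomjs>')
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
savedCookies = driver.get_cookies()
print(savedCookies)

# Second webdriver: load the same site so Selenium knows which domain
# the cookies belong to, then wipe its own cookies and load the saved ones.
driver2 = webdriver.PhantomJS(executable_path='<path to phantomjs>')
driver2.get("http://pythonscraping.com")
driver2.delete_all_cookies()
for cookie in savedCookies:
    driver2.add_cookie(cookie)

# Reload the page; driver2 now carries exactly the same cookies as driver.
driver2.get("http://pythonscraping.com")
driver2.implicitly_wait(1)
print(driver2.get_cookies())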

In this example, the first webdriver fetches a website, prints its cookies, and saves them in the variable savedCookies. The second webdriver loads the same site (technical note: the site must be loaded first so that Selenium knows which domain the cookies belong to, even though that load does nothing useful for us), deletes all of its own cookies, and replaces them with the cookies from the first webdriver. When the page is loaded again, the timestamps, source, and other data in the two sets of cookies should be identical. From Google Analytics' point of view, the second webdriver is now indistinguishable from the first.

3. Keep your access timing normal

Some well-protected sites may stop you from submitting forms rapidly or from interacting with the site too quickly. Even without such defenses, downloading lots of data from a website much faster than any person could is an easy way to get yourself blocked.

So even though a multithreaded program can be a great way to load pages quickly, processing data in one thread while loading pages in another, it is a terrifying strategy for a well-behaved crawler. You should try to load each page only once and keep data requests to a minimum. If you can, add a short interval between page accesses, even if it is just one line of code:

time.sleep(3)

(Editor's note: wouldn't 3 plus a random number be even better?)
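Following the editor's suggestion, a slightly less predictable pause is easy to add (a small sketch, not from the book):

import random
import time

# Sleep for 3-5 seconds so the access pattern is less machine-regular.
time.sleep(3 + 2 * random.random())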

Keeping your speed reasonable is a rule you should never break. Consuming other people's server resources excessively can put you on the wrong side of the law; worse, it can drag a small website down or even knock it offline. Dragging a website down is unethical and simply wrong, so please control your collection speed!

Demystifying common form-based anti-crawler security measures

Many testing tools, such as litmus, have been used for years and are still used to distinguish web crawlers from human visitors using browsers, with varying degrees of success. Although it is not a big deal for a web bot to download a few public articles and blog posts, it is a serious problem if web bots create thousands of accounts on your site and start spamming every user. Web forms, especially those used for account creation and login, pose a serious threat to a site's security and hosting costs if bots abuse them, so restricting access to them is in many site owners' best interest (or at least they think so).

These anti-robot security measures, concentrated around forms and login links, pose a serious challenge to web crawlers.

4. Watch out for hidden input field values

In an HTML form, a "hidden" field makes a value visible to the browser but invisible to the user (unless the user reads the page's source code). As more and more websites adopt cookies to store state variables and manage user state, hidden fields have found another excellent use: preventing crawlers from submitting forms automatically.

The example shown below is the hidden fields on the Facebook login page. Although the form has only three visible fields (username, password, and a confirm button), the form in the source code sends a great deal of additional information to the server.

Hidden fields on the Facebook sign-in page

There are two main ways hidden fields are used to block web data collection. The first is to fill a field on the form page with a random variable generated by the server. If that value is absent when the form is submitted to the processing page, the server can reasonably conclude that the submission did not come from the original form page but was posted directly to the processing page by a web robot. The best way around this is to scrape the random value from the form page first and submit it along with the rest of the form data.
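As an illustration of that workaround (the URLs and field names here are hypothetical, not from the original article), you can copy every hidden field from the form page and send it back along with your own values:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Hypothetical form page; load it first to pick up the server-generated token.
form_page = session.get("http://example.com/login")
soup = BeautifulSoup(form_page.text, "html.parser")

# Copy every hidden field (including any random token) into the payload unchanged.
payload = {field.get("name"): field.get("value", "")
           for field in soup.find_all("input", {"type": "hidden"})}
payload.update({"username": "myname", "password": "mypassword"})

# Submit to the (hypothetical) form-processing page with the token included.
session.post("http://example.com/login-process", data=payload)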

The second way is the "honeypot". If a form contains a hidden field with an innocuous-looking name (a honeypot trap), such as "username" or "email address", a poorly designed web robot will fill it in and submit it to the server regardless of whether the field is visible to the user, walking straight into the server's honeypot. The server ignores hidden fields whose values were changed (or that differ from the defaults on the form page), and a visitor who filled them in may well be blocked by the site.

In short, it is sometimes worth checking the page the form lives on to see whether there are any server-prepared hidden fields (honeypot traps) you have missed or mangled. If you see hidden fields containing large random string values, the web server most likely checks them when the form is submitted. There may also be further checks that each form variable has been used only once or was generated recently (which prevents it from simply being stored in a program and reused).

5. How crawlers usually avoid honeypots

Although CSS attributes make it easy to separate useful information from useless information during web data collection (for example, by reading id and class tags), doing so can occasionally cause problems. If a field in a web form is hidden from the user via CSS, a normal visitor cannot fill it in, because it never appears in the browser. If that field does get filled in, it was probably a robot, and the submission will be rejected.

This approach applies not only to forms on a site but also to links, images, files, and any other content a robot can read but an ordinary user cannot see in the browser. A visitor who accesses one of these "hidden" pieces of content can trigger a server-side script that blocks the visitor's IP address, kicks the user off the site, or takes other measures to bar further access. In fact, many commercial sites do exactly this.

The following example uses the page at http://pythonscraping.com/pages/itsatrap.html. It contains two links, one hidden via CSS and one visible, as well as two hidden form fields.



These three elements are hidden from the user in three different ways:

    • The first link is hidden through a simple CSS property setting display:none

    • The phone number field (name="phone") is a hidden input field

    • The email address field (name="email") is moved 50,000 pixels to the right (which should put it beyond the edge of any monitor), with the scroll bar hidden

Because Selenium actually renders the content of the pages it visits, it can distinguish visible elements from hidden ones. You can determine whether an element is visible on the page with is_displayed().

For example, the following code retrieves the page described above and looks for hidden links and hidden input fields:

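The code, shown only as an image in the original post, amounts to the following (older Selenium API with PhantomJS; '<path to phantomjs>' is a placeholder):

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='<path to phantomjs>')
driver.get("http://pythonscraping.com/pages/itsatrap.html")

# Flag links that a human user could never see or click.
links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("The link " + link.get_attribute("href") + " is a trap")

# Flag input fields that are hidden and therefore should be left untouched.
fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Do not change value of " + field.get_attribute("name"))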

Selenium flags each hidden link and hidden field; in this case that means the invisible link plus the phone and email fields.


Although you are unlikely to want to visit the hidden links you find, before submitting a form remember to confirm which hidden field values are already in the form and should be submitted unchanged (or let Selenium submit the form for you automatically).

Use a remote server to avoid IP blocking

People who adopt remote platforms usually have two goals: the need for more computing power and flexibility, and the need for a changeable IP address.

6. Use a variable remote IP address

The first rule of building a web crawler is that everything can be faked. You can send mail from an address that is not your own, automate mouse movements from the command line, or horrify a site's administrators by consuming their web traffic with the Internet Explorer 5.0 browser.

But one thing you cannot fake is your IP address. Anyone can mail you a letter with the return address "The President, 1600 Pennsylvania Avenue NW, Washington, DC 20500." But if that letter is postmarked Albuquerque, New Mexico, you can be fairly certain it was not the President of the United States who wrote it.

Technically, an IP address can be spoofed in outgoing packets, which is how distributed denial-of-service (DDoS) attacks work: the attacker does not care about receiving any response, so a fake source address can be used when sending requests. Web data collection, however, depends on the server's response, so we treat the IP address as something that cannot be faked.

The effort to prevent sites from being collected mostly focuses on telling human behavior apart from robot behavior. Blocking IP addresses is overkill, like a farmer who, instead of spraying pesticide on the crops, simply burns the whole field down. It is a last resort, but a very effective one: simply ignore all packets from a dangerous IP address. This approach has several problems, however.

    • IP address access lists are hard to maintain. Although most large websites use automated programs to manage them (robots blocking robots), someone still needs to check the list occasionally, or at least watch whether it is growing out of control.

    • Because the server must check every incoming packet against the IP address access list, the check adds extra processing time. Many blocked IP addresses multiplied by a large volume of packets makes that check expensive. To reduce processing time and complexity, administrators usually group IP addresses and set rules such as "if there are a few troublemakers in this range, block all 256 addresses in it." Which leads to the next problem.

    • Blocking an IP address can have unintended consequences. For example, when I was a student at Olin College of Engineering in Massachusetts, a classmate wrote a program that voted for popular content on http://digg.com/ (Digg was big before Reddit took over). The IP address of the server the program ran on was blocked by Digg, which meant nothing on that server could reach Digg at all. The classmate simply moved the program to another server, while Digg lost traffic from many of the core users it most wanted.

Despite these drawbacks, blocking IP addresses remains a very common way for server administrators to keep suspicious web crawlers off their servers.

Tor Proxy Server

The Onion Router network, better known by the acronym Tor, is a way of anonymizing IP addresses. A network built by volunteer servers routes traffic through multiple layers of different servers (like the layers of an onion) to hide the original client at the innermost layer. Data is encrypted before it enters the network, so no single server can steal the communication. Moreover, although the inbound and outbound traffic of each server can be monitored, uncovering the true start and end points of a communication would require knowing the inbound and outbound details of every server along the path, which is essentially impossible.

 Limitations of Tor anonymity

Although we use Tor in this article to change IP addresses rather than to achieve complete anonymity, it is worth pausing on what Tor's anonymization can and cannot do.

While the Tor network lets you visit a website with an IP address that cannot be traced back to you, any information you hand the website's server can still reveal your identity. For example, if you sign in to your Gmail account and then run Google searches, those search histories are tied to your identity.

Beyond that, even the act of connecting to Tor can put your anonymity at risk. In December 2013, a Harvard undergraduate, hoping to get out of a final exam, emailed a bomb threat to the school through the Tor network using an anonymous email account. Harvard's IT department found in its logs that while the bomb threat was being sent, Tor traffic was coming from only one machine, registered to a student of the school. Although they could not determine the original source of the traffic (only that it had been sent over Tor), the timing and the registration records matched, and only one machine was connected during that window, which gave good grounds to charge the student.

Connecting to the Tor network does not make you automatically anonymous, nor does it grant access to every corner of the internet. It is a useful tool, but use it cautiously, soberly, and ethically.

Using Tor with Python requires installing and running Tor first, as described in the next section. The Tor service is easy to install and start: just download it from the Tor download page, install it, open it, and connect. Note that your connection will be slower while using Tor, because your traffic may have to loop around the world several times before reaching its destination!

PySocks

PySocks is a very simple Python proxy-communication module that works with Tor. You can download it from its website (https://pypi.python.org/pypi/PySocks) or install it with any third-party module manager.

Using the module is straightforward. Sample code is shown below; when it runs, the Tor service must be listening on port 9150 (the default):
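The sample code is an image in the original post; a minimal reconstruction, assuming the Tor SOCKS proxy is listening on localhost:9150, looks like this:

import socket
from urllib.request import urlopen

import socks  # PySocks

# Route all socket connections through the local Tor SOCKS5 proxy.
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket

# icanhazip.com echoes the IP address it sees; it should now be a Tor exit node.
print(urlopen("http://icanhazip.com").read())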

The website http://icanhazip.com/ displays the IP address of the client connecting to it, which makes it handy for testing whether Tor is working. When the program runs, the IP address it prints should not be your real one.

If you want to use Selenium and PhantomJS with Tor, you do not need PySocks at all; just make sure Tor is running, then add the service_args parameter to set the proxy port so that Selenium connects through port 9150:
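The corresponding code (also an image in the original) is roughly as follows; it uses the older Selenium PhantomJS driver, and '<path to phantomjs>' is a placeholder for your local PhantomJS binary:

from selenium import webdriver

# Tell PhantomJS to send its traffic through the local Tor SOCKS5 proxy.
service_args = ["--proxy=localhost:9150", "--proxy-type=socks5"]
driver = webdriver.PhantomJS(executable_path='<path to phantomjs>',
                             service_args=service_args)

driver.get("http://icanhazip.com")
print(driver.page_source)
driver.close()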

As before, the IP address printed by this program is not your original, but the IP address you get through the Tor client.

Running from a website host

If you own a personal or company website, you probably already have everything you need to run your web crawler from an external server. Even on relatively locked-down web servers that offer no command-line access, you can control your programs through a web interface.

If your site is hosted on a Linux server, Python is probably already available. If you are on Windows hosting, you may be less lucky; check whether Python is installed, or ask the hosting administrator to install it.

Most small web hosts provide software called cPanel, which offers basic management features and information for administering your site and its services. If you have access to cPanel, you can configure Python to run on the server: open "Apache Handlers" and add a handler (if one is not there already):
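The handler entry itself is shown only as an image in the original post; on a typical cPanel "Apache Handlers" screen it amounts to something like the following (the exact field labels vary by cPanel version):

Handler:      cgi-script
Extension(s): .py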

This tells the server to run all Python scripts as CGI scripts. CGI, the Common Gateway Interface, is any program that can run on a server and dynamically generate content to display on a website. Explicitly declaring Python scripts as CGI scripts grants the server permission to execute them, rather than just display them in the browser or offer them for download.

After writing your Python script, upload it to the server and set its file permissions to 755 so it can execute. Then run it by pointing your browser at the location where you uploaded it (you can also write a crawler to do that automatically). If you are worried about the script being exposed to the public, you can take either of the following two precautions.

    • Store the script at an obscure or deeply nested URL and make sure no other URL links to it, which prevents search engines from discovering it.

    • Protect the script with a password, or require a password or an encrypted token before it will execute.

Admittedly, running a Python script from a service designed mainly to display websites is a bit awkward. For example, you may notice the crawler "loads" slowly; in fact the page does not finish loading until the entire collection task completes (that is, until the output of every print statement can be displayed). Depending on the program, that may take minutes, hours, or never finish at all. Although it will get the job done eventually, if you want to see results in real time you need a real server.

Running from a cloud host

Although cloud computing costs can be a bottomless pit, at the time of writing the cheapest compute instances cost as little as 1.3 cents per hour (an Amazon EC2 micro instance; other instance types cost more), while Google's cheapest compute instance costs 4.5 cents per hour with a 10-minute minimum. Thanks to economies of scale in computing power, buying a small cloud instance from a big provider should cost about the same as buying your own dedicated physical machine, except that with cloud computing you do not need to hire anyone to maintain the hardware.

Once you have set up a compute instance, you get a new IP address, a user name, and a public/private key pair for connecting to the instance over SSH. From there, everything should work the same as on a physical server, except that you no longer need to worry about hardware maintenance or running complicated, redundant monitoring tools.

Summary: a checklist of common reasons crawlers get blocked

If you keep getting blocked by a site and cannot figure out why, the following checklist can help you diagnose the problem.

  • First, check the JavaScript. If the page you receive from the web server is blank, is missing information, or otherwise does not match what you expected (or what you see in your browser), the JavaScript that builds the page probably did not execute properly.

  • Check the parameters a normal browser submits. If you are about to submit a form or send a POST request to a website, remember to check the page and confirm that everything you want to submit is filled in and formatted correctly. Use the Chrome network panel (press F12 to open the developer console, then click "Network") to inspect the POST request the browser sends to the site, and confirm that every one of your parameters matches.

  • Is there a valid cookie? If you are already logged in to the site but cannot stay logged in, or the site shows other strange "login state" behavior, check your cookies. Confirm that the cookies are set correctly as each page loads and that your cookies are sent to the site with every request.

  • Is your IP banned? If you are getting HTTP errors on the client side, especially 403 Forbidden, the site may have decided your IP address is a bot and stopped accepting your requests. You will either have to wait for your IP address to be removed from the blacklist or change your IP address (go work from a Starbucks, say). If you are sure you have not been blocked, check the items below.

      • Make sure your crawler is not hitting the site too fast. Rapid collection is a bad habit: it puts a heavy burden on the webmaster's servers, it can land you in legal trouble, and it is the number-one reason IP addresses end up on blacklists. Add delays to your crawlers and let them run in the dead of night. Remember: rushing to write code or to gather data is a sign of poor project management; plan ahead so you never have to scramble.

      • One more thing to do: modify your request headers! Some sites block anything that announces itself as a crawler. If you are not sure what reasonable header values look like, use your own browser's headers.

      • Confirm that you have not clicked or accessed anything a human user normally could not click or access.

      • If you have had to resort to a lot of complicated measures to access the site, consider contacting the webmaster and explaining what you are doing. Try emailing webmaster@<domain name> or admin@<domain name> to ask permission to collect data with your crawler. Admins are people too!

"The above content is compiled from the Python Network data collection 10th, 12, 14 chapters"


Ryan Mitchell

Translator: Tao Junjie, Chen Xiaoli

Price: 59

    • The original edition is rated 4.6 stars; one book covers everything you need for data collection

    • Covers data capture, data mining, and data analysis

    • Provides detailed code examples for quickly solving real-world problems

The amount of data on the web keeps growing, and it is getting harder and harder to reach that information just by browsing; extracting and using it effectively has become a huge challenge.

This book uses the simple and powerful Python language to introduce web data collection and provides comprehensive guidance on collecting all kinds of data from the modern web. Part I focuses on the fundamentals of web data collection: how to request information from a web server with Python, how to handle the server's response, and how to interact with a website in an automated way. Part II explains how to test websites with crawlers, automate processing, and access the network in more ways.
