Web Crawlers (Web spider, Web robot) and Web Security

Source: Internet
Author: User

Web Crawler Overview
Web crawlers, also known as Web spiders or Web robots, are programs or scripts that automatically retrieve Web resources according to certain rules, and they are widely used on the Internet. Search engines use Web crawlers to capture Web pages, documents, and even images, audio, video, and other resources, then use indexing technology to organize this information and make it available to search users. With the rapid development of networks, the World Wide Web has become the carrier of a huge amount of information, and extracting and using that information effectively has become a major challenge. Continuously improving Web crawler technology addresses this challenge by providing powerful support for efficient searches in the specific fields and topics that users care about. Web crawlers also give small and medium-sized websites an effective way to promote themselves, and optimizing websites for search engine crawlers has been popular for some time.
A traditional Web crawler starts from the URLs (Uniform Resource Locators) of one or more initial Web pages, obtains the URLs on those pages, and continuously extracts new URLs from the current page and puts them into a queue until certain stop conditions are met. Web crawlers have since developed into intelligent tools that combine techniques such as Web page data extraction, machine learning, data mining, and semantic understanding.
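As a minimal sketch of this queue-driven process (the seed URL is hypothetical, and a real crawler would add politeness delays, robots.txt checks, and large-scale deduplication):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: take a URL from the queue, fetch it,
    extract new URLs, and enqueue the ones not seen before."""
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(seen) <= max_pages:           # stop condition
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                                  # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)             # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# pages = crawl("http://example.com/")  # hypothetical seed URL
```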
Security Issues of Web Crawlers
Because a Web crawler's policy is to "crawl" as much valuable information from a website as possible, it accesses as many pages as it can according to its specific policy, occupying network bandwidth and increasing the processing overhead of the Web server. Webmasters of many small sites find that access traffic increases significantly when Web crawlers visit. Malicious users can use crawlers to launch DoS attacks against a Web site, exhausting its resources so that it can no longer provide normal service. Malicious users may also use Web crawlers to collect various kinds of sensitive data for improper purposes, mainly in the following ways:
1) Search for the Directory List
Many Web servers on the Internet return a directory listing when a client requests a directory that has no default index page. Such a listing usually includes links to subdirectories and to the files in the current directory, which users can click to go one level deeper or open the files. By capturing directory listings, malicious users can therefore obtain a large amount of useful information, including the directory structure of the site, sensitive files, and the design architecture and configuration of the Web server; for example, configuration files, log files, password files, and database files used by the application may all be crawled. This information is valuable for selecting an attack target or directly intruding into the site.
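As a hedged illustration of how easily such listings are recognized and harvested, the sketch below applies a simple heuristic (the URL is hypothetical; the "Index of" title and "Parent Directory" link are what common servers such as Apache's auto-index module generate by default):

```python
from urllib.request import urlopen

def looks_like_directory_listing(url):
    """Fetch a URL and apply a naive heuristic: auto-generated
    directory listings typically carry an 'Index of' title."""
    try:
        body = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
    except Exception:
        return False
    return "<title>Index of" in body or "Parent Directory" in body

# print(looks_like_directory_listing("http://example.com/backup/"))  # hypothetical path
```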
2) Search for test pages, help documents, sample programs, and possibly defective programs
Most Web server software ships with test pages, help documents, sample programs, and backdoor programs left in for debugging. These files often leak a large amount of system information and may even provide ways to access Web service data directly without authentication, making them an effective source of intelligence for malicious users analyzing and attacking Web servers. The presence of such files also suggests that the website has potential security vulnerabilities.
3) Search for the Administrator Logon page
Many network products provide Web-based management interfaces that allow administrators to manage and control them remotely over the Internet. If the administrator takes no precautions and leaves the product's default administrator name and password unchanged, then once the administrator logon page is found by malicious users, network security faces a serious threat.
4) Search for Internet users' personal data
Internet users' personal data includes information such as name, ID card number, phone number, Email address, QQ number, and mailing address. Once malicious users obtain such data, the victims become easy targets of social-engineering attacks or fraud.
Therefore, taking appropriate measures to restrict Web crawlers' access permissions, opening suitable pages to crawlers while blocking sensitive pages, is extremely important for keeping a website running securely and protecting users' privacy.
Web Vulnerability Scanning Based on Web Crawler Technology
The security threats described above are indirect: Web crawlers collect website information in preparation for illegal access, attacks, or fraud. With the development of security technology, however, Web crawler technology is also being used to probe Web vulnerabilities directly, which directly affects the security of Web servers. Among Web server vulnerabilities, cross-site scripting (XSS) and SQL injection vulnerabilities account for a large proportion, and both can be detected with suitably extended Web crawlers. Due to a lack of security knowledge, a considerable number of programmers do not sufficiently check the content of Web requests when writing Web applications, which leaves many applications with security risks. An attacker can submit a specially crafted URL containing SQL statements or scripts and, based on the results returned by the program, obtain sensitive information or even modify back-end data directly. Given this situation, applying Web crawler technology to Web vulnerability scanning greatly improves the efficiency of vulnerability discovery.
Web vulnerability scanning based on Web crawler technology is divided into the following steps (a combined sketch follows the list):
1) Page filtering: capture website pages with an automated program and extract URLs from HTML tags such as <a> and <frame>. These tags contain the URL information that lets a malicious user perform deeper Web access or submit operations.
2) URL matching: automatically match the URLs on a Web page and extract the dynamic query URLs or submit URLs whose parameters can be used for vulnerability testing. For example, in the dynamic query URL "http://baike.xxxx.com/searchword?word=frameset&pic=1", "word=frameset" is the dynamic parameter part of the URL and its value can be changed. A submit URL sends Web user input to the server for processing; its parameters are mostly user input and can also be changed.
3) Vulnerability testing: based on the dynamic query URLs or submit URLs, automatically vary the parameter part, inserting quotation marks and semicolons (to which SQL injection is sensitive) or script tags (to which XSS is sensitive), and automatically determine whether a vulnerability exists from the results returned by the Web server. For example, the dynamic query URL from the "URL matching" step can be transformed into http://baike.xxxx.com/searchword?word=<script>alert(1)</script>&pic=1 to test for a cross-site scripting vulnerability.
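A combined minimal sketch of these three steps follows; the collected tags, the payload list, and the naive "payload reflected in the response" check are simplifying assumptions, and a production scanner would use far more payloads and much more careful response analysis:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, parse_qsl, urlencode, urlunparse
from urllib.request import urlopen

# Step 1: page filtering -- pull candidate URLs out of HTML tags.
class URLCollector(HTMLParser):
    TAGS = {"a": "href", "frame": "src", "iframe": "src", "form": "action"}
    def __init__(self):
        super().__init__()
        self.urls = []
    def handle_starttag(self, tag, attrs):
        wanted = self.TAGS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.urls.append(value)

# Step 2: URL matching -- keep only dynamic query URLs (they carry parameters).
def dynamic_urls(base_url, html):
    collector = URLCollector()
    collector.feed(html)
    absolute = (urljoin(base_url, u) for u in collector.urls)
    return [u for u in absolute if urlparse(u).query]

# Step 3: vulnerability testing -- mutate each parameter with test payloads
# and check whether the payload is echoed back unfiltered (a naive XSS/SQLi hint).
PAYLOADS = ["'", ";", "<script>alert(1)</script>"]

def test_url(url):
    findings = []
    parts = urlparse(url)
    params = parse_qsl(parts.query)
    for i, (name, _) in enumerate(params):
        for payload in PAYLOADS:
            mutated = params[:i] + [(name, payload)] + params[i + 1:]
            probe = urlunparse(parts._replace(query=urlencode(mutated)))
            try:
                body = urlopen(probe, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue
            if payload in body:            # payload reflected in the response
                findings.append((name, payload, probe))
    return findings

# Typical use (hypothetical seed): fetch a page, then
#   for u in dynamic_urls(seed, html): print(test_url(u))
```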
How to cope with crawler security threats
Because of the security threats posed by Web crawlers, many website administrators consider limiting or even rejecting crawler access. In practice, the ideal approach is to treat crawlers differently according to the security and sensitivity of the website's content. A website's URL organization should place content on different URL paths depending on whether it is suitable for wide publication. When the same Web page contains both fully public information and sensitive information, the sensitive content should be presented through links, tags embedded in the page, or dynamic pages, and where possible its URLs should be kept separate from those of public static pages. When restricting crawlers, you can then apply different restrictions to different types of crawlers and user agents according to the security and sensitivity of the URLs.
  You can use the following methods to restrict crawlers:
1) Set the robots.txt file
The simplest way to restrict crawlers is to set up a robots.txt file. The robots.txt file is the first file a search engine crawler looks at when it visits a website; it tells the crawler which files on the server may be fetched. For example, setting "Disallow: /" declares that no path may be visited. Unfortunately, not all crawlers follow this convention, so setting the robots file alone is not enough.
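As a small illustration of how a compliant crawler interprets such a file (the Disallow rules and crawler name below are assumptions made up for the example), Python's standard urllib.robotparser module can be used:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (assumed rules for illustration):
#   User-agent: *
#   Disallow: /admin/
#   Disallow: /logs/
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /logs/",
])

# A compliant crawler checks before fetching; non-compliant crawlers simply ignore the file.
print(rp.can_fetch("MyCrawler", "http://example.com/admin/login.php"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))       # True
```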
2) User Agent identification and restriction
To restrict crawlers that do not comply with robots.txt, you first have to distinguish crawler traffic from ordinary user traffic, that is, identify it. Crawlers can generally be identified by the User Agent field of their HTTP requests; this field lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, language, and plug-ins. A crawler's User Agent field usually differs from a browser's: the Google search engine crawler carries a string containing Googlebot, for example "User-Agent: Googlebot/2.1 (http://www.google.com/bot.html)", and Baidu search engine crawlers carry strings containing Baiduspider. Many Web server programs, such as Apache, can be configured to filter access by the User Agent field, which effectively limits most crawlers.
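A minimal sketch of this kind of User Agent filtering, written here as a plain Python check that could sit in any request-handling layer (the blocked substrings are illustrative only; in practice the filtering is usually configured in the Web server or a gateway, as noted above):

```python
import re

# Example User-Agent substrings to block (illustrative list only).
BLOCKED_AGENTS = re.compile(
    r"Googlebot|Baiduspider|bingbot|python-requests|curl|wget",
    re.IGNORECASE,
)

def allow_request(user_agent):
    """Return False when the User-Agent looks like a known crawler or tool."""
    if not user_agent:
        return False            # many crawlers send no User-Agent at all
    return BLOCKED_AGENTS.search(user_agent) is None

print(allow_request("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))          # True
print(allow_request("Googlebot/2.1 (+http://www.google.com/bot.html)"))    # False
```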
3) Identification and restriction based on access behavior
A crawler that deliberately disguises the User Agent field of its HTTP requests as that of a Web browser can still be identified by its access behavior. Crawlers generally access pages at a high frequency and in a regular pattern, unlike the random, low-frequency browsing of real users. Restricting this type of crawler is similar in principle to defending against DDoS attacks: it is based on statistics, and it can usually be achieved only with network equipment capable of deep traffic inspection, such as application identification devices and IPS. Using network devices to restrict Web crawlers is not only more comprehensive but also suited to unified management across multiple servers, avoiding the omissions that can occur when each server is managed separately.
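As a rough sketch of this statistical idea (the window size and threshold are arbitrary assumptions, and real deployments do this on aggregated traffic inside dedicated network devices rather than in application code):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # sliding window length (assumed)
MAX_REQUESTS = 120     # sustained rate above ~2 requests/second looks crawler-like (assumed)

_history = defaultdict(deque)   # client IP -> timestamps of recent requests

def record_and_check(client_ip, now=None):
    """Record one request and return True if the client should be throttled."""
    now = time.time() if now is None else now
    window = _history[client_ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:   # drop events outside the window
        window.popleft()
    return len(window) > MAX_REQUESTS

# Simulated burst: 200 requests within a few seconds trips the limit.
for i in range(200):
    flagged = record_and_check("203.0.113.7", now=1000.0 + i * 0.02)
print(flagged)   # True
```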
Conclusion
Web crawlers and the technologies built on them bring websites a considerable amount of traffic, but they also bring direct and indirect security threats, and more and more websites are paying attention to restricting them. With the rapid development of the Internet, applications based on Web crawler and search engine technology will only multiply, so website administrators and security personnel need to understand how crawlers work and how to restrict them, and be prepared to deal with all kinds of Web crawlers.
TechTarget China

