"HTTP" Web bot


"HTTP authoritative guide" Learning Summary

Web robots are self-animating ("autonomous") user agents.

Web robots are software programs that automate a series of Web transactions without human intervention. They are also known as "crawlers," "spiders," or "worms."

  1. Crawlers and crawling. A web crawler is a robot that recursively traverses informational web sites: it fetches the first web page, then every page that page points to, then every page those pages point to, and so on. Because a robot that recursively follows these links "crawls" along the web created by HTML hyperlinks, it is called a crawler or a spider.
    • Where to start: the "root set"
    • Extracting links and normalizing relative links
    • Cycle avoidance
    • Loops and duplicates ("dups")
    • Trails of breadcrumbs. Because the number of URLs is huge, a crawler needs sophisticated data structures to determine quickly which URLs it has already visited, and those structures must be efficient in both access speed and memory use. The following are some useful techniques that large-scale web crawlers use to manage the set of addresses they have visited.
      1. Trees and hash tables
      2. Lossy presence bit maps
      3. Checkpoints
      4. Partitioning: some large web robots use a robot "cluster", in which each separate computer is a robot working in concert with the others, and each robot is assigned a particular "slice" of URLs that it is responsible for crawling.
    • Aliases and robot cycles: if two URLs look different but actually point to the same resource, they are said to be "aliases" of each other.
    • Canonicalizing URLs to eliminate the duplication caused by URL aliases
    • Filesystem link cycles
    • Dynamic virtual web spaces
    • Avoiding loops and duplication; useful techniques include the following (a small crawler sketch combining several of them appears after this list):
      1. Canonicalizing URLs
      2. Breadth-first crawling: spreads requests evenly across sites instead of piling load onto a single server
      3. Throttling: limiting the number of pages the robot fetches from a site in a period of time
      4. Limiting URL size
      5. URL/site blacklists
      6. Pattern detection
      7. Content fingerprinting
      8. Human monitoring
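
The ideas above can be combined into a very small crawler loop. The sketch below is my own illustration, not code from the book: it uses Python's standard urllib.parse to canonicalize URLs, a hash set to remember visited addresses, and a FIFO queue for breadth-first crawling. The fetch_links helper and the seed URLs are hypothetical placeholders.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so that trivial aliases collapse to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop default ports: http://example.com:80/ is an alias of http://example.com/
    netloc = host if parts.port in (None, 80, 443) else "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    # Fragments never reach the server, so discard them
    return urlunsplit((scheme, netloc, path, parts.query, ""))

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from the root set `seeds`.

    fetch_links(url) is a hypothetical helper that downloads a page and
    returns the (possibly relative) links found in its HTML.
    """
    visited = set()                                  # canonical URLs already seen
    queue = deque(normalize(u) for u in seeds)       # FIFO queue => breadth-first
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:                           # loop/duplicate avoidance
            continue
        visited.add(url)
        for link in fetch_links(url):
            absolute = normalize(urljoin(url, link)) # resolve relative links
            if absolute not in visited:
                queue.append(absolute)
    return visited
```

For very large crawls, the in-memory set could be replaced by a lossy presence bit map (for example a Bloom filter) or checkpointed to disk, along the lines of the techniques listed above.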
  2. Robot HTTP. Robots are no different from any other HTTP client program; they must abide by the rules of the HTTP specification. Many robots, however, implement only the minimal subset of HTTP needed to request the content they seek.
    • Identifying request headers. Request headers are important both for tracking down the owner of a misbehaving crawler and for telling the server what kinds of content the robot can handle, so robot implementors are encouraged to send the following headers: User-Agent, From, Accept, Referer.
    • Virtual hosting. Robot implementations need to support the Host header.
    • Conditional requests. To reduce the amount of content a robot has to fetch, it can issue conditional HTTP requests that compare timestamps or entity tags (a minimal polite-fetch sketch follows this list).
    • Response handling: status codes and entities.
    • User-Agent targeting.
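
As a rough illustration of these points (not code from the book), the sketch below sends the recommended identification headers and issues a conditional request; the robot name, contact address, and URLs are made up. Python's urllib adds the Host header automatically, which covers the virtual hosting requirement.

```python
import urllib.error
import urllib.request

def polite_fetch(url, last_modified=None, etag=None):
    """Fetch a page the way a well-behaved robot should: identify itself
    and use conditional requests to avoid re-downloading unchanged content."""
    req = urllib.request.Request(url, headers={
        "User-Agent": "ExampleBot/1.0 (+http://example.com/bot.html)",  # hypothetical robot name
        "From": "bot-admin@example.com",             # contact address for the robot's owner
        "Accept": "text/html",                       # content types the robot can handle
        "Referer": "http://example.com/start.html",  # page that led to this URL (placeholder)
    })
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)  # timestamp comparison
    if etag:
        req.add_header("If-None-Match", etag)               # entity-tag comparison
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:      # Not Modified: the cached copy is still valid
            return 304, None
        raise
```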
  3. Misbehaving robots: unruly robots can cause many serious problems. Here is a list of some of the mistakes robots make:
    • Runaway robots: robots stuck in loops that put a heavy load on servers (a simple per-site throttle sketch follows this list)
    • Stale URLs: requesting URLs that no longer exist
    • Long, wrong URLs: requesting very long, erroneous URLs
    • Nosy robots: fetching data that the site owner considers private
    • Dynamic gateway access: requesting content from dynamic gateway applications, which is expensive to generate
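
Throttling, mentioned in the loop-avoidance list earlier, is the usual guard against turning into a runaway robot. A minimal per-host throttle might look like the following sketch (my own illustration):

```python
import time
from urllib.parse import urlsplit

class HostThrottle:
    """Enforce a minimum delay between two requests to the same host,
    so that even a buggy crawl loop cannot hammer a single server."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay   # seconds to wait between hits on one host
        self.last_hit = {}           # host -> time of the previous request

    def wait(self, url):
        host = urlsplit(url).hostname or ""
        earliest = self.last_hit.get(host, 0.0) + self.min_delay
        now = time.monotonic()
        if now < earliest:
            time.sleep(earliest - now)
        self.last_hit[host] = time.monotonic()

# Usage: create one HostThrottle for the whole crawl and call
# throttle.wait(url) immediately before every fetch.
```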
  4. Excluding robots: a web site can provide an optional file named robots.txt in the web server's document root directory to control which robots may access it.
      • The robots exclusion standard. There is no formal standard; it is an informal, de facto standard.
      • Web sites and robots.txt files
        • Fetching robots.txt. The robot should send identifying information in the From and User-Agent headers to help the site administrator track robot accesses, and to provide contact information in case the administrator wants to ask about or complain about the robot.
        • Response codes: the robot behaves differently depending on the response code it gets back.
        • The format of the robots.txt file. The file uses a line-oriented syntax. There are three kinds of lines: blank lines, comment lines, and rule lines.
          • User-Agent lines. Each record starts with one or more User-Agent lines of the form "User-Agent: <robot-name>" or "User-Agent: *". If no record matches, access is unrestricted.
          • Disallow and Allow lines. These immediately follow the User-Agent lines of a robot exclusion record.
          • Disallow/Allow prefix matching. Prefix matching usually works well, but in a few cases it is not expressive enough: if you want to disallow crawling of a particular subdirectory regardless of which path prefix it appears under, robots.txt cannot express that, and every path to the subdirectory must be enumerated separately.
          • Other robots.txt wisdom. Fields a robot does not recognize are ignored; a record may not be broken across lines in the middle; version 0.0 of the robots exclusion standard did not support Allow lines.
          • Caching and robots.txt expiration. A robot should periodically re-fetch robots.txt files and cache the result.
          • Robot exclusion code. Several publicly available Perl libraries interact with robots.txt files; the WWW::RobotRules module on CPAN is one such example (a Python equivalent is sketched after this item).
          • HTML robot-control META tags, e.g. <meta name="ROBOTS" content=directive-list>. Robot META directives; META tags for search engines (a small parsing sketch also follows).
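
From the robot's side, the robots.txt check does not have to be hand-written. Python's standard library ships urllib.robotparser, which plays the same role as the Perl WWW::RobotRules module mentioned above; the site and robot name below are placeholders.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder site
rp.read()                                     # fetch and parse the file

# Ask whether our robot may fetch a given URL before crawling it.
if rp.can_fetch("ExampleBot", "http://example.com/private/data.html"):
    print("allowed by robots.txt")
else:
    print("disallowed by robots.txt")
```

The parser also records when the file was fetched (mtime()/modified()), which a crawler can use to decide when its cached copy of robots.txt has expired and should be re-fetched.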
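A page-level check for the robot-control META tag can be done with the standard html.parser module; the HTML snippet below is a made-up example.

```python
from html.parser import HTMLParser

class RobotMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

parser = RobotMetaParser()
parser.feed('<html><head><meta name="ROBOTS" content="NOINDEX,NOFOLLOW"></head></html>')
print(parser.directives)   # {'noindex', 'nofollow'}
```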
  5. Robot etiquette: guidelines for robot designers.
  6. Search engines
    1. The big picture.
    2. The architecture of modern search engines: they build complex local databases called full-text indexes.
    3. Full-text indexes. A full-text index is a database that, given a word, can report the documents containing it (a toy inverted-index sketch appears after this list).
    4. Posting search queries. When a user posts a query to a web search engine gateway, she fills in an HTML form, and her browser sends the form to the gateway with an HTTP GET or POST request. The gateway program parses the search request and translates the web UI query into the expression needed to search the full-text index.
    5. Sorting and presenting the results: relevancy ranking.
    6. Spoofing: gaming the results to boost a site's ranking.
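
To make the full-text index and relevancy ranking concrete, here is a toy sketch (my own, with made-up documents) that builds an inverted index and ranks query results by simple term frequency, a crude stand-in for real relevancy ranking.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for crawled pages (URLs and text are made up).
pages = {
    "http://example.com/a": "http robots crawl the web",
    "http://example.com/b": "search engines rank pages by relevance",
    "http://example.com/c": "web robots and search engines use http",
}

# Build the full-text (inverted) index: word -> {document URL: occurrence count}.
index = defaultdict(Counter)
for url, text in pages.items():
    for word in text.lower().split():
        index[word][url] += 1

def search(query):
    """Translate a query into index lookups and rank matches by term frequency."""
    scores = Counter()
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return scores.most_common()   # highest score first: crude relevancy ranking

print(search("web robots"))
```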

"HTTP" Web bot

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.