A Brief Introduction to How Web Spiders Operate



When a web spider enters a website, it generally first visits a special text file, robots.txt, which is normally placed in the root directory of the web server, e.g. http://www.ithov.com/robots.txt. Through robots.txt, the site administrator can define which directories web spiders may not access, or which directories are off-limits to particular spiders. For example, a website's executable directories and temporary-file directories usually should not be indexed by search engines, so the webmaster can declare those directories as disallowed.
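As an illustration, a robots.txt that blocks all spiders from an executable directory and a temporary-file directory might look like the following (the directory names here are only examples):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/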



The syntax of robots.txt is simple. For example, if no directory is restricted at all, the file only needs the following two lines: "User-agent: *" followed by an empty "Disallow:". Of course, robots.txt is only a convention: if a web spider's designers choose not to follow it, the webmaster cannot stop the spider from accessing certain pages; but web spiders generally do comply with it, and webmasters also have other ways to refuse spiders access to particular content. When a web spider downloads a page, it parses the page's HTML code, which may contain META tags in its head section. Through these tags, the page can tell the spider whether it needs to be crawled and whether the links on the page should be followed.
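In HTML, these instructions are written with the robots META tag in the page head, for example:

<!-- do not index this page, but do follow its links -->
<meta name="robots" content="noindex, follow">

<!-- index this page, but do not follow its links -->
<meta name="robots" content="index, nofollow">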



For example, a page can declare that it should not be indexed itself while the links inside it should still be followed. Whether through robots.txt or META tags, though, most websites today want search engines to crawl their pages as completely as possible, because that lets more visitors find the site through search engines. To help the site be crawled more completely, the webmaster can build a site map, i.e. a sitemap. Many web spiders treat a sitemap.htm file as the entry point for crawling a site: the webmaster can put links to all of the site's pages in this file, so the spider can easily crawl the whole site, avoid missing pages, and also reduce the load on the web server (a hypothetical sketch of such a file follows below).

Content extraction. When a search engine builds its index of web pages, the object it processes is a text file, but the pages a web spider crawls come in many formats: HTML, images, DOC, PDF, multimedia, dynamic pages, and so on.
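A sitemap.htm of this kind is simply a page of links; a minimal hypothetical example (the file names are invented) might be:

<html>
  <body>
    <h1>Site Map</h1>
    <ul>
      <li><a href="/index.html">Home</a></li>
      <li><a href="/products/intro.html">Product Introduction</a></li>
      <li><a href="/news/archive.html">News Archive</a></li>
    </ul>
  </body>
</html>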



After these files are crawled, the text in them needs to be extracted. Extracting this information accurately matters in two ways: it plays an important role in the accuracy of the search engine, and it affects whether the spider can correctly follow the other links in the document. For document formats such as DOC and PDF, which are produced by professional vendors, the vendors provide corresponding text-extraction interfaces; the spider only needs to call these plug-in interfaces to easily pull out the text and other related information from the document. HTML documents are different. HTML has its own syntax, using different tags to express fonts, colors, positions, and other layout, so extracting the text requires filtering out these tags. Filtering the tags is not difficult, because they follow definite rules; the spider only has to pick up the corresponding information according to each tag.
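A minimal sketch of this kind of tag filtering, using Python's standard html.parser module (the class name and sample page are purely illustrative; a real spider would handle encodings, entities, and malformed markup far more carefully):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><head><title>Demo</title><style>p{color:red}</style></head>"
               "<body><p>Hello <b>spider</b></p><script>var x=1;</script></body></html>")
print(" ".join(extractor.parts))     # -> "Demo Hello spider"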



While recognizing this text, however, the spider also needs to record a good deal of layout information at the same time: the font size of the text, whether it is a heading, whether it is bold, whether it is one of the page's keywords, and so on. All of this helps in calculating how important each word is within the page. At the same time, an HTML page contains, besides the title and body text, many advertising links and links to shared site channels; these links have nothing to do with the body text and also need to be filtered out during content extraction. For example, suppose a website has a "product introduction" channel whose link appears in the navigation bar of every page on the site. If these navigation links are not filtered out, then a search for "product introduction" will match every page of that site, which obviously produces a lot of junk results.
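A rough sketch of how such layout signals could feed into term weighting (the weight values and tag choices are arbitrary illustrations, not a prescribed scheme):

from collections import Counter
from html.parser import HTMLParser

# Illustrative weights: text in a title or heading counts more than plain body text.
TAG_WEIGHTS = {"title": 10, "h1": 8, "h2": 6, "b": 3, "strong": 3}

class WeightedTermCounter(HTMLParser):
    """Accumulate a score per word, boosted by the tag the word appears in."""
    def __init__(self):
        super().__init__()
        self.weight_stack = [1]
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.weight_stack.append(TAG_WEIGHTS.get(tag, self.weight_stack[-1]))

    def handle_endtag(self, tag):
        if len(self.weight_stack) > 1:
            self.weight_stack.pop()

    def handle_data(self, data):
        for word in data.lower().split():
            self.scores[word] += self.weight_stack[-1]

counter = WeightedTermCounter()
counter.feed("<html><head><title>Product Introduction</title></head>"
             "<body><h1>Product Introduction</h1><p>plain body text</p></body></html>")
print(counter.scores.most_common(3))   # 'product' and 'introduction' score highest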



Filtering out these useless links requires gathering statistics on a large number of page-structure patterns and extracting common, uniform filtering rules; some important sites with unusual layouts still have to be handled individually. This requires a certain extensibility in the spider's design. For multimedia files, pictures, and the like, the content is generally determined from the anchor text of the link (that is, the visible link text) and from related file comments. For example, if the link text is "Maggie Cheung" and it points to a picture in BMP format, the spider knows that the picture's content is "a photo of Maggie Cheung"; in this way, searching for "Maggie Cheung" and "photo" lets the search engine find this picture.
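A minimal sketch of collecting anchor text as the description of a linked image (the file extensions, class name, and sample markup are only for illustration):

import os
from html.parser import HTMLParser

IMAGE_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".png", ".gif"}

class ImageLinkDescriber(HTMLParser):
    """Map linked image files to the anchor text that points at them."""
    def __init__(self):
        super().__init__()
        self.current_href = None
        self.descriptions = {}          # image URL -> anchor text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "") or ""
            ext = os.path.splitext(href)[1].lower()
            self.current_href = href if ext in IMAGE_EXTENSIONS else None

    def handle_data(self, data):
        if self.current_href and data.strip():
            self.descriptions.setdefault(self.current_href, data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

describer = ImageLinkDescriber()
describer.feed('<a href="photos/cheung.bmp">Maggie Cheung</a>')
print(describer.descriptions)   # {'photos/cheung.bmp': 'Maggie Cheung'}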



In addition, many multimedia files carry file attributes, which can also be taken into account to better understand the file's contents. Dynamic web pages have always been a problem for web spiders. A so-called dynamic page, as opposed to a static page, is generated automatically by a program. The benefits are that the style of a site can be changed quickly and uniformly and that less space is taken up on the web server, but dynamic pages also cause some trouble for crawling. As development languages keep multiplying, there are more and more kinds of dynamic pages, such as ASP, JSP, PHP, and so on.



These types of pages may still be relatively easy for a web spider. What spiders find harder to handle are pages generated by scripting languages such as VBScript and JavaScript; to process such pages well, a spider needs its own script interpreter. For sites that keep much of their data in a database and only expose it through database queries, crawling is very difficult. If the designers of such a site want the data to be searchable by search engines, they need to provide a way to traverse the entire contents of the database. Extracting the content of web pages has always been an important technology for web spiders.



The whole system is generally built in the form of plug-ins: a plug-in management service dispatches pages of different formats to different plug-ins for processing. The advantage of this approach is good extensibility; whenever a new file type appears, handling for it can be added to the plug-in management service as a new plug-in (a minimal sketch of such a registry follows below).

Update cycle. Because the content of websites changes constantly, a web spider must also keep updating the page content it has crawled. This requires the spider to rescan sites on a certain schedule to see which pages need to be updated, which pages are new, and which pages have become expired dead links.
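A minimal sketch of such a plug-in registry, dispatching on content type (the type names and extractor functions are invented for illustration; the article does not specify a concrete interface):

from typing import Callable, Dict

# Hypothetical plug-in registry: each content type registers its own text extractor.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}

def register(content_type: str):
    """Decorator that installs a plug-in for one content type."""
    def wrapper(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        EXTRACTORS[content_type] = func
        return func
    return wrapper

@register("text/html")
def extract_html(data: bytes) -> str:
    # A real plug-in would strip tags as sketched earlier.
    return data.decode("utf-8", errors="replace")

@register("application/pdf")
def extract_pdf(data: bytes) -> str:
    # Placeholder: a real plug-in would call a vendor-supplied PDF text interface.
    return "<text extracted from PDF>"

def extract(content_type: str, data: bytes) -> str:
    """Dispatch a downloaded document to the plug-in registered for its type."""
    plugin = EXTRACTORS.get(content_type)
    if plugin is None:
        raise ValueError("no plug-in registered for " + content_type)
    return plugin(data)

print(extract("text/html", b"<p>hello</p>"))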

