Chinese search engine technology unveiling: web spider (4)

Source: e800.com.cn


Content Extraction

A search engine builds its index by processing text, but web spiders capture web pages in many formats: HTML, images, Doc, PDF, multimedia, dynamic web pages, and others. After these files are captured, the text information must be extracted from them. Extracting this information accurately plays an important role in the search accuracy of the engine, and it also affects how correctly the web spider tracks the links inside those pages.

For documents such as Doc and PDF, which are generated by professional vendors' software, the vendors usually provide corresponding text-extraction interfaces. The web spider only needs to call the interfaces of these plug-ins to extract the text and other file-related information.
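
For example, a spider can wrap such a vendor-provided extractor behind one uniform call. A minimal sketch, assuming poppler's pdftotext command-line tool is installed (the function name is an illustrative assumption):

    import subprocess

    def extract_pdf_text(pdf_path: str) -> str:
        """Call an external extraction tool (here poppler's pdftotext,
        assumed to be installed) and return the document's plain text.
        The trailing '-' sends the output to stdout instead of a file."""
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout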

HTML documents are different. HTML has its own syntax, in which different markup tags express fonts, colors, positions, and other layout properties; these tags must be filtered out when extracting the text. Filtering the tags is not difficult, because they follow fixed rules, and the corresponding information can be obtained according to each tag. However, while recognizing the tags, the spider needs to synchronously record layout information, such as the font size of the text, whether it is a title, whether it is displayed in bold, and whether it is a page keyword, since this information helps calculate the importance of each word on the page.

Besides the title and body, an HTML page contains many advertisement links and public channel links that have nothing to do with the body text, and these useless links must also be filtered out during extraction. For example, a website may have a "Product Introduction" channel whose link appears in the navigation bar of every page on the site. If the navigation-bar link is not filtered, a search for "Product Introduction" would return every page of that site, which would undoubtedly produce a lot of junk results. Filtering invalid links requires computing structural rules over a large number of web pages, extracting their commonalities, and filtering them in a unified manner; important websites with special structures still need to be handled individually. This requires the web spider's design to be extensible.
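
As a minimal sketch of this kind of tag filtering, the following Python fragment uses the standard html.parser module to strip tags while recording whether each run of text appeared in a title, heading, or bold context (the emphasis-tag list is an illustrative assumption, not a search engine's actual weighting scheme):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Strips HTML tags while recording a simple layout hint
        (title/heading/bold) for each run of text."""

        # Illustrative set of tags that suggest a word is important.
        EMPHASIS_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}
        SKIP_TAGS = {"script", "style"}      # no visible text inside

        def __init__(self):
            super().__init__()
            self.open_tags = []              # stack of currently open tags
            self.chunks = []                 # (text, is_emphasized) pairs

        def handle_starttag(self, tag, attrs):
            self.open_tags.append(tag)

        def handle_endtag(self, tag):
            if tag in self.open_tags:
                # pop back to the matching tag (tolerates sloppy HTML)
                while self.open_tags and self.open_tags.pop() != tag:
                    pass

        def handle_data(self, data):
            if any(t in self.SKIP_TAGS for t in self.open_tags):
                return                       # ignore scripts and styles
            text = data.strip()
            if text:
                emphasized = any(t in self.EMPHASIS_TAGS
                                 for t in self.open_tags)
                self.chunks.append((text, emphasized))

    parser = TextExtractor()
    parser.feed("<html><title>Demo</title><body><b>bold</b> plain"
                "<script>ignored()</script></body></html>")
    print(parser.chunks)  # [('Demo', True), ('bold', True), ('plain', False)]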

For multimedia files, images, and similar files, the content is usually determined by the anchor text of the links pointing at them (that is, the link text) and by related file annotations. For example, if a link whose text is "Zhang Manyu photo" points to a BMP image, the web spider knows that the image content is "a photo of Zhang Manyu". In this way, the search engine can find the image when users search for "Zhang Manyu" or "photo". In addition, many multimedia files carry file attributes, and taking these attributes into account gives a better understanding of the file content.
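
A hedged sketch of collecting such anchor text, again with the standard html.parser module (the media-suffix list is an illustrative assumption):

    from html.parser import HTMLParser

    class MediaAnchorCollector(HTMLParser):
        """Collects (url, anchor text) pairs for links that point at
        media files, so the text can serve as the file's description."""

        MEDIA_SUFFIXES = (".bmp", ".jpg", ".png", ".gif", ".mp3", ".avi")

        def __init__(self):
            super().__init__()
            self._href = None      # target of the <a> we are inside, if any
            self._text = []
            self.media_links = []  # (href, anchor text) results

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href", "")
                if href and href.lower().endswith(self.MEDIA_SUFFIXES):
                    self._href, self._text = href, []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.media_links.append(
                    (self._href, "".join(self._text).strip()))
                self._href = None

    p = MediaAnchorCollector()
    p.feed('<a href="zhang_manyu.bmp">Zhang Manyu photo</a>')
    print(p.media_links)  # [('zhang_manyu.bmp', 'Zhang Manyu photo')]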

Dynamic web pages have always been a problem for web spiders. A dynamic web page, as opposed to a static one, is a page generated automatically by a program. This benefits the site, which can change page styles quickly and reduce the space the pages occupy on the server, but it also brings trouble to web spiders. As development languages multiply, there are more and more kinds of dynamic pages, such as ASP, JSP, and PHP; these kinds are comparatively easy for a web spider to handle. What a web spider handles with difficulty are pages generated by scripting languages such as VBScript and JavaScript: to process them completely, the spider needs its own script interpreter. For websites that keep most of their data in a database, the information can only be obtained by querying that database, which causes great difficulty for web spiders. For such sites, if the designers want their data to be searchable by search engines, they need to provide a method for traversing the entire database.

Web content extraction has always been an important technology in web spiders. The whole system is generally implemented as plug-ins: a plug-in management service program dispatches pages of different formats to different plug-ins for processing. The advantage of this approach is good extensibility: whenever a new format appears in the future, its handling can be written as a new plug-in and added to the plug-in management service program.
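
A minimal sketch of such a plug-in management service (the registry, decorator, and handler names are illustrative assumptions, not any real system's API):

    # A minimal plug-in manager: handlers register for the content types
    # they can extract, and the manager dispatches each fetched document.
    from typing import Callable, Dict

    # content type -> extractor returning plain text
    _registry: Dict[str, Callable[[bytes], str]] = {}

    def plugin(content_type: str):
        """Decorator that registers an extractor for one content type."""
        def register(func: Callable[[bytes], str]):
            _registry[content_type] = func
            return func
        return register

    @plugin("text/html")
    def extract_html(raw: bytes) -> str:
        # Real code would strip tags as in the earlier sketch.
        return raw.decode("utf-8", errors="replace")

    @plugin("text/plain")
    def extract_plain(raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")

    def extract(content_type: str, raw: bytes) -> str:
        """Dispatch to the registered plug-in; unknown formats are skipped."""
        handler = _registry.get(content_type)
        if handler is None:
            return ""    # no plug-in yet: ignore, or log for later support
        return handler(raw)

    print(extract("text/plain", b"hello spider"))  # -> hello spider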

Update Cycle

Because website content changes constantly, a web spider also needs to keep its captured pages up to date. This requires the spider to rescan websites on a certain cycle, determining which pages need to be updated, which pages are new, and which links have expired into dead links.
The update cycle has a great impact on the recall of a search engine. If the cycle is too long, some newly created pages can never be found; if it is too short, the technical implementation becomes difficult and bandwidth and server resources are wasted. A search engine does not update all websites on the same cycle. Important, frequently updated websites get a short cycle: some news sites, for instance, are updated every few hours. Conversely, unimportant websites get a long cycle and may be updated only once every month or two.
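
One simple way to realize per-site update cycles is a priority queue keyed by next-due time; the sketch below (the site names and intervals are illustrative assumptions) always pops whichever site is due soonest and reschedules it:

    import heapq
    import time

    # Illustrative per-site update intervals in seconds: news sites are
    # revisited every few hours, minor sites every month or two.
    SITE_INTERVALS = {
        "news.example.com": 4 * 3600,
        "blog.example.org": 7 * 24 * 3600,
        "static.example.net": 45 * 24 * 3600,
    }

    def make_schedule(now: float):
        """Build a min-heap of (next_due_time, site) entries."""
        heap = [(now, site) for site in SITE_INTERVALS]
        heapq.heapify(heap)
        return heap

    def next_site_to_crawl(heap):
        """Pop the site that is due soonest and reschedule it."""
        due, site = heapq.heappop(heap)
        heapq.heappush(heap, (due + SITE_INTERVALS[site], site))
        return site, due

    heap = make_schedule(time.time())
    site, due = next_site_to_crawl(heap)
    print(site, "due at", time.ctime(due))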
Generally, when refreshing a website's content, the web spider does not need to recrawl every page. For most pages it is enough to fetch the page attributes (mainly the date), compare them with the attributes captured last time, and skip the page if they are unchanged.
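
On the web, this attribute check maps naturally onto an HTTP conditional request; a minimal sketch with Python's standard urllib (the URL and date below are placeholders):

    import urllib.request
    import urllib.error

    def needs_update(url: str, last_fetched: str) -> bool:
        """Ask the server whether the page changed since the last crawl.

        last_fetched is an HTTP date string saved from the previous
        response's Last-Modified header. A 304 reply means 'unchanged'.
        """
        req = urllib.request.Request(url, headers={
            "If-Modified-Since": last_fetched,
        })
        try:
            with urllib.request.urlopen(req):
                return True           # 200: page changed, re-extract it
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return False          # Not Modified: keep the old copy
            raise                     # other errors: let the caller decide

    # Example (placeholder URL and date):
    # needs_update("http://example.com/page.html",
    #              "Sat, 01 Jan 2005 00:00:00 GMT")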

Conclusion

This article has mainly discussed the technical points of web spiders. Actually designing a web spider requires learning many more technical details; for further reading, see the references listed below.

The web spider plays an important role in a search engine and affects how much data the engine can cover; moreover, the quality of the web spider directly determines the number of dead links on the search results page (that is, links whose target page no longer exists). How to discover more pages, how to extract page content correctly, how to download dynamic pages, how to improve crawling speed, and how to recognize pages with identical content within a site are all problems that web spiders still need to solve.

More References

Note: Some of the references below were not published as papers in journals and therefore have no formal source; download links for them can be found by searching for the article titles on Google or Baidu.
[1] Chinese search engine technology unveiling: Chinese word segmentation.
[2] Chinese search engine technology unveiling: Sorting technology.
[3] Chinese search engine technology unveiling: System architecture.
[4] Robots & Spiders & Crawlers: How web and intranet search engines follow links to build indexes. Avi Rappoport, 2001.
[5] Guidelines for Robot Writers. Martijn Koster, 1993.
