Bo Hundred Excellent original content, continued: how web spiders handle files

Source: Internet
Author: User


Last time, in "Web Spider Technology (1)" from Bo Hundred Excellent, we talked about web spider technology. Today we mainly discuss how web spiders process files:

(i) Binary file processing

Besides the many HTML and XML files on the network, there are also a great many binary files. To make web pages richer, images and multimedia files are referenced heavily. They also appear as hyperlinks in web pages, so during the link extraction phase they too are placed in the queue of URLs to visit. Indexing a binary file by its contents is impractical, because current technology cannot yet understand what such a file means from its binary data.

These files are therefore usually handled separately, and understanding their content depends on the anchor description of the binary file. The anchor description usually gives the file's title or basic content. Anchor information is generally supplied by the referencing web page rather than by the binary file itself. Because binary files come in many different kinds, each kind also needs its own handling.
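As a rough illustration of the idea above, the following Python sketch collects the anchor text of links that point to binary files, so that the text can stand in for the file's content in an index. It is only a minimal sketch under stated assumptions: the extension list BINARY_EXTENSIONS, the class AnchorCollector and the function index_binary_links are hypothetical names introduced here, and a real spider would more likely decide by the Content-Type header than by file extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumption: a hypothetical list of extensions treated as binary files.
BINARY_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".pdf", ".zip"}


class AnchorCollector(HTMLParser):
    """Collects (absolute URL, anchor text) pairs for links to binary files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.binary_links = []  # list of (url, anchor_text)
        self._href = None       # target of the <a> element currently open
        self._text = []         # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self._href = urljoin(self.base_url, attributes["href"])
            self._text = []
        elif tag == "img" and self._href is not None and attributes.get("alt"):
            # An image inside a link: its alt text acts as the description.
            self._text.append(attributes["alt"])

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            ext = posixpath.splitext(urlparse(self._href).path)[1].lower()
            if ext in BINARY_EXTENSIONS:
                # Collapse runs of whitespace in the anchor text.
                text = " ".join(" ".join(self._text).split())
                self.binary_links.append((self._href, text))
            self._href = None


def index_binary_links(base_url, html):
    """Returns the anchor descriptions of all binary links found in html."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.binary_links
```

For a page at http://example.com/docs/ containing a link to /files/report.pdf with the text "Annual report 2010", the function would return that URL paired with "Annual report 2010", while links to ordinary HTML pages are left out.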

(ii) Processing of script files

The script files here are the client-side scripts included in a web page. They run once the page has been downloaded to the client and usually carry out some simple interaction on the client. Scripts are typically responsible for how the page is displayed, but with the widespread use of Ajax they are also responsible for interacting with the server.

Because scripting languages are diverse and complex, analyzing and processing them is tantamount to writing a simple web page parsing program. Since script files are hard to handle, many small search engines simply skip them. However, web designers increasingly want pages that update without a refresh and rely on Ajax to achieve it, so ignoring scripts means losing a great deal of content.

(iii) Handling different file types

Extracting and analyzing web page content has always been an important technical step for web spiders. To handle the different file types on the web, spiders usually use plug-ins. A plug-in management program manages the plug-ins and, for each type of file to be processed, invokes the matching plug-in. The plug-in form is chosen mainly for the sake of extensibility.

There are many different types of files on the Internet, and different files require completely different handling. The network is also constantly changing, so a new file type may appear at any time. The easiest way to support a new type is to write a new plug-in for it and then register that plug-in with the management program. The plug-in is best written by the creator of the new file format, since usually only the vendor fully understands how the new format is defined.
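The plug-in arrangement described above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the design of any particular search engine: the class PluginManager, its register and process methods, and the two example plug-ins are hypothetical names, and keying plug-ins by MIME type is one possible choice among several.

```python
from html.parser import HTMLParser


class PluginManager:
    """Hypothetical manager that dispatches content to a plug-in by MIME type."""

    def __init__(self):
        self._plugins = {}  # MIME type -> callable(bytes) -> str

    def register(self, content_type, plugin):
        # Supporting a new file type only requires one more registration.
        self._plugins[content_type.lower()] = plugin

    def process(self, content_type, payload):
        # Drop parameters such as "; charset=utf-8" before the lookup.
        key = content_type.split(";", 1)[0].strip().lower()
        plugin = self._plugins.get(key)
        if plugin is None:
            return None  # no plug-in for this type, so the file is skipped
        return plugin(payload)


class _TextOnly(HTMLParser):
    """Keeps only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_html_text(payload):
    """Example plug-in: strips tags and returns the visible text."""
    parser = _TextOnly()
    parser.feed(payload.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def extract_plain_text(payload):
    """Example plug-in: decodes plain text as UTF-8."""
    return payload.decode("utf-8", errors="replace")


manager = PluginManager()
manager.register("text/html", extract_html_text)
manager.register("text/plain", extract_plain_text)
```

With this layout the spider itself never changes when a new format appears: whoever understands the format writes one function and registers it under the matching type, and files of unknown types are simply skipped until such a plug-in exists.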

SEO targets search engines, and the spider program is at their core, so understanding how spiders work matters a great deal for doing SEO. The ranking for Bo Hundred Excellent has not changed so far, and we are waiting for the next update. Today we saw that Chen Jinxian ranks second, and we will share more with everyone later.

Writing this article was not easy. If you reproduce it, please credit the source: http://www.51bobaiyou.com/post/49.html
