Last time, in "Web Spider Technology (1)" on Bo Hundred Excellent, we introduced the web spider. Today we mainly look at how a web spider processes the files it encounters:
(i) Binary file processing
Besides the large number of HTML and XML files on the network, there are also many binary files. To make web pages richer, pictures and multimedia files are referenced heavily. They also appear as hyperlinks in web pages, so they too are placed in the queue of URLs to visit during the link extraction phase. Indexing a binary file from its own contents is impractical: the technology has not yet reached the point where the content of a file can be understood from its binary data.
These files are therefore usually handled separately, and understanding their content relies on the anchor description of the binary file. The anchor description usually gives the title or basic content of the file. Anchor information is generally supplied by the web page that references the file, not by the binary file itself. Because binary files come in many kinds, each kind needs its own handling.
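The anchor-based approach can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the method of any particular search engine; the extension list BINARY_EXTENSIONS and the names AnchorCollector and binary_anchor_index are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath

# Assumption for this sketch: a small, hypothetical list of binary extensions.
BINARY_EXTENSIONS = {".jpg", ".png", ".gif", ".mp3", ".mp4", ".pdf", ".zip"}


class AnchorCollector(HTMLParser):
    """Collects (url, anchor text) pairs for links that point at binary files."""

    def __init__(self):
        super().__init__()
        self.binary_links = []  # list of (url, anchor_text) tuples
        self._href = None       # href of the <a> element currently open
        self._text = []         # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Decide by file extension; the binary file itself is never opened.
            ext = posixpath.splitext(urlparse(self._href).path)[1].lower()
            if ext in BINARY_EXTENSIONS:
                anchor_text = " ".join("".join(self._text).split())
                self.binary_links.append((self._href, anchor_text))
            self._href = None


def binary_anchor_index(html):
    """Return the anchor descriptions that the referencing page gives its binary links."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.binary_links
```

The description that would be indexed comes entirely from the referencing page; a real spider would likely also consult attributes such as alt or title, which this sketch leaves out.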
(ii) Processing of script files
The script files discussed here are the client-side scripts included in a web page. They run once the page has been downloaded to the client and usually carry out some simple interaction on the client side. Scripts are typically responsible for how a page is displayed, but with the widespread use of Ajax they are also responsible for interacting with the server.
Because scripting languages are diverse and complex, analyzing and processing them is tantamount to building a simple web page parsing program. Because script files are so hard to deal with, many small search engines simply skip them. However, web designers increasingly require pages that update without a refresh and make heavy use of Ajax, so ignoring scripts means losing a great deal of content.
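Between skipping scripts entirely and fully interpreting them, a spider could at least record which scripts a page loads and scan inline scripts for URL-like strings. The sketch below is a rough heuristic offered only as an illustration, not a technique described in this article; the class name ScriptScanner and the pattern URL_IN_SCRIPT are assumptions made for the example, and it does not execute any script.

```python
import re
from html.parser import HTMLParser

# Assumption: quoted strings that start with http(s):// or / are treated as candidate URLs.
URL_IN_SCRIPT = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")


class ScriptScanner(HTMLParser):
    """Records external script URLs and URL-like strings found in inline scripts."""

    def __init__(self):
        super().__init__()
        self.external_scripts = []  # values of <script src="...">
        self.candidate_urls = []    # URL-like strings found in inline script text
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_scripts.append(src)
            else:
                self._in_inline_script = True

    def handle_data(self, data):
        if self._in_inline_script:
            self.candidate_urls.extend(URL_IN_SCRIPT.findall(data))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False


def scan_scripts(html):
    """Return (external script URLs, candidate URLs from inline scripts)."""
    scanner = ScriptScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.external_scripts, scanner.candidate_urls
```

A pattern match like this misses URLs that are built at run time, which is exactly why full script processing is so much harder than handling static HTML.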
(iii) Handling of different file types
Extracting and analyzing web page content has always been an important technical step for web spiders. To handle the different file types found on the web, spiders usually rely on plug-ins. A fairly intelligent plug-in manager is responsible for the various plug-ins and invokes the appropriate one for each type of file to be processed. Plug-ins are used mainly for the sake of extensibility.
There are many types of files on the Internet, different types require completely different handling, and because the network keeps changing, a new file type can appear at any time. The easiest way to support a new type is to write a new plug-in for it and register that plug-in with the manager. Ideally the plug-in is written by the creator of the new file format, since usually only that vendor fully understands the format's definition.
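The plug-in arrangement can be sketched as a small registry keyed by content type. This is a minimal illustration under stated assumptions, not the design of any real spider: the names PluginManager, html_plugin and plain_text_plugin are hypothetical, and dispatching on the MIME type is just one possible choice of key.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Helper that gathers the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_plugin(payload):
    """Hypothetical plug-in: reduce an HTML document to its visible text."""
    extractor = _TextExtractor()
    extractor.feed(payload.decode("utf-8", errors="replace"))
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def plain_text_plugin(payload):
    """Hypothetical plug-in: decode plain text as-is."""
    return payload.decode("utf-8", errors="replace").strip()


class PluginManager:
    """Maps a content type to the plug-in that knows how to process it."""

    def __init__(self):
        self._plugins = {}

    def register(self, content_type, handler):
        self._plugins[content_type.lower()] = handler

    def process(self, content_type, payload):
        # Drop parameters such as "; charset=utf-8" before the lookup.
        key = content_type.split(";", 1)[0].strip().lower()
        handler = self._plugins.get(key)
        if handler is None:
            return None  # no plug-in registered for this type yet
        return handler(payload)


manager = PluginManager()
manager.register("text/html", html_plugin)
manager.register("text/plain", plain_text_plugin)
```

Supporting a new format then requires only one more register call with a new handler; the manager and the existing plug-ins stay untouched, which is the extensibility argument made above.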
SEO is aimed at search engines, and the spider program is their core, so understanding and mastering how spiders work matters a great deal for those of us doing SEO. The ranking for "Bo Hundred Excellent" has not changed so far and is still waiting for an update; "Chen Jinxian", which I checked today, ranks second. I will share more with everyone later.
Writing this article was not easy; if you repost it, please credit the source: http://www.51bobaiyou.com/post/49.html