A search engine builds its index from text files. For a web spider, however, the crawled pages come in many formats: HTML, pictures, DOC, PDF, multimedia, dynamic pages, and more. After these files are crawled, the text in them must be extracted. Extracting this information accurately matters in two ways: it directly affects the precision of the search engine, and it affects whether the spider can correctly follow the links those documents contain.
For documents such as DOC and PDF, which are produced by professional software vendors, the vendor usually provides a text-extraction interface. The web spider only needs to call the interface of such a plug-in to extract the text and other related information from the document with ease.
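As a rough illustration of that call pattern, here is a minimal Python sketch. The `TextExtractor` protocol, `FakePdfExtractor`, and `spider_read` are hypothetical names standing in for whatever interface a real vendor ships; only the shape of the interaction matters.

```python
from typing import Protocol


class TextExtractor(Protocol):
    """The interface a vendor's extraction plug-in is assumed to expose."""

    def extract(self, path: str) -> str: ...


class FakePdfExtractor:
    """Stand-in for a real vendor PDF plug-in; returns placeholder text."""

    def extract(self, path: str) -> str:
        return f"(text extracted from {path})"


def spider_read(path: str, extractor: TextExtractor) -> str:
    # The spider knows nothing about the file format's internals;
    # it only calls the plug-in's extraction interface.
    return extractor.extract(path)


print(spider_read("whitepaper.pdf", FakePdfExtractor()))
```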
HTML documents are a different case. HTML has its own syntax, using different tags to mark up fonts, colors, position, and other layout, and these tags must be filtered out to extract the text. Filtering the tags is not hard, because they follow fixed rules; the spider simply reads off the information each tag carries. But while recognizing the tags, it also needs to record a lot of layout information at the same time, such as a word's font size, whether it appears in a title, whether it is bold, whether it is a page keyword, and so on, all of which help estimate how important that word is within the page. At the same time, besides the title and body text, an HTML page carries many advertising links and shared channel links that have nothing to do with the body text, and these useless links must also be filtered out when extracting the page content. For example, suppose a site has a "product introduction" channel whose link appears in the navigation bar of every page; if such navigation links are not filtered, a search for "product introduction" will match every page of the site and produce a lot of junk results. Filtering these invalid links requires gathering statistics over a large number of page structures, extracting the common patterns, and filtering them uniformly; important sites with distinctive layouts also need individual handling. This means the spider's design must have a certain degree of extensibility.
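As a small sketch of this tag filtering, the example below uses only Python's standard library to strip tags while keeping the kind of layout hint the paragraph describes. The tag set and the weight of 2 are arbitrary illustrative choices, not a real ranking formula, and navigation-link filtering is not shown.

```python
from html.parser import HTMLParser


class TextWithWeights(HTMLParser):
    """Strips tags and keeps a simple importance weight per word."""

    EMPHASIS_TAGS = {"title", "h1", "h2", "b", "strong"}

    def __init__(self):
        super().__init__()
        self.open_tags = []      # tags currently open around the text
        self.words = []          # (word, weight) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Words inside title/heading/bold tags get a higher weight.
        weight = 2 if self.EMPHASIS_TAGS & set(self.open_tags) else 1
        for word in data.split():
            self.words.append((word, weight))


parser = TextWithWeights()
parser.feed("<title>Product introduction</title><p>Plain <b>bold</b> text</p>")
print(parser.words)
# [('Product', 2), ('introduction', 2), ('Plain', 1), ('bold', 2), ('text', 1)]
```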
For multimedia files, pictures, and the like, the content is generally judged from the anchor text of links pointing at the file (that is, the linked text) and from related file comments. For example, if a link whose text reads "Maggie Cheung's photo" points to a picture in BMP format, the spider can conclude that the picture shows a photo of Maggie Cheung, so searching for "Maggie Cheung" and "photo" will let the search engine find this picture. In addition, many multimedia files carry file attributes that can be examined to understand the file's content better.
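A minimal sketch of that anchor-text idea: collect the link text pointing at each image-like file, so the file can be indexed by words it does not itself contain. The HTML snippet and file names here are illustrative.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Maps each linked image file to the anchor text that points at it."""

    IMAGE_SUFFIXES = (".bmp", ".jpg", ".gif", ".png")

    def __init__(self):
        super().__init__()
        self.current_href = None
        self.index = {}          # file URL -> list of anchor texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(self.IMAGE_SUFFIXES):
                self.current_href = href

    def handle_data(self, data):
        if self.current_href and data.strip():
            self.index.setdefault(self.current_href, []).append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


collector = AnchorTextCollector()
collector.feed('<a href="zhang.bmp">Maggie Cheung photo</a>')
print(collector.index)   # {'zhang.bmp': ['Maggie Cheung photo']}
```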
Dynamic pages have always been a problem for web spiders. A dynamic page, as opposed to a static one, is generated automatically by a program. The benefits are that the site's style can be changed quickly and uniformly and that server storage is reduced, but dynamic pages also make the spider's crawl harder. As development languages multiply, so do the kinds of dynamic pages: ASP, JSP, PHP, and so on. These types are still relatively easy for a spider to handle. What spiders find harder are pages generated by scripting languages such as VBScript and JavaScript; to process these well, a spider needs its own script interpreter. Sites that keep most of their data in a database are harder still: their information can only be reached through database queries, which makes crawling very difficult. For such sites, if the designer wants the data to be searchable by search engines, the site needs to provide a way to traverse the entire contents of the database.
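One simple way to "provide a way to traverse the database" is to generate a plain, static page that links to every record, so a spider that cannot submit query forms can still reach the content. The sketch below assumes a hypothetical articles table and URL pattern; real schemas and URLs would differ.

```python
import sqlite3

# Hypothetical schema: an articles table with an id and a title.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES ('Product introduction')")
conn.execute("INSERT INTO articles (title) VALUES ('Company news')")

# One static page linking to every record gives the spider a path
# into content it could otherwise reach only through query forms.
links = [
    f'<li><a href="/article.php?id={row[0]}">{row[1]}</a></li>'
    for row in conn.execute("SELECT id, title FROM articles ORDER BY id")
]
with open("site-map.html", "w", encoding="utf-8") as f:
    f.write("<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>")
```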
Page content extraction has always been an important spider technology. The whole system is generally built from plug-ins coordinated by a plug-in management service: when a page of a given form is encountered, the matching plug-in processes it. The advantage of this approach is good extensibility: whenever a new type is discovered later, support for it can be added to the plug-in management service as one more plug-in.
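A minimal sketch of such a plug-in management service, with a registry that maps a page type to its handler. All names here are illustrative, and the handler bodies are placeholders.

```python
extractors = {}


def plugin(page_type):
    """Registers a handler with the plug-in management service."""
    def wrap(func):
        extractors[page_type] = func
        return func
    return wrap


@plugin("html")
def extract_html(raw: bytes) -> str:
    # A real plug-in would strip tags here, as described earlier.
    return raw.decode("utf-8", errors="ignore")


@plugin("pdf")
def extract_pdf(raw: bytes) -> str:
    return "(text via the vendor's PDF interface)"   # placeholder


def extract(page_type: str, raw: bytes) -> str:
    handler = extractors.get(page_type)
    if handler is None:
        raise ValueError(f"no plug-in registered for {page_type!r}")
    return handler(raw)


print(extract("html", b"<p>hello</p>"))
```

Supporting a newly discovered page type then means registering one more function, which is exactly the extensibility the plug-in architecture is meant to provide.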
Search engines are not yet mature or perfect. As they adjust to fit people's search habits better, they will periodically clear out some of the pages they have already indexed. A drop in how many of your site's pages are indexed is not frightening in itself: as long as we analyze the cause seriously and face it honestly, the search engine will pay the traffic back. Below I discuss how to handle a drop in the indexed count.
I. A modest drop in the indexed count
If one day a search engine, whether Baidu or Google, drops part of your site's indexed pages, and that part is one-tenth of the total or less, there is no need to worry; this is probably just a small adjustment on the search engine's side, which is normal. For example, my site, Dew CMS, lost fewer than 200 indexed pages in this big Baidu update, which is nothing. People catch colds, so machines can hardly be immune; programs have bugs, and calculations can go wrong.
II. A sharp drop in the indexed count
If your indexed count falls by half or more within a week, you must pay attention; this is no time to be careless. Check everything you can think of: whether the hosting space has problems, whether the page code has been maliciously altered, whether the site was recently redesigned, what the latest daily traffic chart and keyword trends show, whether there is any content that violates the rules, and so on.
III. The indexed count drops to zero
If the indexed count drops to zero overnight, most of the time it is because a major search engine has made sweeping changes to its parameters, and your site no longer fits them, so it zeroes you out. My site aabc.cn was zeroed overnight in just this way, and recovering from it takes a while. If the domain is not very important to you, you can let it go; otherwise, put more care into your updates: pay attention to their quality, write more original content, and get more links from high-weight sites. With luck you can recover part of the index within a week. Of course, if your site is zeroed by several search engines at the same time, you had better give it up; something about it may genuinely not suit the search engines' appetite. Under normal circumstances only Baidu zeroes out a site's index. Google very rarely does, and other engines such as Soso and Yahoo may also do it, but they do not send us much traffic anyway.
To sum up, we small and medium webmasters cannot do without search engines, least of all Baidu; only by adapting to them do our sites get the chance to grow and develop. I wish everyone good luck and happy days. The above is only my own, rather superficial, view; corrections and exchanges are welcome. QQ: 93065410. Dew CMS website: http://www.luzhuba.cn.