Search engine crawling statistics and analysis

I. Necessity of statistical analysis of search engine spider crawling:

Smooth crawling of a page by a spider is a prerequisite for that page to be indexed by a search engine. Only by knowing whether the search engine crawls the website, which pages it crawls, and what the server returns to the spider can we optimize and improve the site in a targeted way. Viewing spider crawl logs is therefore a very important but painful task, especially for SEOers and new webmasters. For example, if a spider crawl of a page returns "200 0 64" in the log, the page is very likely to be deleted from the search engine's index; if a HEAD request returns 404, the page is also about to be deleted. If we can spot these signs in the logs in time, we can adjust the site accordingly. Likewise, the 301 and 302 redirects and 404 errors returned to crawling spiders are important information for a webmaster. This is why analyzing spider crawl logs is necessary.

II. Methods for counting spider crawls:

Because crawlers do not fetch JS (most crawl it zero times or once), Flash, img, and other embedded resources when crawling a website, current third-party statistics services (such as the popular Chinese webmaster statistics systems and the Yahoo and Google statistics systems) cannot record spider crawls. The following methods are currently used to analyze spider crawling: 1. Use PHP or ASP to dynamically track and record visits based on the USER_AGENT string the client sends. This does achieve the goal, but its disadvantages are obvious (a minimal sketch of this approach follows the list below):

A) It increases server load. For sites with a lot of content and high weight, spiders crawl very frequently, and the tracking code inserted into every page adds to the server's burden.

B) Search engines prefer static pages, so many websites use a CMS to generate static files for their content, in which case no statistics can be collected at all. An SEO company in Hunan suggested calling the statistics script from the static files through an img or script tag, but after a month of testing I could not make that work either: spiders do not request those resources.
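For reference, here is a minimal PHP sketch of method 1, recording a visit based on the USER_AGENT string. The spider names and the log file path are illustrative assumptions, not the program described in this article.

<?php
// Minimal sketch of method 1: record a spider visit based on the
// USER_AGENT string the client sends. Spider names and the log file
// path are illustrative assumptions.
$spiders = array('Googlebot', 'Baiduspider', 'Yahoo! Slurp', 'Sogou');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($spiders as $name) {
    if (stripos($ua, $name) !== false) {
        $record = sprintf("%s\t%s\t%s\t%s\n",
            date('Y-m-d H:i:s'),
            $_SERVER['REMOTE_ADDR'],
            $name,
            $_SERVER['REQUEST_URI']);
        // Appending one record per spider hit is exactly the extra
        // server work mentioned in drawback A).
        file_put_contents('/path/to/spider.log', $record, FILE_APPEND | LOCK_EX);
        break;
    }
}
?>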

2. Use third-party log analysis tools, such as AWStats on Linux and Webalizer on Windows. Their disadvantages are also obvious: if you are a virtual-host user, the large log files generated every day make downloading them for every analysis very painful, and these tools are too specialized for the average webmaster.

3. If you have a better crawl-analysis method, please share it with other webmasters.

III. Summary of developing a log analysis tool for search engine spider crawl statistics:

1. The spider crawl information we care about in log analysis (a parsing sketch follows this list):

A) Crawl date: used to find the pattern of spider visits.

B) Spider IP address: spider IPs from different sources serve different functions. Combined with the crawl date and the request method (HEAD or GET), they reveal the pattern in more detail.

C) Request method: mainly HEAD and GET, which serve different purposes. A HEAD request generally appears after one or more 404 errors occurred the last time the spider visited the site; the spider sends a HEAD request to check whether the page still exists, and if the response is again 404, the page will be removed from the search engine's database. GET needs no explanation.

D) Crawled pages: which pages the spider crawls.

E) Status code: the status code the server returns to the spider. We generally care about 200, 301, 302, 304, and 404, especially 404, 301, and 302. A 404 means a dead link, which badly hurts website optimization; 301 and 302 are still handled poorly by search engines and can be suspected of cheating.

F) Traffic: the bytes transferred to spiders, worth watching and limiting to save valuable server resources.
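As a rough illustration of the fields above, the following PHP sketch parses one line of an Apache combined-format access log and extracts the date, IP, request method, page, status code, and traffic. The regular expression assumes the common combined format; it is not the program described in this article, and a different log layout needs a different pattern.

<?php
// Hedged sketch: pull the fields listed above out of one Apache
// combined-format log line. Adjust the pattern to your own log layout.
function parse_log_line($line)
{
    $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)/';
    if (!preg_match($pattern, $line, $m)) {
        return null; // line does not match the expected format
    }
    return array(
        'ip'     => $m[1],
        'date'   => $m[2],
        'method' => $m[3],  // HEAD or GET
        'page'   => $m[4],
        'status' => (int) $m[5],
        'bytes'  => $m[6] === '-' ? 0 : (int) $m[6],
    );
}

// Example: flag a 404 returned to a spider (user-agent matching omitted).
$fields = parse_log_line('123.125.71.1 - - [07/Aug/2008:10:00:00 +0800] "GET /a.html HTTP/1.1" 404 1024');
if ($fields !== null && $fields['status'] === 404) {
    echo $fields['page'] . " returned 404\n";
}
?>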

Based on the above, you can develop a simple but functional spider-crawl statistics program in a web language and deploy it in your own web space. You can then check the spider crawl logs anytime, anywhere, and avoid the pain of downloading log files (of course, if you run a dedicated server and are proficient with professional log analysis tools, this does not concern you). With these functions, combined with some third-party statistics tools, a webmaster can do without professional log analysis software.

2. Choice of development language: since the analysis program is deployed on a web server, portability matters a great deal. Among the web languages JSP, PHP, ASP, and ASP.NET, JSP is generally not supported by hosts, and ASP and .NET are not supported on Linux. The only real option is PHP, which is supported by both Windows and Linux hosts and is therefore the most portable.

3. Program extensibility: one person's work cannot meet every requirement, so the program separates data analysis from presentation. The spider data analysis module is a file of only seven lines of code, which can easily be rewritten for a different server log format. If you change hosts or log formats, you only need to rewrite the analysis module according to the interface standard we provide, which does not require advanced programming skill, and no other files need to be touched; alternatively, send us a sample of your log and we will rewrite it for you. The program can also limit statistics to specified spider types, which speeds up analysis, and it can analyze logs remotely.
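The interface standard itself is not reproduced in this article, so the following is only a hypothetical sketch of how such a swappable analysis module might look: all log-format-specific code lives in one small function in its own file, and the display code only ever calls that function.

<?php
// Hypothetical analysis module (e.g. analyse_module.php): the only file
// that knows the server's log format. Changing hosts or log formats
// means rewriting just this function, not the display code.
function spider_parse_line($line, $spider_names)
{
    $matched = null;
    foreach ($spider_names as $name) {
        if (stripos($line, $name) !== false) {
            $matched = $name;
            break;
        }
    }
    if ($matched === null) {
        return null; // not a spider record
    }
    // Field positions below assume a space-separated, Apache-style log.
    $parts = explode(' ', $line);
    return array('spider' => $matched, 'ip' => $parts[0], 'raw' => trim($line));
}
?>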

4. Difficulties and limitations: a major problem during development was analysis speed. Web logs run to tens or hundreds of megabytes, or even gigabytes, so when analyzing them with PHP we must weigh both server capacity and analysis speed, otherwise timeouts occur easily; an efficient algorithm is essential. Because there are so many log records, we also gave up on using a database: inserting hundreds of thousands of rows and querying millions of rows is quite painful, the server's CPU would spike to its peak under the load, and this data does not need to be stored for long anyway; after all, most users are on shared web hosting. To meet some webmasters' needs, the program writes the analyzed spider records to text files in a fixed format, so you can write a simple reader in any language, or insert them into a database yourself if you want to keep the logs long term. After comparing different algorithms, the best analysis speeds were as follows (a streaming-read sketch appears after the figures):

Local analysis environment: P4 1.7 GHz + M memory + Windows XP (notebook)
Log: 1 million lines; full analysis time: 10-15 seconds

VPS environment: 384 MB memory + Linux
Log: 1 million lines; full analysis time: 22-28 seconds

Remote analysis (the log and the analysis system are on different machines, so speed depends on the network between them):
Remote environment (where the log is stored): VPS, 384 MB memory + Linux, 10 Mbps shared bandwidth
Local environment (where the analysis system runs): P4 1.7 GHz + M memory + Windows XP (notebook), 2 Mbps ADSL dial-up
Log: 150,000 lines; full analysis time: 20-25 seconds

The remote analysis speed is therefore only about 1/10 of the local speed, so we recommend uploading the system to the web server to save valuable traffic.
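The article does not publish its actual algorithm, so the following is only a sketch of the general idea under my own assumptions: read the log line by line with fgets() so that even a multi-hundred-megabyte file never has to fit in memory, and append matching spider records straight to a text file instead of a database. The file paths and spider names are placeholders.

<?php
// Sketch only: stream a large access log line by line and append the
// spider records to a plain text file, avoiding both a full in-memory
// load and a database. Paths and spider names are assumptions.
$spiders = array('Baiduspider', 'Googlebot');
$in  = fopen('/path/to/access.log', 'r');
$out = fopen('/path/to/spider_records.txt', 'a');
$count = 0;

while ($in !== false && ($line = fgets($in)) !== false) {
    foreach ($spiders as $name) {
        // A cheap substring test per line keeps the cost low enough for
        // million-line logs; full field parsing would follow here.
        if (stripos($line, $name) !== false) {
            fwrite($out, $line);
            $count++;
            break;
        }
    }
}
if ($in !== false)  { fclose($in); }
if ($out !== false) { fclose($out); }
echo "spider records found: $count\n";
?>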

The above is my summary of developing this program. The core analysis and display functions are complete.


Demo URL: http://www.17buyhost.cn

More detailed development: http://www.17buyhost.cn/help.html

I hope webmasters can offer some advice, mainly on improving the efficiency of the algorithm. The download address will be released after further improvement. This article was first published on admin5; when reposting, please keep it intact.

Statement: This article is from admin5.com

http://www.admin5.com/article/20080807/97849.shtml
