By analyzing a site's log files, we can see how users and search engine spiders behave when they visit the site. This data lets us analyze what users and spiders prefer on the site, and how healthy the site is. In log analysis, the first thing we need to examine is spider behavior.
During crawling and indexing, a search engine allocates a certain amount of resources to a site according to the site's weight. A search-engine-friendly site should make full use of these resources, so that spiders can quickly, accurately, and comprehensively crawl the valuable content that users like, instead of wasting resources on useless content or on pages that return errors.
However, because the amount of data in a web log is so large, we generally need a log analysis tool to inspect it. Commonly used log analysis tools include the Light-years log analysis tool and Web Log Explorer.
When analyzing logs, for a single day's log file we need to look at: number of visits, stay time, crawl volume, directory crawl statistics, page crawl statistics, spider IP addresses, HTTP status codes, spider active hours, spider crawl paths, and so on. For log files covering many days we need to analyze: the trend in spider visits, the trend in stay time, the overall crawl trend, the crawl trend of each directory, crawl time periods, and the spider's activity cycle.
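All of the items listed above come from individual log lines, so a first step when working without a dedicated tool is to split each line into fields. The following is a minimal sketch, assuming the server writes the Apache/Nginx "combined" log format; the sample line, its IP address, and the `parse_line` helper are illustrative assumptions, not something taken from the tools named above.

```python
import re
from datetime import datetime

# Assumed layout: Apache/Nginx "combined" log format. Adjust the pattern
# if the server is configured to write a different format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical sample line; the IP is from a documentation-only range.
sample = (
    '203.0.113.7 - - [12/Mar/2014:08:15:42 +0800] '
    '"GET /news/1.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)
record = parse_line(sample)
print(record["ip"], record["url"], record["status"], record["time"].hour)
```

Spider requests can then be selected by checking `record["agent"]` for the spider's name, keeping in mind that the user agent alone can be forged.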
Below, the editor at Xiaonaodai (through-train bidding software) walks through how to analyze web logs.
Website log data analysis and interpretation:
1, Number of visits, stay time, crawl volume
From these three figures we can derive: the average number of pages crawled per visit, the stay time per crawled page, and the average stay time per visit.
Average pages crawled per visit = total crawl volume / number of visits
Stay time per crawled page = average stay time per visit / average pages crawled per visit
Average stay time per visit = total stay time / number of visits
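As a worked illustration of the three formulas, here is a small sketch. The function name and the input figures (40 visits, 3600 seconds of stay time, 1200 pages) are made up for the example and do not come from a real log.

```python
def crawl_averages(visits, total_stay_seconds, pages_crawled):
    """Compute the three averages from one day's spider totals."""
    if visits <= 0 or pages_crawled <= 0:
        raise ValueError("visits and pages_crawled must be positive")
    pages_per_visit = pages_crawled / visits
    stay_per_visit = total_stay_seconds / visits
    # Same value as total_stay_seconds / pages_crawled.
    stay_per_page = stay_per_visit / pages_per_visit
    return pages_per_visit, stay_per_page, stay_per_visit


# Hypothetical daily totals for one spider.
pages_per_visit, stay_per_page, stay_per_visit = crawl_averages(
    visits=40, total_stay_seconds=3600, pages_crawled=1200
)
print(pages_per_visit, stay_per_page, stay_per_visit)  # 30.0 3.0 90.0
```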
From these data we can see how active the spider is, how well it gets along with the site, how deep it crawls, and so on. The higher the total number of visits, the stay time, the crawl volume, the average pages crawled per visit, and the average stay time per visit, the more the search engine likes the site. The stay time per crawled page reflects how fast pages are served: the longer it is, the slower the site responds, which is less favorable for crawling and indexing. We should therefore improve page loading speed as much as possible and reduce the stay time per page, so that the crawler's resources go toward crawling and indexing more pages.
In addition, from these data we can chart the site's overall trend over a period of time, such as the trend in the number of spider visits, the trend in stay time, and the trend in crawl volume.
2, Directory crawl statistics
Through log analysis we can see which directories of the site the spider favors, how deep it crawls into each directory, the crawl status of important pages, the crawl status of useless pages, and so on. By comparing the number of pages crawled with the number of pages indexed under each directory, we can uncover more problems. For important directories, we need to make on-site and off-site adjustments to raise their weight and crawl volume; useless pages should be blocked in robots.txt.
In addition, from log statistics collected over many days, we can see the effect that on-site and off-site actions have on a directory, whether the optimization is reasonable, and whether it has achieved the desired result. For the same directory observed over a long period, we can see how the pages under it perform and infer the reasons for that performance from the actions taken.
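The per-directory crawl counts discussed in this section can be sketched as follows. This is only an illustration: it groups by the first path segment, the `top_directory` helper is a hypothetical name, and the URL list stands in for the spider requests extracted from a real log.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_directory(url):
    """Return the first-level directory of a URL path, or "/" for root files."""
    path = urlsplit(url).path
    parts = [p for p in path.split("/") if p]
    # A path such as /about.html has no directory, so count it under "/".
    if len(parts) < 2 and not path.endswith("/"):
        return "/"
    return "/" + parts[0] + "/" if parts else "/"


# Hypothetical URLs requested by one spider in one day.
crawled = [
    "/news/1.html",
    "/news/2.html?page=2",
    "/product/a/b.html",
    "/about.html",
    "/news/",
]
directory_counts = Counter(top_directory(url) for url in crawled)
for directory, count in directory_counts.most_common():
    print(directory, count)
```

Running the same count on each day's log and comparing the results gives the per-directory trend over time.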
3, Page crawl
In log analysis we can see exactly which pages the spider crawled. Among these pages, we can identify pages that should have been blocked from crawling, pages that have no indexing value, duplicate page URLs, and so on. To make full use of the spider's resources, we need to disallow these URLs in robots.txt.
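Before relying on robots.txt rules, it can help to check that they actually block the URLs found in the log. The sketch below uses Python's standard `urllib.robotparser`; the rules and the `/search/` and `/tag/` paths are hypothetical examples, not rules recommended by the original text.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking two low-value sections.
rules = """\
User-agent: *
Disallow: /search/
Disallow: /tag/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_blocked(url_path, agent="Baiduspider"):
    """Return True if the rules above forbid this agent from fetching the path."""
    return not parser.can_fetch(agent, url_path)


for path in ["/search/?q=seo", "/tag/logs/", "/news/1.html"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Note that this checks the rules as Python's parser interprets them; individual search engines may handle edge cases such as wildcards differently.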
In addition, we can analyze why pages are not indexed. For a new article, is it missing from the index because it has not been crawled yet, or because it was crawled but not released? Some pages have little reading value but may still be needed as crawl channels; for these pages we should add a noindex tag. On the other hand, would a spider really be so slow-witted that it depends on these meaningless channel pages to reach other pages? Does the spider not understand the sitemap? (The author is puzzled by this and would welcome guidance.)
4, Spider visit IP
Some have proposed judging whether a site has been penalized from the IP ranges of the visiting spiders. I feel this is not very meaningful, because it relies too much on hindsight. Moreover, a penalty should be judged mainly from the previous three kinds of data; judging from IP ranges alone means little. IP analysis is more useful for determining whether the site is being visited by scraper spiders, fake spiders, or malicious click spiders.
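One common way to tell a fake spider from a real one is a reverse DNS lookup on the IP followed by a forward lookup on the returned hostname. The sketch below illustrates that check under stated assumptions: the hostname suffixes should be confirmed against each search engine's own documentation, and the stub lookup functions exist only so the example runs without network access.

```python
import socket

# Hostname suffixes to accept per spider. Treat these as assumptions and
# confirm them against each search engine's published guidance.
SPIDER_SUFFIXES = {
    "Baiduspider": (".baidu.com", ".baidu.jp"),
    "Googlebot": (".googlebot.com", ".google.com"),
}


def is_genuine_spider(ip, spider,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the suffix, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(SPIDER_SUFFIXES[spider]):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Stub lookups with made-up data, used here instead of real DNS queries.
def stub_reverse(ip):
    return ("baiduspider-203-0-113-7.crawl.baidu.com", [], [ip])


def stub_forward(hostname):
    return (hostname, [], ["203.0.113.7"])


print(is_genuine_spider("203.0.113.7", "Baiduspider", stub_reverse, stub_forward))
```

In real use the two lookup arguments are left at their defaults, so the function queries DNS through the `socket` module.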
5, Access status codes
Spiders frequently encounter status codes such as 301 and 404. When these status codes appear, deal with them promptly to avoid a negative impact on the site.
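A simple tally makes it easy to see which status codes the spider received and which URLs produced them. This is a minimal sketch; the `status_report` helper and the sample hits are invented for illustration.

```python
from collections import Counter, defaultdict


def status_report(hits):
    """Count status codes and collect the URLs behind every non-2xx response."""
    counts = Counter(status for _, status in hits)
    problem_urls = defaultdict(set)
    for url, status in hits:
        if status >= 300:
            problem_urls[status].add(url)
    return counts, {status: sorted(urls) for status, urls in problem_urls.items()}


# Hypothetical (url, status) pairs for one spider.
hits = [
    ("/news/1.html", 200),
    ("/old-page.html", 301),
    ("/missing.html", 404),
    ("/missing.html", 404),
    ("/news/2.html", 200),
]
counts, problems = status_report(hits)
print(counts[200], counts[301], counts[404])
print(problems[404])
```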
6, Crawl time periods
By analyzing and comparing a spider's hourly crawl volume over several days, we can find out at which times a particular spider is most active on the site. By comparing weekly data, we can see a particular spider's activity cycle over the week. Knowing this offers guidance on when to update the site's content, whereas the rules of thumb circulated previously are not scientific claims.
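The hourly and weekly comparison can be sketched by bucketing request timestamps. The timestamps below are invented sample data, and `activity_profile` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def activity_profile(timestamps):
    """Count requests per hour of day and per weekday (0 = Monday)."""
    by_hour = Counter(t.hour for t in timestamps)
    by_weekday = Counter(t.weekday() for t in timestamps)
    return by_hour, by_weekday


# Hypothetical request times for one spider.
timestamps = [
    datetime(2014, 3, 10, 2, 5),
    datetime(2014, 3, 10, 2, 40),
    datetime(2014, 3, 10, 14, 0),
    datetime(2014, 3, 11, 2, 15),
]
by_hour, by_weekday = activity_profile(timestamps)
peak_hour = by_hour.most_common(1)[0][0]
print("peak hour:", peak_hour)
print("requests per weekday:", dict(by_weekday))
```

With several weeks of data, the same counters show whether the peak hour and the busiest weekday are stable enough to schedule content updates around.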
7, Spider crawl path
In the site log we can trace the access path of a specific IP. If we follow the visit trail of a particular spider, we can discover that spider's crawl-path preferences within the site's structure. With that knowledge, we can guide the spider's crawl path appropriately, so that it crawls more of the important, valuable, and newly updated pages. The crawl preference can be analyzed from two angles: the physical structure of the pages and the logical structure of the URLs. This lets us look at our own site from the search engine's perspective.
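Tracing the path of one IP amounts to filtering the log by that IP and ordering its requests by time. Here is a minimal sketch with made-up records; `crawl_path` is a hypothetical helper, and the IPs come from documentation-only ranges.

```python
from datetime import datetime


def crawl_path(records, ip):
    """Return the URLs requested by one IP, in chronological order."""
    visits = sorted((time, url) for rec_ip, time, url in records if rec_ip == ip)
    return [url for _, url in visits]


# Hypothetical (ip, time, url) records, deliberately out of order.
records = [
    ("203.0.113.7", datetime(2014, 3, 12, 8, 15, 50), "/news/"),
    ("198.51.100.9", datetime(2014, 3, 12, 8, 15, 45), "/product/"),
    ("203.0.113.7", datetime(2014, 3, 12, 8, 15, 42), "/"),
    ("203.0.113.7", datetime(2014, 3, 12, 8, 16, 3), "/news/1.html"),
]
path = crawl_path(records, "203.0.113.7")
print(" -> ".join(path))
```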
(This article is provided by the Xiaonaodai Baidu/360 through-train bidding software trial site: www.xiaonaodai.com. Please retain this notice when reprinting.)