Hello, I'm Brother Tao. At the event, many webmasters shared their thinking on site operation at the strategic level, and I found that quite a few friends felt the ideas were hard to explain without a concrete example. So I've taken an example from last year to share how we went from spotting a problem in log analysis to solving it, and finally to summarizing the lessons and optimizing the site operation process. Along the way I'll also go into the practical details of log analysis, which I hope will be helpful.
One link in website operation is vital: data monitoring and data analysis. Without it you simply don't know where the problems are. Crawler log analysis, as one part of data monitoring and analysis, is the most basic of all SEO techniques, but not the simplest. It is basic because every scientific SEO strategy must rest on data analysis, and logs are one of the few channels through which you can directly see how the search engine interacts with your site; they are first-hand data that arrive before any traffic does. It is not simple because of data storage and processing: as logs grow from dozens of MB to hundreds of MB, several GB, dozens of GB, hundreds of GB, and on to several TB, the tools and the deployment difficulty change completely. Dozens of MB can be split with UltraEdit or another text editor; at hundreds of MB you should be using the shell; at a few GB you can start considering MySQL or another row-oriented database to store the structured, field-split logs; at hundreds of GB you can move to Infobright or another column-oriented database; and once you reach the TB level only Hadoop Hive will do. Among SEOers actually using Hive, the only one I know of is the great Net Zero.
For 99% of webmaster friends, the shell is more than capable of handling log processing, and it is what I currently use. A quick introduction to the concept: the shell can be understood as the *nix equivalent of the Windows cmd command line, and the common shell commands for splitting logs are cat and awk (grep actually gets used a lot too, but to keep this article to a reasonable length I won't cover it here). cat's real job is to concatenate multiple files and print the result to standard output (stdout), i.e. the screen. awk is big enough that whole books have been written about it as a programming language; its key feature is that it splits each line of text into fields by a separator and lets you process them, with whitespace as the default delimiter. There is plenty of material on both commands online, and I suggest you take some time to study them.
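As a toy illustration of the two commands (the file names and the field number here are made up, not taken from our real logs):

# cat concatenates the files and writes the result to standard output
cat day1.log day2.log > merged.log

# awk splits each line into fields (default delimiter: whitespace);
# here we print the 9th field of every line and count how often each value appears
awk '{print $9}' merged.log | sort | uniq -c | sort -nr

# -F changes the delimiter, e.g. split each line on "/" instead and print the second piece
awk -F "/" '{print $2}' merged.log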
Back to the point. As part of our website operation data monitoring system, our company has built a fully automated crawler log analysis system; for what it can do, you can refer to Night Interest's blog. A fully automated log analysis script does the monitoring and early-warning job very well, but it is not omnipotent: when inexplicable errors turn up in the logs, you still have to get your own hands dirty. In the weekly log report for the second week of December last year, the 301 and 302 response codes returned to the Baidu crawler grew sharply, with this part of the crawl volume up by 60,000 in a single week. Since the amount a spider will crawl in a given period is roughly fixed, every crawl of a wrong page means some other page does not get crawled, and in this situation the inner pages suffer most. A fluctuation this drastic in crawl volume means something must be wrong, so the first step is to identify which pages are at fault.
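As an illustration only (the real system is far more elaborate), a minimal daily monitoring sketch in shell might look like the following. It assumes the logs are split per day into files named by date, that Baidu crawler hits can be identified by the string Baiduspider in the user agent, and that the status code sits in the ninth field; all of these are assumptions about a typical setup, not a description of our actual system:

# count Baidu crawler hits per status code for each day's log
for f in 2012120*.txt; do
  echo "== $f =="
  grep 'Baiduspider' "$f" | awk '{print $9}' | sort | uniq -c | sort -nr
done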
Here is a log line (the goal is to help you locate the specific awk fields in the examples below). Because of a non-compete agreement, I have masked the key parts:
[Screenshot: a sample log line with the sensitive parts masked]
Now let's construct the first query. From the log line above we know that the HTTP status code is in the ninth field, i.e. $9, the URI is in $7, and in our log format the host name sits in $16. We take all of last week's logs, extract every entry with a 302 status code, concatenate the host name and URI into a complete URL, break out the second-level directory of each URL, and sort by crawl count in descending order to get the TOP 10 directories returning the most 302s. Since I forgot to take a screenshot at the time, here is the code for your reference:
cat 20121203.txt 20121204.txt 20121205.txt 20121206.txt 20121207.txt 20121208.txt 20121209.txt | awk '{if ($9 ~ "302") print $16$7}' | awk -F "/" '{print $2}' | sort | uniq -c | sort -nr | head -n10
Among the 10 results, the ABC second-level directory (fine, I'm masking the real name) stood out: its count reached 20,000. Comparing with the previous week's 302 data, the ABC directory had 302s then too, but on the order of 1,000, and the pages under ABC that returned 302 back then were perfectly normal ones. At this point I began to suspect an anomaly in the ABC second-level directory, but the number of pages under ABC is still huge, so to find out exactly which kind of page is affected we have to keep breaking the data down. My idea was simple: keep digging down the URL's directory levels to pin down where the problem appears.
1. First, count how many times redirects under the ABC directory were crawled by the Baidu crawler
[Screenshot: the command and its output]
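The original screenshot is gone, so here is a hedged guess at what that count looked like, reusing the same field assumptions as the TOP10 query above ($9 status, $16 host, $7 URI):

# how many crawled 302 responses fall under the abc directory (logs for Dec 3 to Dec 9)
cat 2012120[3-9].txt | awk '{if ($9 ~ "302") print $16$7}' | awk -F "/" '{if ($2 ~ "abc") print $0}' | wc -l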
2. Add a condition to the query to determine exactly which directory level the problem URLs sit at
[Screenshots: the drill-down commands and their output]
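Those screenshots are also lost; one hedged way to ask how deep the redirecting abc URLs go is to group them by the number of path segments (after splitting on "/", awk's NF variable is the number of fields on the line):

# distribution of directory depths among redirected abc URLs
cat 2012120[3-9].txt | awk '{if ($9 ~ "302") print $16$7}' | awk -F "/" '{if ($2 ~ "abc") print NF}' | sort | uniq -c | sort -nr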
3. Having found the directory level of the wrong URLs, pull out 100 of them to have a look
cat 20121203.txt 20121204.txt 20121205.txt 20121206.txt 20121207.txt 20121208.txt 20121209.txt | awk '{if ($9 ~ "302") print $16$7}' | awk -F "/" '{if ($2 ~ "abc" && $4 != "") print $0}' | head -n100
The query above showed that the ABC directory under several different second-level domains contained the same kind of wrong URL (of the form domainname/abc/123/123) being crawled. At that point we could be sure the problem was widespread: it was a template error affecting a large number of pages, not an error on a few individual pages.
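To double-check that the pattern really spanned domains rather than being confined to one site, you can also count the offending URLs per host. This is only a hedged sketch under the same field assumptions as above, not part of the original investigation:

# bad-pattern URLs (an extra path segment under /abc/) counted per host
cat 2012120[3-9].txt | awk '{if ($9 ~ "302") print $16$7}' | awk -F "/" '{if ($2 ~ "abc" && $4 != "") print $1}' | sort | uniq -c | sort -nr | head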
Tracing it back, the root of the problem was the related-products link module on the product page. The week before, a colleague in R&D had, to save effort, written the links there as document-relative paths (<a href="123456">xxx</a>). When the current URL is domainname/abc/123, the relative path resolves correctly to domainname/abc/123456. Unfortunately, the same release also adjusted the rewrite rules, and the rule that 301-redirected the non-standard trailing-slash URL domainname/abc/123/ to the standard slash-less URL was overridden and stopped working. The result was that domainname/abc/123/ actually returned a 200, and, fatally, the relative path in that case resolved to domainname/abc/123/123456, exactly the wrong URL form we found at the end of the log analysis. These wrong URLs were then haphazardly matched by other 301 and 302 rewrite rules, and the erroneous crawl volume got out of hand. The screenshot was as follows:
[Screenshot]
Now that the root of the problem has been found, the next step is to develop a solution:
1. Fix the product page template: write the related-product links as root-relative paths (i.e. <a href="/abc/123456">xxx</a>);
2. Add the canonical URL rewrite rule back: 301 the trailing-slash URL to the slash-less one (RewriteRule ^/abc/([0-9]+)/$ /abc/$1 [R=301,L]); a quick way to verify the fix with curl is sketched after this list;
3. Because Baidu reacts to 301s very slowly, we also used the advanced rules of the site revision tool in Baidu Webmaster Platform to reinforce the URL canonicalization. The replacement rule is abc/#/<>, and then you fill in two example URLs to help Baidu match it. Briefly: the # in the rule stands for any number, which in our case is the numeric ID of the product page, and the trailing <> is replaced by whatever you put between the brackets; here it is empty, so the most recently matched text, namely the trailing slash at the end of the URL, is simply dropped. Unfortunately, if your canonical URL is the one with the trailing slash (/abc/123/), the Baidu revision tool cannot handle it, because its current rules are very rigid and do not support back-references the way regular expressions do, so in that situation you are simply stuck.
4. Optimize the website operation process. This article has gone into a lot of technical detail, which strays a bit from the topic of operational optimization, but the reason I wrote out the whole process is to look through the phenomenon to its essence: the root cause of this incident is a flaw in the operation process, because this kind of error could have been avoided entirely and should never have reached production. The three points above only treat the symptoms; to fix the source: first, optimize the release process for site versions, so that any release touching front-end code must, after passing testing, also go through front-end quality control by the SEO staff, a second confirmation before it is pushed to the production environment; second, run SEO training for the product and R&D departments and build a good relationship of communication and trust with them (this requires the SEO staff to be technically solid enough to win the engineers' respect, and to have good interpersonal skills and personal charm), raise R&D's understanding of and attention to SEO, and maintain an FAQ that is kept up to date; third, tie bugs to performance reviews; fourth, push the process optimization plan through and see it implemented.
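Once points 1 to 3 are live, a quick spot check of the fix can be done with curl. This is only a hedged sketch: domainname and the ID 123 are placeholders, and the expected codes follow from the description above:

# check the two URL forms; after the fix the first should return 301 and the second should no longer return 200
for u in "http://domainname/abc/123/" "http://domainname/abc/123/123456"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$u")
  echo "$code  $u"
done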
Final Summary:
1. Monitoring of data (not only logs) is very important; it is best to chart it weekly, looking first at the trend line and only then at the absolute values;
2. Breaking big data down into small data, step by step, is extremely important;
3. A good workflow is the most important guarantee of the whole team's efficiency;
Digression:
There is one more detail about redirects worth mentioning: redirect chains, where A 301s to B, B 301s to C, C 301s to D, and so on. If your site's URLs get revamped frequently, you will run into this sooner or later, because from a developer's point of view it makes no difference whether there are 2, 3, or 100 hops, as long as the request eventually lands on the final page. For SEO, though, this is a terrible thing, because with almost every extra hop you lose a further percentage of the weight the old page passes on. Matt Cutts has discussed this in a Google Webmaster video on YouTube: the range Google accepts is only one or two hops, three at most, and from my observation Baidu basically also handles at most two. Coming back to the theme of operational process: if, every time this kind of issue comes up, the developer is aware of its potential SEO impact and asks proactively, then you can imagine what a happy thing it is to do SEO in such a company.
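If you want to spot-check whether a given URL goes through more than one hop, here is a quick hedged sketch with curl (the URL is a placeholder):

# follow the redirects and report how many hops were taken and where we ended up
curl -sIL -o /dev/null -w 'hops: %{num_redirects}\nfinal: %{url_effective}\n' "http://domainname/old-page/"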