An earlier note of mine recorded only two lines of code. Published with so few words, it was itself a low-quality page, so I made it visible only to me, and tonight I am filling it in.
I remember that when I submitted a sitemap, Baidu kept reporting wrong URLs, which meant the sitemap could not be crawled. I looked for a way to solve this, hence the following note: use Shell to find a site's empty pages and 404 error pages.
Without further ado, here is the shell code:
time cat sitemap.txt | while read -r line; do curl -L "$line" -m 5 --connect-timeout 5 -o /dev/null -s -w "$line %{http_code} %{size_download}\n"; done
The time at the front is there to show how long the whole command takes to run.
%{http_code} prints the HTTP status code, which tells us whether the link is a normal 200 page or a 404 error page;
%{size_download} prints the number of bytes downloaded, i.e. the size of the page. If the value is very small, the page is probably a low-quality empty page, and you should find a way to remove it.
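Reading the raw output by eye gets tedious on a large sitemap, so here is a minimal sketch of a filter that keeps only the suspicious lines. It assumes each input line has the form "URL status size", as printed by the one-liner above. The function name flag_bad_pages and the 1024-byte threshold MIN_SIZE are my own placeholders, not part of the original note; tune the threshold to what an empty page looks like on your site.

```shell
#!/usr/bin/env bash
# Hypothetical threshold: 200 pages smaller than this many bytes count as "empty".
MIN_SIZE=${MIN_SIZE:-1024}

# Reads "URL STATUS SIZE" lines on stdin and prints only the suspicious ones,
# prefixed with the reason they were flagged.
flag_bad_pages() {
    awk -v min="$MIN_SIZE" '
        NF < 3                    { next }                      # skip malformed lines
        $2 == 404                 { print "404   " $0; next }   # missing page
        $2 == 200 && $3 + 0 < min { print "EMPTY " $0 }         # suspiciously small page
    '
}
```

One way to use it is to save the output of the one-liner to a file (say results.txt, a name picked here only for illustration) and then run flag_bad_pages < results.txt. Keeping the filter separate from the curl loop means the threshold can be changed and the check repeated without fetching every URL again.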