First date with the Googlebot: headers and compression

Googlebot, what a dreamboat! He seems to know our soul and every part of us. He may not be looking for anything unique; he has already seen billions of other sites (and we share our data with other search engine bots too), but tonight, as website and Googlebot, we will really get to know each other.

I know it's never a good idea to over-analyze a first date. We will get to know Googlebot a little at a time, over a series of articles:

Our first date (tonight): the headers Googlebot sends, and the file formats he notices that are suitable for compression;

Judging his response: response codes (301s, 302s), and how he handles redirects and If-Modified-Since;

Next steps: along with links, getting him to crawl faster or slower (so he doesn't get carried away).

Tonight is our first date ...

***************

Googlebot: Acknowledged.

Website: Googlebot, you've arrived!

Googlebot: Yes, I'm here!

GET / HTTP/1.1
Host: example.com
Connection: Keep-alive
Accept: */*
From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Accept-Encoding: gzip,deflate
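
If you are curious how your own server answers a request like this, here is a minimal Python sketch that sends the same headers and prints the response status; example.com and the copied Googlebot User-Agent string are used purely for illustration, not as a suggestion to impersonate Googlebot against other people's sites.

import http.client

# Reproduce the request shown above so you can inspect your server's reply.
# The User-Agent string is copied from the dialogue for illustration only.
conn = http.client.HTTPConnection("example.com", 80, timeout=10)
conn.putrequest("GET", "/", skip_host=True, skip_accept_encoding=True)
conn.putheader("Host", "example.com")
conn.putheader("Connection", "Keep-alive")
conn.putheader("Accept", "*/*")
conn.putheader("From", "googlebot(at)googlebot.com")
conn.putheader("User-Agent",
               "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
conn.putheader("Accept-Encoding", "gzip,deflate")
conn.endheaders()
resp = conn.getresponse()
print(resp.status, resp.reason)
print("Content-Encoding:", resp.headers.get("Content-Encoding", "identity"))
conn.close()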

Website: Those headers are wonderful! Would you crawl with the same headers whether my site were in America, Asia, or Europe? Do you ever use other headers?

Googlebot: In general, the headers I use are consistent around the world. I'm trying to figure out what a page looks like with a site's default language and settings. Sometimes the User-Agent differs; for example, AdSense fetches use "Mediapartners-Google":

User-Agent: Mediapartners-Google

Or for image search:

User-Agent: Googlebot-Image/1.0

The user agent for wireless fetches varies by carrier, and Google Reader RSS fetches include extra information, such as the number of subscribers.

I usually avoid cookies (so there is no "Cookie:" header), because I don't want content to depend too heavily on information specific to one session. Also, if a server uses a session ID in a dynamic URL rather than a cookie, I can usually recognize it, so I don't crawl the same page over and over just because its session ID changes.
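
As a rough illustration of that last point, here is a minimal Python sketch that strips session-ID-style parameters from a URL before deciding whether the page has been seen before; the names in SESSION_PARAMS are common examples chosen for this sketch, not a definitive list.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names here are illustrative; a real crawler would learn which
# parameters behave like session IDs for a given site.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    """Drop session-ID-style query parameters so the same page is not
    fetched repeatedly under different session IDs."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))

print(canonicalize("http://example.com/page?id=7&sessionid=abc123"))
# -> http://example.com/page?id=7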

Website: My structure is very complex, and I use many types of files. Your header says "Accept: */*". Do you crawl every URL, or do you automatically filter out certain file extensions?

Googlebot: That depends on what I'm looking for.

If I'm crawling for regular web search and I see links to MP3s or videos, I probably won't download them. Similarly, if I see a JPG, I naturally handle it differently from an HTML or PDF link. For example, JPGs change much less often than HTML, so I check JPGs for changes less frequently to save bandwidth. On the other hand, if I'm looking for links for Google Scholar, I'll be far more interested in a PDF article than in a JPG file. For a scholar, downloading doodles (like JPGs) or videos of skateboarding puppies is just a distraction, wouldn't you say?

Website: Yes, that could be distracting. I do admire your professionalism, though. I love doodles (JPGs) myself; it's hard to resist their temptation.

Googlebot: Me too; I'm not always such a scholar. When I crawl for image search, I'm very interested in JPGs, and for news I mostly look at HTML and the images near it.

There are also many extensions, such as EXE, DLL, ZIP, DMG, and so on, that are both numerous and of little use to a search engine.
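
The extension-based filtering Googlebot describes might look something like the Python sketch below; the SKIP_EXTENSIONS list is an illustrative assumption for this sketch, not Googlebot's actual policy.

from urllib.parse import urlsplit

# Illustrative skip list only; not Googlebot's real rules.
SKIP_EXTENSIONS = {".exe", ".dll", ".zip", ".dmg", ".mp3", ".avi"}

def should_fetch(url: str) -> bool:
    """Return False for URLs whose extension suggests content a
    web-search crawler would typically not download."""
    path = urlsplit(url).path.lower()
    dot = path.rfind(".")
    if dot == -1:
        return True  # no extension: fetch it and sniff the Content-Type later
    return path[dot:] not in SKIP_EXTENSIONS

print(should_fetch("http://www.example.com/page1.html"))   # True
print(should_fetch("http://www.example.com/setup.exe"))    # False
print(should_fetch("http://www.example.com/page1.LOL111")) # True: unknown extension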

Website: If you saw my URL "http://www.example.com/page1.LOL111", would you turn it away just because it has an unknown file extension?

Googlebot: Website friend, let me give you a little background. Once I have actually downloaded a file, I use the Content-Type header to check whether it is HTML, an image, text, or something else. If it is a special data type such as a PDF, Word document, or Excel spreadsheet, I make sure it is in a valid format and extract the text content from it. You can never be sure whether it carries a virus. If the document or data type is truly garbled, there is usually nothing better to do than discard it.

So if I crawl your "http://www.example.com/page1.LOL111" URL and see the unknown file extension, I would probably start by downloading it. If I can't figure out the content type from the header, or if it belongs to a format we don't index (MP3, for example), it gets set aside. Otherwise, we carry on processing the file.
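
Here is a minimal Python sketch of that decision, classifying a fetched URL by its Content-Type response header instead of its extension; the handling rules and type lists are assumptions made for this sketch.

import urllib.request

# Illustrative handling rules; a real crawler's policy is far richer.
INDEXABLE_TYPES = {"text/html", "text/plain", "application/pdf"}

def classify(url: str) -> str:
    """Fetch a URL and decide how to handle it from the Content-Type
    response header rather than from the file extension."""
    req = urllib.request.Request(url, headers={"Accept": "*/*"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        content_type = resp.headers.get_content_type()  # e.g. "text/html"
    if content_type in INDEXABLE_TYPES:
        return "index (%s)" % content_type
    if content_type.startswith("image/"):
        return "hand off to the image pipeline (%s)" % content_type
    return "set aside (%s)" % content_type

print(classify("http://www.example.com/"))  # e.g. "index (text/html)"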

Website: Googlebot, I apologize for prying into your working habits, but I noticed that your "Accept-Encoding" header says:

Accept-Encoding: gzip,deflate

Can you tell me something about these headers?

Googlebot: Of course. All mainstream search engines and web browsers support gzip compression of content to save bandwidth. You may also run into other values, such as "x-gzip" (the same as "gzip"), "deflate" (which we also support), and "identity" (no compression at all).

Website: Can you tell me more about file compression and "Accept-Encoding: gzip,deflate"? Many of my URLs are large Flash files and gorgeous images, not just HTML. Would it help you crawl faster if I compressed my larger files?

Googlebot: There is no simple answer to that. First of all, formats such as SWF (Flash), JPG, PNG, GIF, and PDF are already compressed (and there are also dedicated Flash optimizers).

Website: Perhaps I have already compressed my Flash files without knowing it. Apparently I have been more efficient than I realized.

Googlebot: Both Apache and IIS offer options to enable gzip and deflate compression, though the bandwidth saved comes at the cost of more CPU. In general, the feature is only worthwhile for content that compresses easily, such as text (HTML/CSS/PHP output), and it is only used when the user's browser, or I (a search engine bot), allow it. Personally, I prefer "gzip" to "deflate". Gzip encoding is a bit more reliable for me, because it carries an extra checksum and a complete header, whereas "deflate" leaves me guessing. Otherwise, they are very similar compression algorithms.

If your server has CPU to spare, you can experiment with compression (see the documentation for Apache and IIS). But if you serve dynamic content and your server's CPU is already fully loaded, I would suggest holding off.
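
Here is a minimal client-side Python sketch of the content negotiation described above, assuming the server honors "Accept-Encoding: gzip"; it shows how a fetcher decompresses the response according to the Content-Encoding header, and the User-Agent string is a made-up placeholder.

import gzip
import zlib
import urllib.request

def fetch(url: str) -> bytes:
    """Request a page with Accept-Encoding: gzip,deflate and decompress
    the body according to the Content-Encoding response header."""
    req = urllib.request.Request(url, headers={
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate",
        "User-Agent": "example-bot/0.1 (demo only)",  # placeholder UA
    })
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
        encoding = resp.headers.get("Content-Encoding", "identity").lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        # Some servers send a zlib-wrapped stream, others a raw deflate
        # stream; this is the guessing Googlebot complains about.
        try:
            return zlib.decompress(body)
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # "identity": no compression

html = fetch("http://www.example.com/")
print(len(html), "bytes after decompression")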

Website: You speak from long experience. I'm so glad you could come see me tonight. Thank goodness my robots.txt file let you in; that file can sometimes be like a parent who is overprotective of their children.

Googlebot: Ah, time to meet the parents: robots.txt. I have seen plenty of crazy ones. Some are really just HTML error pages rather than valid robots.txt files. Some are full of endless redirects, possibly pointing to completely unrelated sites. Others are enormous, listing thousands of individual URLs one by one. Here is one pattern with an unfortunate side effect. Normally the file wants me to crawl its content:

User-agent: *
Allow: /

However, during peak user traffic, the site swaps in a highly restrictive robots.txt:

# Can you go away for a while? I'll let you back
# in again later. Really, I promise!
User-agent: *
Disallow: /

The problem with this robots.txt swapping is that once I see the restrictive robots.txt, I may have to start throwing away the site's content that is already crawled in the index. And then I have to recrawl a lot of that content once I'm allowed into the site again. A temporary 503 response code would at least have been, well, temporary.

In general, I can only recheck robots.txt about once a day (otherwise, on many virtually hosted sites, I would spend a large share of my fetches just reading robots.txt files, and not many dates want to meet the parents that often). For webmasters, controlling crawl frequency by swapping robots.txt is a harmful side effect; a better way is to set the crawl rate to "Lower" in Webmaster Tools.
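
For the website's side of the conversation, here is a minimal Python sketch of how a well-behaved crawler consults robots.txt before fetching, using the standard-library robots.txt parser; the user-agent string and the sample URLs are placeholders.

from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt once, then reuse it for many URL checks.
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

user_agent = "example-bot"  # placeholder; not Googlebot's real token
for url in ("http://www.example.com/",
            "http://www.example.com/private/report.html"):
    if rp.can_fetch(user_agent, url):
        print("allowed:", url)
    else:
        print("blocked by robots.txt:", url)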

Googlebot: Website friend, thank you for your questions; you have been wonderful, but now I have to say, "Goodbye, my love."

Website: Oh, Googlebot ... (end of the dialogue) :)

Author's website: http://www.1gu.org.cn
