Transferred from: https://blog.robotshell.org/2012/deal-with-http-header-encoding-for-file-download/
Recently I ran into a requirement in a project: file downloads had to be forced (that is, the browser must show the download dialog instead of trying to render certain file formats itself), and the file name had to stay exactly as the user originally uploaded it (possibly containing non-ASCII characters).
The first requirement is easy to implement: send the HTTP header Content-Disposition: attachment together with Content-Type: application/octet-stream and you are covered. The second requirement is the painful one, because it raises the question of how the header itself is encoded (the file name goes into the filename parameter of Content-Disposition). As we all know, the Content-Type header can declare the encoding of the content (body), but how is the encoding of the header itself specified? For that matter, are non-ASCII characters allowed in a header at all?
If you ignore the encoding problem, the file name will certainly come out garbled on some combination of system and browser. If you try to solve it, you will likely find a pile of contradictory solutions (I can responsibly tell you that 99% of them are non-standard tricks). Let's see how to solve this problem gracefully.
I took a lot of detours exploring this problem: first my own experiments, then Google (searching in both English and Chinese), then reading the source code of Discuz and other well-known projects. Opinions differed and there was no consensus. Finally I thought of going back to the RFCs to look for an answer in the standards, and sure enough that paid off. Since the path there was so tortuous, I will give the standard approach first. Content-Disposition should be set like this:
Content-Disposition: attachment;
                     filename="$encoded_fname";
                     filename*=utf-8''$encoded_fname
Here $encoded_fname is the original file name, encoded as UTF-8 and then percent-encoded according to RFC 3986 (in PHP, use the rawurlencode() function). These lines can also be joined into a single line (separating the parameters with a space is recommended).
In addition, for compatibility with IE6, make sure the original file name includes an English (ASCII) extension!
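As a rough illustration of the header described above, here is a minimal Python sketch. The function name content_disposition is made up for this example, and Python's urllib.parse.quote with safe="" is used here to play the role that rawurlencode() plays in PHP; treat it as one possible way to build the value, not a definitive implementation.

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build a Content-Disposition value with both filename and filename*.

    Hypothetical helper for illustration only.
    """
    # quote(..., safe="") percent-encodes the UTF-8 bytes of every character
    # except ASCII letters, digits and "-._~"; a space becomes %20.
    encoded = quote(filename, safe="")
    return f"attachment; filename=\"{encoded}\"; filename*=utf-8''{encoded}"


print("Content-Disposition: " + content_disposition("€ rates.txt"))
```

The same encoded string is used in both parameters, which is why a single call to quote() is enough.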
Getting to the bottom of it
Let's look at why we should do it this way, and why it works.
First, HTTP/1.1 as defined by RFC 2616 (RFC 2068 was the earliest version; RFC 2616 replaced it and is the most widely used, though it has since been superseded by other RFCs, as mentioned later) bases its message format on the old ARPA Internet Text Messages format, and an ARPA message can only be ASCII (RFC 822 section 3). RFC 2616 section 2.2 stresses again that for TEXT (per section 4.2, a header field value is TEXT) to use a different character set, the string must be encoded using the rules of RFC 2047. Note that this rule was originally an extension for MIME (e-mail), and its format is very different from the percent-encoding we are familiar with. Here is an example from MIME:
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
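To make the RFC 2047 "encoded-word" format a little more concrete, the sketch below decodes the Subject value from the RFC's own example with Python's standard email.header module. The helper name decode_encoded_words is an assumption made for this illustration.

```python
from email.header import decode_header


def decode_encoded_words(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value (illustrative helper)."""
    pieces = []
    for part, charset in decode_header(value):
        # decode_header yields bytes for encoded parts, plus the declared charset
        if isinstance(part, bytes):
            part = part.decode(charset or "ascii")
        pieces.append(part)
    return "".join(pieces)


print(decode_encoded_words("=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?="))
# prints: If you can read this yo
```

The "B" in the encoded-word means Base64, which is why the payload looks nothing like a percent-encoded string.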
When RFC 2616 was published in 1999, the Content-Disposition header was not yet a formal part of HTTP; it was simply borrowed from the MIME standard because it was already widely used (RFC 2616 section 19.5.1). As a result, almost no browser supported multilingual encoding for Content-Disposition, which amounts to an "extension of an extension". In fact, the RFC 2047 multilingual encoding recommended in RFC 2616 has never been supported by mainstream browsers, so we don't need to consider this MIME scheme at all...
But the problem still needed solving, so browser vendors came up with their own approaches:
- IE supports percent-encoding directly inside filename: filename="$encoded_text" (not MIME encoding!). Strictly, according to RFC 2616, a quoted string that is not MIME-encoded should be taken literally, even if it "looks like a percent-encoded string", but IE "automatically" decodes such a file name, provided the file name has an unencoded (i.e. ASCII) extension!
- Some other browsers support an even cruder approach: they allow a raw UTF-8 string to be used directly in filename="TEXT"! This also directly violates the RFC 2616 requirement that HTTP headers be ASCII.
These two behaviors are incompatible with each other. So you can sniff the UA and use the former approach for IE and the latter for other browsers, which generally gets you something that just works (this is what Discuz does). For Opera and Safari, however, this may not be effective.
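For reference, the UA-sniffing workaround described above might look roughly like the following Python sketch. It is not Discuz's actual code; the function name and the "MSIE" check are assumptions made for illustration, and this is the non-standard approach the rest of this post tries to avoid.

```python
from urllib.parse import quote


def legacy_filename_param(user_agent: str, filename: str) -> str:
    """Pick one of the two non-standard filename styles by sniffing the UA."""
    # Assumption: an "MSIE" token in the User-Agent marks old Internet Explorer.
    if "MSIE" in user_agent:
        # IE style: percent-encoded UTF-8 inside the quoted string.
        return 'filename="' + quote(filename, safe="") + '"'
    # Other browsers: the raw string, to be sent on the wire as UTF-8 bytes.
    return 'filename="' + filename + '"'


print(legacy_filename_param("Mozilla/4.0 (compatible; MSIE 6.0)", "€ rates.txt"))
```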
Times change. RFC 5987, published in 2010, formally specifies how parameters in HTTP headers carry multilingual text, using the form parameter*=charset'lang'value, where:
- charset and lang are case-insensitive.
- lang labels the language of the field, for screen-reading software or for language-specific rendering; it may be left empty.
- value is percent-encoded according to RFC 3986 section 2.1, and browsers are required to support at least ISO-8859-1 and UTF-8.
- When parameter and parameter* both appear in an HTTP header, the browser should use the latter (parameter*).
The advantage is that backward compatibility is preserved: the HTTP header is still ASCII-only, and older browsers that do not support this standard will, following RFC 2616, treat parameter* as an unknown parameter name and ignore it. Then in 2011 RFC 6266 was published, formally incorporating Content-Disposition into the HTTP standard and reaffirming the multilingual encoding method of RFC 5987. It also gives an example that addresses backward compatibility:
Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates
In this example, the value of filename is an English phrase with the same meaning ("EURO rates" standing in for "€ rates"). This follows RFC 2616, under which an ordinary parameter should not be encoded, and UTF-8 is used for filename* simply because the standard requires it to be supported. But think about it a little further: the old browsers still in circulation are mostly IE. So we can adapt the example by putting the percent-encoded string directly into filename as well:
Content-Disposition: attachment; filename="%e2%82%ac%20rates.txt"; filename*=utf-8''%e2%82%ac%20rates.txt
Newer versions of Firefox, Chrome, Opera, Safari and other browsers support the new standard and use filename*, so it does not matter that they will not automatically decode filename. Older versions of IE do not recognize filename*, so they ignore it and fall back to the old filename (the only small flaw is that the name must have an English extension). This neatly solves multi-browser, multi-language compatibility: no UA sniffing is needed, and it is closer to the standard.
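To show how a client that understands RFC 5987 would read such a header, here is a small Python sketch of the receiving side. Both function names are hypothetical, the parameters are assumed to be already split into a dict, and real browsers do considerably more validation than this.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str):
    """Split charset'lang'value and percent-decode the value part."""
    charset, language, encoded = ext_value.split("'", 2)
    # unquote() turns %XX escapes back into bytes and decodes them
    # using the charset named at the front of the ext-value.
    return unquote(encoded, encoding=charset), language


def pick_filename(params: dict) -> str:
    """Prefer filename* over filename, as the rules above describe."""
    if "filename*" in params:
        return decode_ext_value(params["filename*"])[0]
    return params["filename"]


params = {"filename": "EURO rates", "filename*": "utf-8''%e2%82%ac%20rates"}
print(pick_filename(params))  # prints: € rates
```

A client without RFC 5987 support would never reach the first branch and would simply take the plain filename value.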
P.S. Why use PHP's rawurlencode() function? Because it is the one that truly implements the percent-encoding of RFC 3986. For historical reasons, the urlencode() function had already been used for the similar encoding rules of HTTP POST forms, so the newer function ended up with this odd name. The difference between the two is that the former encodes a space as %20, while the latter encodes it as +. If you use the latter, IE6 will turn the spaces into plus signs when downloading a file whose name contains spaces. In general you should rarely need urlencode() (some versions of Discuz have a bug where it is wrongly used to encode file names, turning spaces into plus signs).
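The same distinction exists outside PHP. As a rough analogy (not the PHP functions themselves), Python's urllib.parse.quote behaves like rawurlencode() for spaces, while quote_plus behaves like urlencode():

```python
from urllib.parse import quote, quote_plus

name = "my report.txt"

raw_style = quote(name, safe="")   # space becomes %20, suitable for filename*
form_style = quote_plus(name)      # space becomes +, the HTML form convention

print(raw_style)   # prints: my%20report.txt
print(form_style)  # prints: my+report.txt
```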
Encoding of HTTP headers when downloading files is handled correctly (content-disposition)