Contents
- Four types of encoding
- Principle
- Summary
- Application
Four types of encoding
1. The encoding used to store the webpage file. For example, in vim the fileencoding option sets the encoding of the stored file: set fileencoding=utf-8 saves the file as UTF-8.
2. The encoding declared in the meta tag, for example <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, which declares that the current page is encoded as UTF-8.
3. The encoding used by the browser to view the page, that is, the encoding selected under View => Encoding in the browser menu.
4. The encoding used when the browser submits user data.
Principle
1. The storage encoding of the webpage file is the most important encoding for a webpage. If the webpage is a static HTML file, the Web Server sends the file to the client's browser as is. If the webpage is generated dynamically, the Web Server produces the output bytes according to the encoding in which the script file is stored, and that output is the HTML sent to the client browser.
For example, in a PHP script stored in GBK, echo '我爱你' ("I love you") produces the six bytes CE D2 B0 AE C4 E3, which are the GBK encoding of '我爱你'. In a PHP script stored in UTF-8, the same echo '我爱你' produces the nine bytes E6 88 91 E7 88 B1 E4 BD A0, which are the UTF-8 encoding of '我爱你'.
2. In the HTML 4.01 Specification, the charset in the Content-Type value of the META tag indicates the encoding of the transmitted HTML document, and the specification expects a conforming browser to process this attribute correctly. In practice, however, most browsers do not take this attribute seriously. Firefox 10 and Chrome 17 do not follow it, so an HTML file that is clearly UTF-8 encoded, with the meta charset set to UTF-8, can still come out garbled. According to tests, current browsers (IE 9.0, FF 11, Chrome 17) parse the HTML file this way whether it is opened locally (FILE) or over HTTP.
3. The browser viewing encoding is the encoding the browser uses to decode the data sent by the Web Server. This is where garbled text comes from: if an HTML file is sent in GBK and the browser decodes it as UTF-8, any Chinese characters in the file are displayed as garbage.
4. Tests show that the encoding the browser uses to submit user data depends only on the encoding currently used to view the page. It has nothing to do with the storage encoding of the HTML file.
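Points 1 and 3 can be checked byte by byte. The sketch below uses Python only as a neutral way to inspect bytes (it is not part of the PHP setup described here), and it assumes the sample string is the three-character Chinese phrase 我爱你, whose byte values are quoted above.

```python
# The three-character Chinese string whose byte values are quoted above.
text = "我爱你"

gbk_bytes = text.encode("gbk")      # what a GBK-stored script would emit
utf8_bytes = text.encode("utf-8")   # what a UTF-8-stored script would emit

print(gbk_bytes.hex(" ").upper())   # CE D2 B0 AE C4 E3
print(utf8_bytes.hex(" ").upper())  # E6 88 91 E7 88 B1 E4 BD A0

# Decoding with the sender's encoding recovers the text.
assert gbk_bytes.decode("gbk") == text

# Decoding the GBK bytes as UTF-8 fails: CE D2 is not a valid UTF-8 sequence.
try:
    gbk_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A viewer that does not stop on errors shows replacement characters instead.
garbled = gbk_bytes.decode("utf-8", errors="replace")
print(strict_ok, garbled)
```

The same six bytes are either readable text or garbage depending only on which decoder is applied, which is the situation described in point 3.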
Summary
The encoding of the HTML sent by the server is determined mainly by the storage encoding of the HTML file or script file on the server. The encoding of the data sent by the browser is determined only by the browser's viewing encoding.
In addition, the charset in the Content-Type field of the HTTP header can also indicate the encoding of the data sent by the server, but a typical Web Server usually does not send this charset.
Finally, modern browsers generally have automatic encoding detection: the browser guesses the encoding from the data it receives.
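To make the second sentence concrete, here is a minimal sketch of how the same form value turns into different bytes on the wire depending on the page's viewing encoding. The helper name encode_form_value is hypothetical, and Python's urllib stands in for the browser's form encoder.

```python
from urllib.parse import quote, unquote


def encode_form_value(value: str, page_encoding: str) -> str:
    """Percent-encode a form value as a page viewed in page_encoding would.

    Hypothetical helper: it only imitates the byte-level effect of the
    browser's form submission.
    """
    return quote(value, safe="", encoding=page_encoding)


as_utf8 = encode_form_value("我", "utf-8")
as_gbk = encode_form_value("我", "gbk")
print(as_utf8)  # %E6%88%91
print(as_gbk)   # %CE%D2

# The server gets the text back only if it decodes with the same encoding.
assert unquote(as_gbk, encoding="gbk") == "我"
wrong = unquote(as_gbk, encoding="utf-8", errors="replace")
print(wrong == "我")  # False
```

The server-side script therefore has to know (or assume) the encoding the page was viewed in before it interprets the submitted bytes.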
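As an illustration of where that charset lives, the sketch below pulls it out of a Content-Type header value. The function name charset_from_content_type and the sample header values are made up for the example; the parsing itself uses Python's standard email.message module.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=GBK"))  # gbk
print(charset_from_content_type("text/html"))               # None
```

When the second case occurs (no charset in the header), the browser has to fall back on the meta tag or on its own detection.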
Application
Why does PHP's move_uploaded_file sometimes fail with Chinese file names?
Reproducing the problem
Suppose there is a back-end module that handles image uploads:
<?php
// Assume the client sends data in UTF-8, i.e. the upload page is viewed as UTF-8.
$old_name = $_FILES['file']['name'];
$new_name = 'e:\web\\' . $old_name;
// Way 1: reuse the file name sent by the client.
move_uploaded_file($_FILES['file']['tmp_name'], $new_name);
// Way 2: a Chinese file name written directly in the script.
move_uploaded_file($_FILES['file']['tmp_name'], 'e:\web\study\哈哈.jpg');
?>
If this module (upload.php) is stored as UTF-8, both ways of moving the uploaded file shown in the code should be avoided. It is better to generate a random new file name after the upload, and the new name should not contain Chinese characters.
Because:
First, a typical Web Server (such as httpd) cannot handle Chinese path names. That is, Apache httpd does not process the uploaded UTF-8 encoded file name correctly, and $_FILES['file']['name'] ends up garbled. So the old file name is already wrong.
Second, PHP's move_uploaded_file does not interpret the file path according to the storage encoding of the current script. It treats the parameter it receives as being in the current locale encoding (GBK here), so the path is garbled a second time.
For example, 'e:\web\study\哈哈.jpg' encoded as UTF-8 is
65 3a 5c 77 65 62 5c 73 74 75 64 79 5c e5 93 88 e5 93 88 2e 6a 70 67
move_uploaded_file treats these bytes as GBK, and the result is "e:\web\study\鍝堝搱.jpg": e5 93 is read as '鍝', 88 e5 as '堝', and 93 88 as '搱'. In other words, the six UTF-8 bytes of '哈哈' are interpreted as three GBK-encoded Chinese characters, so the uploaded file ends up as "e:\web\study\鍝堝搱.jpg" rather than the intended 哈哈.jpg. More often, the UTF-8 bytes cannot be interpreted as GBK at all. The system then substitutes ? as the default character, and because ? is not allowed in a Windows file name, the upload fails.
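The misreading can be reproduced outside PHP. The sketch below uses Python purely to inspect bytes, and it assumes the file name in the example path is 哈哈.jpg, which is what the byte values e5 93 88 e5 93 88 decode to as UTF-8. The second half uses a made-up name, 我.jpg, to show the case where the bytes are not valid GBK.

```python
path = "e:\\web\\study\\哈哈.jpg"

utf8_bytes = path.encode("utf-8")
print(utf8_bytes.hex(" "))
# 65 3a 5c 77 65 62 5c 73 74 75 64 79 5c e5 93 88 e5 93 88 2e 6a 70 67

# Reading the UTF-8 bytes as GBK pairs them up two by two:
# (e5 93) (88 e5) (93 88) become three unrelated Chinese characters.
misread = utf8_bytes.decode("gbk")
print(len(path), len(misread))  # 19 20

# An odd number of non-ASCII bytes leaves a lone lead byte in front of ".jpg",
# which is not valid GBK, so a strict decode fails.
odd_bytes = "我.jpg".encode("utf-8")
try:
    odd_bytes.decode("gbk")
    odd_ok = True
except UnicodeDecodeError:
    odd_ok = False
odd_replaced = odd_bytes.decode("gbk", errors="replace")
print(odd_ok)  # False
```

The replacement character produced in the second case plays the same role as the ? substituted by the system in the description above.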
So, to use a Chinese file name from a PHP file stored in UTF-8, first convert the UTF-8 encoded path to the current locale encoding (GBK) with iconv, and then call move_uploaded_file.
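The effect of that conversion step can be sketched as follows. This is a Python stand-in for the iconv call, not PHP code; the helper name utf8_to_locale is hypothetical and GBK is assumed to be the locale encoding, as in the rest of this article. It also shows that the conversion can fail for characters that GBK cannot represent, which a real upload handler would have to deal with.

```python
def utf8_to_locale(path_utf8: bytes, locale_encoding: str = "gbk") -> bytes:
    """Re-encode a UTF-8 path into the locale encoding (hypothetical helper)."""
    text = path_utf8.decode("utf-8")      # bytes -> characters
    return text.encode(locale_encoding)   # characters -> locale bytes


src = "e:\\web\\study\\哈哈.jpg".encode("utf-8")
converted = utf8_to_locale(src)

# Each Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK.
print(len(src), len(converted))  # 23 21

# An ANSI API that reads the converted bytes as GBK now sees the right name.
assert converted.decode("gbk") == "e:\\web\\study\\哈哈.jpg"

# Characters outside GBK cannot be converted and raise an error.
try:
    utf8_to_locale("😀.jpg".encode("utf-8"))
    emoji_ok = True
except UnicodeEncodeError:
    emoji_ok = False
print(emoji_ok)  # False
```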
In-depth exploration
So why does move_uploaded_file treat the UTF-8 encoded file path as GBK? Is this a PHP bug?
Let's look at the PHP source code:
/* {{{ proto bool move_uploaded_file(string path, string new_path)
   Move a file if and only if it was created by an upload */
PHP_FUNCTION(move_uploaded_file)
{
    char *path, *new_path;
    int path_len, new_path_len;
    zend_bool successful = 0;
#ifndef PHP_WIN32
    int oldmask; int ret;
#endif
    if (!SG(rfc1867_uploaded_files)) {
        RETURN_FALSE;
    }
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss", &path, &path_len, &new_path, &new_path_len) == FAILURE) {
        return;
    }
    if (!zend_hash_exists(SG(rfc1867_uploaded_files), path, path_len + 1)) {
        RETURN_FALSE;
    }
    if (PG(safe_mode) && (!php_checkuid(new_path, NULL, CHECKUID_CHECK_FILE_AND_DIR))) {
        RETURN_FALSE;
    }
    if (php_check_open_basedir(new_path TSRMLS_CC)) {
        RETURN_FALSE;
    }
    if (strlen(path) != path_len) {
        RETURN_FALSE;
    }
    if (strlen(new_path) != new_path_len) {
        RETURN_FALSE;
    }
    VCWD_UNLINK(new_path);
    if (VCWD_RENAME(path, new_path) == 0) {
        successful = 1;
#ifndef PHP_WIN32
        oldmask = umask(077);
        umask(oldmask);
        ret = VCWD_CHMOD(new_path, 0666 & ~oldmask);
        if (ret == -1) {
            php_error_docref(NULL TSRMLS_CC, E_WARNING, "%s", strerror(errno));
        }
#endif
    } else if (php_copy_file_ex(path, new_path, STREAM_DISABLE_OPEN_BASEDIR TSRMLS_CC) == SUCCESS) {
        VCWD_UNLINK(path);
        successful = 1;
    }
    if (successful) {
        zend_hash_del(SG(rfc1867_uploaded_files), path, path_len + 1);
    } else {
        php_error_docref(NULL TSRMLS_CC, E_WARNING, "Unable to move '%s' to '%s'", path, new_path);
    }
    RETURN_BOOL(successful);
}
/* }}} */
The main work is done by VCWD_RENAME,
#define VCWD_RENAME(oldname, newname) virtual_rename(oldname, newname TSRMLS_CC)
which expands to virtual_rename:
CWD_API int virtual_rename(char *oldname, char *newname TSRMLS_DC) /* {{{ */
{
    cwd_state old_state;
    cwd_state new_state;
    int retval;
    int cch, cb;
    LPWSTR wstr;
    LPSTR mbstr;

    CWD_STATE_COPY(&old_state, &CWDG(cwd));
    if (virtual_file_ex(&old_state, oldname, NULL, CWD_EXPAND)) {
        CWD_STATE_FREE(&old_state);
        return -1;
    }
    oldname = old_state.cwd;

    CWD_STATE_COPY(&new_state, &CWDG(cwd));
    if (virtual_file_ex(&new_state, newname, NULL, CWD_EXPAND)) {
        CWD_STATE_FREE(&old_state);
        CWD_STATE_FREE(&new_state);
        return -1;
    }
    newname = new_state.cwd;

    /* rename on windows will fail if newname already exists.
       MoveFileEx has to be used */
#ifdef TSRM_WIN32
    /* MoveFileEx returns 0 on failure, other way 'round for this function */
    retval = (MoveFileEx(oldname, mbstr, MOVEFILE_REPLACE_EXISTING|MOVEFILE_COPY_ALLOWED) == 0) ? -1 : 0;
    if (retval == -1) {
        php_error_docref(NULL TSRMLS_CC, 2, "movefileex failed");
    }
#else
    retval = rename(oldname, newname);
#endif

    CWD_STATE_FREE(&old_state);
    CWD_STATE_FREE(&new_state);
    return retval;
}
/* }}} */
We can see that move_uploaded_file ultimately calls MoveFileEx to move the file, and that the UTF-8 encoded path name is handed to MoveFileEx as if it were GBK. Why does this happen?
Because by default every Windows API call in PHP uses the ANSI version. MoveFileEx is therefore MoveFileExA, which expects a string in the system ANSI code page, GBK on this system (internally, the system uses MultiByteToWideChar to convert the GBK string into a UTF-16 LE string and then calls MoveFileExW).
Therefore, to solve this problem by hard-coding a fix at the low level, you can add the following code before the MoveFileEx call:
/* first convert utf-8 to utf16-le; cch counts wide characters, including the terminating NUL */
cch = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)newname, strlen(newname) + 1, NULL, 0);
wstr = (LPWSTR)malloc(cch * sizeof(wchar_t));
/* the last argument is the buffer size in characters, not in bytes */
MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)newname, strlen(newname) + 1, wstr, cch);
/* then convert utf16-le to gbk (the ANSI code page) */
cb = WideCharToMultiByte(CP_ACP, 0, wstr, cch, NULL, 0, NULL, NULL);
mbstr = (LPSTR)malloc(cb);
WideCharToMultiByte(CP_ACP, 0, wstr, cch, mbstr, cb, NULL, NULL);
free(wstr);
/* mbstr is passed to MoveFileEx and should be released with free() afterwards */
This converts the UTF-8 encoded string to GBK.
The takeaway is that, in PHP or any other system, it is best to avoid Chinese file names where possible, because file operations on Chinese-encoded names are prone to compatibility problems.
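The earlier advice to store uploads under a randomly generated name without Chinese characters can be sketched as follows. This is an illustrative Python version, not the article's PHP module; the function name random_upload_name and the ".bin" fallback extension are assumptions made for the example.

```python
import os
import secrets


def random_upload_name(original_name: str, default_ext: str = ".bin") -> str:
    """Build an ASCII-only file name for an uploaded file (hypothetical helper)."""
    ext = os.path.splitext(original_name)[1].lower()
    # Keep the client's extension only if it is plain ASCII letters or digits.
    if not (ext.startswith(".") and ext[1:].isascii() and ext[1:].isalnum()):
        ext = default_ext
    # 16 random bytes give 32 hexadecimal characters.
    return secrets.token_hex(16) + ext


name = random_upload_name("哈哈.JPG")
print(name.isascii(), name.endswith(".jpg"))  # True True
print(random_upload_name("哈哈.图片").endswith(".bin"))  # True
```

Because the stored name is pure ASCII, its bytes are the same in UTF-8 and GBK, so the ANSI/UTF-8 mismatch described above cannot affect it. The original Chinese name can still be kept separately, for example in a database, for display purposes.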