Differences between UTF-8 and GBK encodings (page encoding, database encoding differences) and applications in real-world projects

Last Update:2017-06-16 Source: Internet

Author: User

Tags coding standards php sample code

Section I: UTF-8 and GBK encoding overview

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode, sometimes called the universal code, aims to cover the characters of every writing system in the world, which makes UTF-8 the general-purpose choice for internationalized text. It was designed by Ken Thompson and Rob Pike in 1992. UTF-8 encodes each Unicode character in 1 to 4 bytes: ASCII characters such as English letters and digits take 8 bits (1 byte), commonly used Chinese characters take 24 bits (3 bytes), and rarer characters outside the Basic Multilingual Plane take 4 bytes. A single UTF-8 page can therefore display Simplified Chinese, Traditional Chinese, and other languages such as Japanese and Korean together.
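The byte counts described above can be checked with a short Python sketch. The sample characters and the helper name `utf8_len` are illustrative choices, not part of the original article.

```python
# Sketch: how many bytes UTF-8 needs for characters from different ranges.
# The sample characters below are arbitrary examples chosen for illustration.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u4e2d": "common Chinese character",
    "\U00020000": "CJK Extension B character",
}


def utf8_len(text: str) -> int:
    """Return the number of bytes UTF-8 uses to encode the given text."""
    return len(text.encode("utf-8"))


for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {utf8_len(ch)} byte(s)")
```

Code points up to U+007F take one byte, and the width grows to two, three, and four bytes as the code point value increases.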

GBK (Chinese Internal Code Specification) is one of the Chinese character encoding standards; its full name is the Chinese Character Internal Code Extension Specification. It was drawn up by the National Information Technology Standardization Technical Committee on December 1, 1995, and on December 15, 1995 the Department of Standardization of the State Bureau of Technical Supervision and the Department of Science, Technology and Quality Supervision of the Ministry of Electronics Industry jointly designated it a technical guidance document in notice No. 229 of 1995.
GBK extends the national standard GB2312 and stays compatible with it. GB2312 defines 7,445 characters: 6,763 Chinese characters and 682 other symbols. GBK defines 21,886 characters, of which 21,003 are Chinese characters; the often-quoted figure of 27,484 Chinese characters, along with Tibetan, Mongolian, Uighur and other minority scripts, belongs to the later GB18030 standard rather than to GBK. GBK encodes Chinese characters in two bytes, but it is not a pure double-byte encoding: like the rest of the GB series it is compatible with ASCII and can be mixed with it, so a half-width English letter or digit takes 1 byte while its full-width form takes 2 bytes. To tell the two apart, the highest bit of the first byte of every double-byte character is set to 1, whereas ASCII bytes have it set to 0. GBK is a national encoding, so it is less general than UTF-8, but Chinese text stored as UTF-8 takes more space in a database than the same text stored as GBK.
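As a rough illustration of the half-width and full-width distinction and of the high-bit rule, the following Python sketch encodes three characters as GBK using Python's built-in `gbk` codec. The helper names `gbk_bytes` and `high_bit_set` are hypothetical and exist only for this example.

```python
# Sketch: GBK byte layout for a half-width letter, a full-width letter,
# and a Chinese character. Helper names are illustrative only.
def gbk_bytes(text: str) -> bytes:
    """Encode text with Python's built-in GBK codec."""
    return text.encode("gbk")


def high_bit_set(byte: int) -> bool:
    """True when the most significant bit of the byte is 1."""
    return byte & 0x80 != 0


for text in ["a", "\uff41", "\u4e2d"]:  # half-width a, full-width a, Chinese
    raw = gbk_bytes(text)
    print(f"U+{ord(text):04X}: {len(raw)} byte(s), hex={raw.hex()}, "
          f"lead byte high bit={high_bit_set(raw[0])}")
```

The half-width letter keeps its one-byte ASCII value, while the full-width letter and the Chinese character each take two bytes whose lead byte has the highest bit set.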

In short:
UTF-8: English 1 byte, Chinese 3 bytes. It strikes a balance between encoding efficiency and robustness, suits network transmission, and is a good general-purpose encoding for Chinese text.
GBK: English 1 byte (half-width 1 byte, full-width 2 bytes), Chinese 2 bytes. GBK covers a wider range than GB2312 and is compatible with it.

Reference article:
http://blog.csdn.net/mydriverc2/article/details/50525203
http://blog.csdn.net/liudajiang/article/details/41133077
http://www.cnblogs.com/xiaomia/archive/2010/11/28/1890072.html

Section II: Differences between UTF-8 and GBK in web transport

PHP sample code (the source file is assumed to be saved as UTF-8):

<?php
$str = '中文'; // 2 Chinese characters
echo strlen($str), ', '; // UTF-8 length: 6 (1 Chinese character is 3 bytes, so 2 add up to 6 bytes)
echo strlen(iconv('utf-8', 'gbk', $str)), ', '; // GBK length: 4 (1 Chinese character is 2 bytes, so 2 add up to 4 bytes)

$str = 'a0'; // 1 English letter, 1 digit
echo strlen($str), ', '; // UTF-8 length: 2 (a letter or digit is 1 byte, so together 2 bytes)
echo strlen(iconv('utf-8', 'gbk', $str)), ', '; // GBK length: 2 (a letter or digit is 1 byte, so together 2 bytes)

$str = 'e汉'; // 1 English letter, 1 Chinese character
echo strlen($str), ', '; // UTF-8 length: 4 (Chinese 3 bytes + English 1 byte)
echo strlen(iconv('utf-8', 'gbk', $str)); // GBK length: 3 (Chinese 2 bytes + English 1 byte)

Output result: 6, 4, 2, 2, 4, 3

Section III: Differences between UTF-8 and GBK in database storage

==================== MySQL test ====================

MySQL sample code (UTF8 encoding):
-- Create a table with the utf8 character set
-- (MySQL's utf8 stores at most 3 bytes per character; utf8mb4 is needed for 4-byte characters)
CREATE TABLE chartestutf8 (fstr VARCHAR(2), fchr CHAR(2)) DEFAULT CHARSET=utf8;
-- Write test data
INSERT INTO chartestutf8 (fstr, fchr) VALUES ('中文', '中文');
INSERT INTO chartestutf8 (fstr, fchr) VALUES ('e文', 'e文');
INSERT INTO chartestutf8 (fstr, fchr) VALUES ('a0', 'a0');
-- Query the table contents; LENGTH() returns the length in bytes
SELECT fstr AS 'UTF8 variable length content', LENGTH(fstr) AS 'UTF8 variable length', fchr AS 'UTF8 fixed length content', LENGTH(fchr) AS 'UTF8 fixed length' FROM chartestutf8;

Content description:
UTF-8 encodes Unicode characters in 1 to 4 bytes: an English letter takes 1 byte (8 bits) and a Chinese character takes 3 bytes (24 bits).
'中文' (2 Chinese characters): 3 bytes * 2 = 6 bytes
'e文' (1 English letter + 1 Chinese character): 1 byte + 3 bytes = 4 bytes
'a0' (1 English letter + 1 digit): 1 byte + 1 byte = 2 bytes

MySQL sample code (GBK encoding):

-- Create a table with the gbk character set
CREATE TABLE chartestgbk (fstr VARCHAR(2), fchr CHAR(2)) DEFAULT CHARSET=gbk;
-- Write test data
INSERT INTO chartestgbk (fstr, fchr) VALUES ('中文', '中文');
INSERT INTO chartestgbk (fstr, fchr) VALUES ('e文', 'e文');
INSERT INTO chartestgbk (fstr, fchr) VALUES ('a0', 'a0');
-- Query the table contents; LENGTH() returns the length in bytes
SELECT fstr AS 'gbk variable length content', LENGTH(fstr) AS 'gbk variable length', fchr AS 'gbk fixed length content', LENGTH(fchr) AS 'gbk fixed length' FROM chartestgbk;

Content description:
GBK encodes Chinese characters and full-width symbols in two bytes, while half-width English letters and digits keep their one-byte ASCII encoding.
'中文' (2 Chinese characters): 2 bytes * 2 = 4 bytes
'e文' (1 English letter + 1 Chinese character): 1 byte + 2 bytes = 3 bytes
'a0' (1 English letter + 1 digit): 1 byte + 1 byte = 2 bytes

Note: in varchar(n), n is the number of characters, not the number of bytes. That is why all three two-character values fit into varchar(2) in both tables; the number of bytes they occupy depends on the encoding.
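The distinction in the note can be sketched in Python, under the assumption that a character-counted limit behaves like `len()` on a string and that byte length behaves like the length of the encoded bytes. The function names `fits_varchar` and `stored_bytes` are hypothetical and do not correspond to any MySQL API.

```python
# Sketch: character limit versus byte usage for the three test values.
# fits_varchar and stored_bytes are illustrative helpers, not MySQL functions.
def fits_varchar(text: str, n: int) -> bool:
    """Check a value against a limit of n characters."""
    return len(text) <= n


def stored_bytes(text: str, charset: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(charset))


for value in ["\u4e2d\u6587", "e\u6587", "a0"]:
    print(value,
          "fits varchar(2):", fits_varchar(value, 2),
          "utf8 bytes:", stored_bytes(value, "utf-8"),
          "gbk bytes:", stored_bytes(value, "gbk"))
```

Each value has two characters and passes the limit, yet its byte usage ranges from 2 to 6 depending on content and encoding, which matches the LENGTH() results in the two MySQL tests.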

Reference article:
http://www.oschina.net/question/199396_37127

==================== SQL Server test ====================

SQL Server sample code (VARCHAR):
-- Create a table (assumes a database with a Chinese collation such as Chinese_PRC_CI_AS)
CREATE TABLE chartest (fstr VARCHAR(4), fchr CHAR(4));
-- Write test data
INSERT INTO chartest (fstr, fchr) VALUES ('中文', '中文');
INSERT INTO chartest (fstr, fchr) VALUES ('e文', 'e文');
INSERT INTO chartest (fstr, fchr) VALUES ('a0', 'a0');

-- Query the table contents
-- DATALENGTH() returns the number of bytes; a Chinese character takes two bytes in a varchar column
SELECT fstr AS 'variable-length content', DATALENGTH(fstr) AS 'variable-length bytes', fchr AS 'fixed-length content', DATALENGTH(fchr) AS 'fixed-length bytes' FROM chartest;
-- LEN() returns the number of characters; every character counts as one
SELECT fstr AS 'variable-length content', LEN(fstr) AS 'variable-length characters', fchr AS 'fixed-length content', LEN(fchr) AS 'fixed-length characters' FROM chartest;

SQL Server sample code (NVARCHAR):

-- Create a table
CREATE TABLE chartestn (fstr NVARCHAR(2), fchr NCHAR(2));
-- Write test data (the N prefix marks the literals as Unicode strings)
INSERT INTO chartestn (fstr, fchr) VALUES (N'中文', N'中文');
INSERT INTO chartestn (fstr, fchr) VALUES (N'e文', N'e文');
INSERT INTO chartestn (fstr, fchr) VALUES (N'a0', N'a0');
-- Query the table contents
-- DATALENGTH() returns the number of bytes; every character takes two bytes in an nvarchar column
SELECT fstr AS 'variable-length content', DATALENGTH(fstr) AS 'variable-length bytes', fchr AS 'fixed-length content', DATALENGTH(fchr) AS 'fixed-length bytes' FROM chartestn;
-- LEN() returns the number of characters; every character counts as one
SELECT fstr AS 'variable-length content', LEN(fstr) AS 'variable-length characters', fchr AS 'fixed-length content', LEN(fchr) AS 'fixed-length characters' FROM chartestn;

Content description (varchar/char columns):
'中文' (2 Chinese characters): 2 bytes * 2 = 4 bytes, so the column needs varchar(4); inserting it into a varchar(2) column fails with the error that string or binary data would be truncated.
'e文' (1 English letter + 1 Chinese character): 1 byte + 2 bytes = 3 bytes
'a0' (1 English letter + 1 digit): 1 byte + 1 byte = 2 bytes
SQL Server pads fixed-length types (char(n)) with half-width spaces when the value is shorter than n, so DATALENGTH() always returns the same length for such a column.

Project application:
varchar stores the actual byte length: 1 Chinese character takes 2 bytes and 1 English letter takes 1 byte; n ranges from 1 to 8000.
nvarchar stores by character count: every character, Chinese or English, takes 2 bytes; n ranges from 1 to 4000.

Additional knowledge:

The difference between varchar and nvarchar in SQL Server:
1. In varchar, an English letter takes one byte and a Chinese character takes two bytes. The length n is between 1 and 8000, and the storage size is the actual byte length of the input data.
2. In nvarchar, every character takes two bytes, whether English or Chinese. The length n is between 1 and 4000, and the storage size is twice the number of characters entered (the N prefix stands for Unicode characters).
3. In terms of storage mode, nvarchar is stored by character and varchar is stored by byte.
4. In terms of space, varchar is more economical because its storage size is the actual byte length, while nvarchar always uses two bytes per character.
5. If a project may involve text in the languages of different countries, nvarchar is recommended: it uses Unicode encoding, which reduces the chance of garbled characters.
6. char and nchar are fixed-length data types; values shorter than n are padded with half-width spaces.
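The storage sizes in the list above can be approximated in Python. This is a sketch under two assumptions that are not stated in the original: the varchar column uses a Chinese code page 936 collation (modelled here with Python's `cp936` codec), and nvarchar stores two-byte UTF-16 code units (modelled with `utf-16-le`). The helper names are hypothetical.

```python
# Sketch: approximate storage size of a value in varchar versus nvarchar.
# Assumes code page 936 for varchar and UTF-16LE for nvarchar.
def varchar_bytes(text: str) -> int:
    """Bytes used when the value is stored in a code page 936 varchar column."""
    return len(text.encode("cp936"))


def nvarchar_bytes(text: str) -> int:
    """Bytes used when the value is stored as two-byte Unicode code units."""
    return len(text.encode("utf-16-le"))


for value in ["\u4e2d\u6587", "e\u6587", "a0"]:
    print(value, "varchar:", varchar_bytes(value), "nvarchar:", nvarchar_bytes(value))
```

For pure English data varchar needs half the space of nvarchar, while for pure Chinese data the two are equal, which is the trade-off behind points 4 and 5.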

LEN() function: returns the number of characters (not bytes) in the given string expression, excluding trailing spaces. Each character counts as one.
DATALENGTH() function: returns the number of bytes used by any expression. In a varchar column a Chinese character counts as two bytes.
LEN() does not count trailing spaces, while DATALENGTH() does.
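The interaction between fixed-length padding and the two functions can be imitated with a small Python sketch. It only models the behaviour described above, assuming a char(n) column on code page 936; `pad_char`, `sql_len`, and `sql_datalength` are hypothetical helpers, not SQL Server APIs.

```python
# Sketch: model char(n) padding and the LEN()/DATALENGTH() behaviour described above.
# Assumes a code page 936 column; all helper names are illustrative.
def pad_char(text: str, n: int, codec: str = "cp936") -> bytes:
    """Mimic char(n): pad the encoded value with half-width spaces up to n bytes."""
    raw = text.encode(codec)
    if len(raw) > n:
        raise ValueError("string or binary data would be truncated")
    return raw + b" " * (n - len(raw))


def sql_len(raw: bytes, codec: str = "cp936") -> int:
    """Count characters, ignoring trailing spaces."""
    return len(raw.decode(codec).rstrip(" "))


def sql_datalength(raw: bytes) -> int:
    """Count bytes, including trailing spaces."""
    return len(raw)


for value in ["\u4e2d\u6587", "e\u6587", "a0"]:
    stored = pad_char(value, 4)
    print(value, "LEN:", sql_len(stored), "DATALENGTH:", sql_datalength(stored))
```

Every value reports a character count of 2 and a byte count of 4, because the padding spaces are ignored by the character count but included in the byte count.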

Reference article:
http://www.cnblogs.com/14lcj/archive/2012/07/08/2581234.html
http://blog.163.com/rihui_7/blog/static/212285143201211123342333/?NdsKey=246770

Copyright notice: This article is licensed under the Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 3.0 CN) license. Please credit the author and source when reposting.
Article title: Differences between UTF-8 and GBK encodings (page encoding, database encoding differences) and applications in real-world projects
Article link: http://www.cnblogs.com/sochishun/p/7026762.html
Author: Sochishun (e-mail: 14507247#qq.com | blog: http://www.cnblogs.com/sochishun/)
Published: June 13, 2017

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
