package com.alex.base;
import java.io.UnsupportedEncodingException;
/** Converts string encodings */
public class Changecharset {
/** 7-bit ASCII, also known as ISO646-US, the Basic Latin block of the Unicode character set */
public static final String us_ascii = "US-ASCII";
/** ISO Latin Alphabet No. 1, also known as ISO-LATIN-1 */
public static final String iso_8859_1 = "ISO-8859-1";
/** 8-bit UCS transformation format */
public static fi
Python handles multiple languages well and can process text in any of the encodings in use today. Here is an in-depth look at how Python handles different languages.
One thing to be aware of is that when Python converts between encodings, it goes through its internal encoding. The conversion process is:
source encoding -> internal encoding -> target encoding
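The conversion chain above can be sketched in a few lines. This is a minimal illustration in Python 3 syntax, where the internal form is a `str`; the helper name `transcode` and the sample text are assumptions made for this example, not part of the original article.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert bytes from one encoding to another via Python's internal Unicode str."""
    text = data.decode(source_encoding)   # source encoding -> internal encoding
    return text.encode(target_encoding)   # internal encoding -> target encoding


# Hypothetical sample: the same two characters stored as GBK, then converted to UTF-8.
gbk_bytes = "中文".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(utf8_bytes.decode("utf-8"))
```

There is no direct GBK-to-UTF-8 step here: every conversion passes through the Unicode string in the middle.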
Internally, Python works in Unicode, but the use of Unicode
TEM, multibyte character system).
With the MBCS concept we can represent more characters. For example, two bytes give us 16 bits, so in theory there are 2^16 = 65,536 possible characters. But how are these codes assigned to characters? For example, who decided that the Unicode code of "口" (mouth) is 21475? The character set, which is the charset just introduced. ASCII is the most basic character set, and on top of it we have character sets such as GB2312 and Big5 for MBCS in Sim
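The code 21475 quoted above can be checked directly. The short sketch below, in Python 3, looks up the code point of "口" and then encodes the same character with two multibyte character sets; the variable names are chosen for this illustration only.

```python
ch = "口"

# The character set assigns the number; Python exposes it through ord().
code_point = ord(ch)
print(code_point, hex(code_point))

# The same character gets a two-byte code in each multibyte character set.
gb2312_bytes = ch.encode("gb2312")
big5_bytes = ch.encode("big5")
print(gb2312_bytes.hex(), big5_bytes.hex())
```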
Yesterday a colleague ran into a strange problem: the following code fails JSON validation and cannot be parsed by PHP's json_decode function.
[
{
"title": "",
"Pinyin": ""
}
]
As you have probably guessed, it contains a special character that you cannot see. You can reveal it in vim:
[
{
"Pinyin": ""
}
]
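Outside of vim, one way to surface such invisible characters is to scan the text for control and format characters. The sketch below is only an illustration: the helper `find_invisible` is hypothetical, and the U+FEFF placed in the sample string is an assumption, since the snippet does not say which character was actually involved.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point) pairs for control/format characters in text."""
    hits = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace is allowed in JSON source
        if unicodedata.category(ch) in ("Cc", "Cf"):
            hits.append((index, "U+%04X" % ord(ch)))
    return hits


# Assumed sample: a zero-width character hidden in front of "title".
sample = '{"\ufefftitle": ""}'
print(find_invisible(sample))
```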
It turns out that "title" is preceded by a charac
Before starting this article, I assume you can already distinguish between a Unicode code (that is, a code point) and a Unicode encoding implementation. Otherwise, the following will make little sense.
History
We know that the ISO 10646 committee defines a super character set called Universal Character Set (UCS) to encompass all the writing systems in the world. Because the UCS is now encoded in 4 bytes, it is
: Currently the latest version of the Unicode character set contains more than 100,000 characters in various languages.
Unicode encoding: (In the narrow sense, "Unicode encoding" may refer to UCS-2 or to UTF-16. In the broad sense, it can refer to any of several encoding implementations of the Unicode standard, including the following four.)
1. UTF-32 encoding: always uses 4 bytes to represent a character, there is a problem of space ut
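The fixed 4-byte cost of UTF-32 is easy to see by comparing encoded sizes. This is a small sketch; the helper name `encoded_sizes` and the sample characters are assumptions for illustration.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character takes in each encoding (no BOM)."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in ("A", "汉", "\U0001F600"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

Every character costs 4 bytes in UTF-32, even the ASCII letter that needs only 1 byte in UTF-8.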
First, the character set summary. In fact, most of the knowledge in this article has already been explained very clearly elsewhere. Here I only share my own impressions. 1. Although the name UTF-8 begins with UTF (Unicode Transformation Format), it is not Unicode in the strict sense. It is a re-encoding built on top of the UCS. Moreover, it is a variable-length encoding. 2. According to the article, at the same time as the ISO development of UCS (Univ
{ tmp.append(src.substring(lastPos, pos)); lastPos = pos; } } } return tmp.toString(); }
The code logic is simple: it parses characters written with 2 hex digits [0-255] and with 4 hex digits [4096-65535]. But this raises two questions: do 3-digit characters [256-4095] not exist? And do characters wider than 4 digits exist? If they do, this code has a serious bug that can cause parsing to fail. Let's start with the first question: The East A
import java.io.UnsupportedEncodingException;
/** Converts string encodings */
public class Changecharset {
/** 7-bit ASCII, also known as ISO646-US, the Basic Latin block of the Unicode character set */
public static final String us_ascii = "US-ASCII";
/** ISO Latin Alphabet No. 1, also known as ISO-LATIN-1 */
public static final String iso_8859_1 = "ISO-8859-1";
/** 8-bit UCS transformation format */
public static final String utf_8 = "UTF-8";
/** 16-bit
The first 00 06 is added by writeUTF and is the number of bytes that follow. The next six bytes are the UTF-8 encoding of "你好", 3 bytes for each Chinese character.
The second one is 4f 60 59 7d. This is the big-endian Unicode code of "你好", 2 bytes for each Chinese character.
The third is 60 7d, which is the low byte of each of the two Chinese characters, taken from 4f 60 and 59 7d respectively.
Further description
If you use Notepad to save files in different encodings, the file header has some tags to identify the e
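The three byte sequences described above can be reproduced in Python for comparison. This sketch only mimics the byte layouts; the variable names are made up for the illustration, and the mapping of the second and third outputs to Java's writeChars and writeBytes is an assumption, since the truncated text names only writeUTF.

```python
import struct

text = "你好"

# Like writeUTF: a 2-byte big-endian length prefix, then the UTF-8 bytes.
# (Java uses "modified UTF-8", which is identical to UTF-8 for these two characters.)
utf8 = text.encode("utf-8")
write_utf_like = struct.pack(">H", len(utf8)) + utf8

# Two bytes per character, big endian (presumably writeChars).
write_chars_like = text.encode("utf-16-be")

# Only the low byte of each character is kept (presumably writeBytes).
write_bytes_like = bytes(ord(ch) & 0xFF for ch in text)

print(write_utf_like.hex(), write_chars_like.hex(), write_bytes_like.hex())
```

The third form throws away the high byte of every character, so the original text cannot be recovered from it.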
First, get the encoding format of the file
When we use file input and output streams, we often run into garbled text, which is usually caused by the encoding format.
Take copying a document as an example: we read the file with an input stream (FileInputStream) and then write it to another file with an output stream (FileOutputStream). If the encoding of the source file is inconsistent with the encoding we write with, the text may come out garbled. Therefore, we need to obtain th
Chinese characters, but some characters and numbers will be lost.
$code=preg_replace("#\\\u([0-9a-f]+)#ie","iconv('UCS-2','UTF-8',pack('H4','\\1'))",$code);print$code;
------ Solution --------------------
$code = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', function ($matches) { return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE"); }, $str);
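For comparison, the same idea can be sketched in Python: match each \uXXXX escape with exactly four hex digits and replace it with the character it names, leaving everything else untouched. The function name `decode_u_escapes` is hypothetical, and this sketch does not join surrogate pairs.

```python
import re

# A literal backslash, the letter u, then exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(s):
    """Replace each \\uXXXX escape in s with the corresponding character."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


print(decode_u_escapes(r"\u4f60\u597d, ok 123"))
```

Fixing the width at four digits keeps the letters and digits that follow an escape from being swallowed by the match.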
You can mo
bytes to represent a Chinese character. Therefore, it can theoretically represent a maximum of 256x256 = 65536 characters.
Chinese encoding deserves an article of its own, so this note does not cover it. The only point to make here is that although the GB family of Chinese character encodings also uses multiple bytes to represent one symbol, it has nothing to do with the Unicode and UTF-8 discussed below.
3. Unicode
Unicode character set (
With Unicode there are two encoding formats to consider: UCS-2, which has a total of 65,536 code points, and UCS-4, which has 2,147,483,648 code points. Python supports both formats. The choice is made at compile time with --enable-unicode=ucs2 or --enable-unicode=ucs4. How can we determine which one the installed Python uses by default? One way is to check the value of sys.maxunicode:
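A minimal sketch of that check follows; the helper name `unicode_build` is an assumption for this example. Note that since Python 3.3 the narrow/wide build distinction no longer exists and `sys.maxunicode` is always 1114111 (0x10FFFF), so the narrow branch can only be reached on older interpreters.

```python
import sys


def unicode_build():
    """Classify the interpreter by its largest supported code point."""
    if sys.maxunicode > 0xFFFF:
        return "UCS-4 (wide)"
    return "UCS-2 (narrow)"


print(sys.maxunicode, unicode_build())
```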
ASCII is a character set, including uppercase and lowercase English letters, numbers, and control characters. It is represented in one byte and ranges from 0 to 127.
Because ASCII covers very few characters, each country or region put forward its own character set built on top of it. For example, GB2312, which is widely used in China, provides encodings for Chinese characters and represents each of them in two bytes.
These character sets are incompatible with each other. The same number may indicate diff
-US and Unicode character set */
public static final String us_ascii = "US-ASCII";
/** ISO Latin Alphabet No. 1, also known as ISO-LATIN-1 */
public static final String iso_8859_1 = "ISO-8859-1";
/** 8-bit UCS transformation format */
public static final String utf_8 = "UTF-8";
/** 16-bit UCS transformation format, big-endian byte order (the high byte is stored at the lowest address) */
pub
text, as shown in the following table:
Leading bytes    Character set/encoding
EF BB BF         UTF-8
FE FF            UTF-16/UCS-2, big endian
FF FE            UTF-16/UCS-2, little endian
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian
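These signatures can be checked against the byte order mark constants in Python's standard `codecs` module. The sketch below is illustrative only; the function name `detect_bom` is hypothetical. The UTF-32 marks are tested first because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```python
import codecs

# Longest signatures first, so FF FE 00 00 is not mistaken for FF FE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32/UCS-4, little endian"),
    (codecs.BOM_UTF32_BE, "UTF-32/UCS-4, big endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16/UCS-2, little endian"),
    (codecs.BOM_UTF16_BE, "UTF-16/UCS-2, big endian"),
]


def detect_bom(data):
    """Return the encoding named by the leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xfe\xff\x00h"))
```

A byte order mark is only a hint: a file without one returns None here, and UTF-16 little-endian text that begins with a NUL character is indistinguishable from UTF-32 little endian by this test.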
For example, after inserting a tag, the UTF-16 (big endian) that connects the word "and the UTF-8 Code
Python has good support for multiple languages and can handle text in any of the encodings in use today. Here is an in-depth look at how Python handles different languages.
One thing to be clear about is that when Python transcodes text, it goes through its internal encoding. The conversion process is:
source encoding -> internal encoding -> target encoding
Internally, Python works in Unicode, but the use of Unicode takes into account th
In PHP, if we want to convert UTF-8 to Unicode, we can use the following function to implement the UTF encoding.
UTF-8 encodes in units of 8 bits. The encoding from UCS-2 to UTF-8 is as follows:
UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000-007F                   0xxxxxxx
0080-07FF                   110xxxxx 10xxxxxx
0800-FFFF                   1110xxxx 10xxxxxx 10xxxxxx
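The table translates directly into bit operations. The following is a minimal sketch in Python rather than PHP; the function name `ucs2_to_utf8` is hypothetical, and it does not reject the surrogate range D800-DFFF.

```python
def ucs2_to_utf8(cp):
    """Encode one UCS-2 code (0x0000-0xFFFF) as UTF-8 bytes, following the table."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("code outside the UCS-2 range")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


print(ucs2_to_utf8(0x6C49).hex())
```

The code 6C49 falls in the 0800-FFFF row, so it is spread over the 16 x positions of the three-byte template.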
For example, the Unicode code of the Chinese character "汉" is 6C49. 6C49 is between 0
character in one byte is called a single-byte character set (single-byte charset). In the GB2312 character set, ASCII characters are still stored in a single byte. In other words, ASCII is a subset of that character set. GB2312 contains only a few thousand commonly used Chinese characters, which often does not meet practical needs, so it was extended into the widely used GBK character set. GBK is currently the default character set of Windows and some other Chinese operating systems.