Today's project needed to encode Chinese characters with JavaScript's escape and then decode them with unescape, and the test code produced garbled output. The details are as follows: First, open the test page test.html with EditPlus and enter the following HTML code:
Page printout:
%ufffd%u0171%ufffd%u05ae%ufffd%
Problem converting Chinese to UTF-8 with htmlentities
$str = "Chinese";
echo json_encode($str);
Shown:
[null]
So I want to use htmlentities to convert it:
$str = htmlentities($str, ENT_COMPAT, 'UTF-8');
echo json_encode($str);
Python's annoying \xef\xbf\xbd and xefxbf
First, you need to know what \xef\xbf\xbd is.
>>> u'\uFFFD'.encode('utf-8')
'\xef\xbf\xbd'
From this we can see that \xef\xbf\xbd is the UTF-8 encoding of u'\uFFFD'. So what is '\uFFFD'?
It turns out that there must be some words in the conversion process of Unicode and the old encoding syste
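The relationship between U+FFFD and the bytes \xef\xbf\xbd can be checked directly. The following is a minimal Python 3 sketch (the interpreter session above is Python 2); the input `bad_bytes` is a made-up value chosen only because 0xff can never occur in valid UTF-8.

```python
# U+FFFD is the Unicode replacement character.
REPLACEMENT = "\ufffd"

# Its UTF-8 encoding is the three bytes discussed above.
encoded = REPLACEMENT.encode("utf-8")
print(encoded)  # b'\xef\xbf\xbd'

# Hypothetical damaged input: 0xff is never valid in UTF-8.
bad_bytes = b"ab\xffcd"

# With errors="replace", each undecodable byte becomes U+FFFD.
decoded = bad_bytes.decode("utf-8", errors="replace")
print(ascii(decoded))  # 'ab\ufffdcd'

# Re-encoding the damaged text is what leaves \xef\xbf\xbd in the output.
reencoded = decoded.encode("utf-8")
print(reencoded)  # b'ab\xef\xbf\xbdcd'
```

Once the replacement has happened, the original byte is gone, so the garbling has to be fixed where the text is first decoded rather than afterwards.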
format:
1^2345|678
Number Format pattern Syntax
You can design your own format patterns for numbers by following the rules specified by the following BNF:
pattern := subpattern{;subpattern}
subpattern := {prefix}integer{.fraction}{suffix}
prefix := '\\u0000'..'\\uFFFD' - specialCharacters
suffix := '\\u0000'..'\\uFFFD' - specialCharacters
integer := '#'* '0'* '0'
fraction := '0'* '#'*
The notation of Chinese characters overlaps. (In fact, this mapping is only a mapping between code values; the glyphs do not look the same when displayed. The symbols in Unicode are single-byte wide, while the symbols in the Chinese character set are double-byte wide.) There are 20 such symbols between Unicode \u00a0 and \u00ff. It is very important to understand this feature! It then becomes easy to see why, in Java programming, encoding errors often produce garbled output (in fact, symbol characters) rather than all '?' characters, just l
characters: !, ', (, ), *, -, ., _, ~, 0-9, a-z, A-Z
Example:
alert(encodeURIComponent("a&t plastic")); // a%26t%20plastic
alert(escape("a&t plastic")); // a%26t%20plastic
alert(encodeURI("a&t plastic")); // a&t%20plastic
alert(escape("a&t plastic中")); // a%26t%20plastic%uFFFD%uFFFD
We see that encodeURI does not encode URI reserved characters, and '中' is encoded as %uFFFD%
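The difference between encoding a URI component and encoding a whole URI can also be sketched in Python with `urllib.parse.quote`. This is only an approximation of the JavaScript functions: the two `safe` strings below are assumptions chosen to mirror the unescaped sets of encodeURIComponent and encodeURI, not an exact port.

```python
from urllib.parse import quote

text = "a&t plastic"

# Roughly like encodeURIComponent: reserved characters such as & are escaped.
component = quote(text, safe="!'()*-._~")

# Roughly like encodeURI: reserved URI characters are left untouched.
whole_uri = quote(text, safe=";/?:@&=+$,#!'()*-._~")

# quote() encodes non-ASCII text as UTF-8 bytes by default.
chinese = quote("中")

print(component)  # a%26t%20plastic
print(whole_uri)  # a&t%20plastic
print(chinese)    # %E4%B8%AD
```

A correctly decoded '中' yields its three UTF-8 bytes here; a result containing FFFD means the character was already lost before the encoding function was called.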
mapping between code values; the glyphs do not look the same when displayed. The symbols in Unicode are single-byte wide, and the symbols in Chinese characters are double-byte wide. There are 20 of these symbols between Unicode \u00a0 and \u00ff. It is important to understand this feature! It then becomes easy to see why, in Java programming, Chinese character encoding errors often produce garbled output (in fact, symbol characters) rather than all "?" characters, like the example above.
Byte --> Unicode: if the character identified by the bytes does not exist in the source character set, the result is 0xfffd.
For example:
byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1}; new String(ba, "gb2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value, so \ufffd is used. (Note: When this Unicode is di
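The same four bytes can be decoded in Python to see where the replacement character comes from. This sketch assumes the source encoding in the Java example is GB2312 and compares it with GBK; Python's codec may resynchronise after the bad byte differently from Java, so no exact string is claimed for the GB2312 case.

```python
# The four bytes from the example: 0x8140 (GBK only) followed by 0xB0A1.
data = bytes([0x81, 0x40, 0xB0, 0xA1])

# GB2312 has no entry for 0x8140, so the decoder substitutes U+FFFD.
as_gb2312 = data.decode("gb2312", errors="replace")

# GBK is a superset of GB2312 and maps both byte pairs.
as_gbk = data.decode("gbk", errors="replace")

print(ascii(as_gb2312))
print(ascii(as_gbk))
```

Decoding with the wider character set avoids the loss, which suggests checking the declared encoding first whenever \ufffd appears in decoded Chinese text.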
'?' characters, as in the example above.
Byte --> Unicode: if the character identified by the bytes does not exist in the source character set, the result is 0xfffd.
For example:
byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1}; new String(ba, "gb2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character. The GB2312 conversion table does not have the corresponding value. \
two forms: \uhhhh corresponding to a 16-bit code point value, and \Uhhhhhhhh corresponding to a 32-bit code point value
"世界"
"\xe4\xb8\x96\xe7\x95\x8c"
"\u4e16\u754c"
"\U00004e16\U0000754c"
The three escape sequences above provide alternative notations for the first string, but the values they denote are identical.
A rune whose value is less than 256 can be written with a single hexadecimal escape, such as '\x41' for the character 'A', but for larger code points you must use the \u or \U esca
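The equivalence of these notations can be checked in Python, which uses the same \u and \U escapes. One difference is an assumption worth stating: in a Python str, \xhh denotes a code point rather than a raw byte, so the UTF-8 byte form is written as a bytes literal in this sketch.

```python
# The literal string and its two code point escape forms.
plain = "世界"
short_escape = "\u4e16\u754c"          # 16-bit form
long_escape = "\U00004e16\U0000754c"  # 32-bit form

# The UTF-8 byte sequence has to be a bytes literal in Python.
utf8_bytes = b"\xe4\xb8\x96\xe7\x95\x8c"

print(plain == short_escape == long_escape)  # True
print(utf8_bytes.decode("utf-8") == plain)   # True
print("\x41")                                # A
```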
characters
\f form feed
\n newline
\r carriage return
\t tab
\u allows you to specify a Unicode character as a hexadecimal constant
\d is equivalent to [0-9]; \D is the opposite, equivalent to [^0-9]
\s is equivalent to [\f\n\r\t\u000b\u0020\u00a0\u2028\u2029]. This is an incomplete subset of Unicode whitespace characters; \S is just the opposite
\w is equivalent to [0-9A-Z_a-z]; \W is the opposite. \w wants to represent t
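The explicit character classes listed above can be tried out in Python. This sketch spells the classes out literally instead of using Python's own \d, \s and \w, because those are Unicode-aware by default in Python 3 and so do not match the sets described here; the sample characters are arbitrary choices.

```python
import re

# Explicit classes copied from the descriptions above.
DIGIT = re.compile(r"[0-9]")
SPACE = re.compile("[\f\n\r\t\u000b\u0020\u00a0\u2028\u2029]")
WORD = re.compile(r"[0-9A-Z_a-z]")

digit_ok = bool(DIGIT.fullmatch("7"))
nbsp_ok = bool(SPACE.fullmatch("\u00a0"))         # no-break space is in the set
ideographic_ok = bool(SPACE.fullmatch("\u3000"))  # ideographic space is not
accented_ok = bool(WORD.fullmatch("\u00e9"))      # e with acute is not a "word" character

print(digit_ok, nbsp_ok, ideographic_ok, accented_ok)  # True True False False
```

The ideographic space U+3000 is a Unicode whitespace character that the listed set leaves out, which illustrates why the set is called an incomplete subset.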
provided. Take the Chinese character "李" in GB2312, which is encoded as "C0EE" and is to be converted to ISO8859-1. The steps are: first convert "李" to Unicode, giving "674E", and then convert "674E" to an ISO8859-1 character. Of course, this mapping will not succeed, because ISO8859-1 has no character that corresponds to "674E".
When the mapping is unsuccessful, a problem occurs! When converting from a language to Unicode, if there is no such character in a language,
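The two-step conversion described above (GB2312 bytes to Unicode, then Unicode to ISO8859-1) can be walked through in Python. This is a minimal sketch; the choice to show both the strict failure and the `errors="replace"` fallback is an illustration, not something the original text prescribes.

```python
li = "\u674e"  # the character 李

# Step 0: its GB2312 encoding is the two bytes C0 EE.
gb_bytes = li.encode("gb2312")
print(gb_bytes.hex())  # c0ee

# Step 1: GB2312 bytes -> Unicode code point 674E.
code_point = ord(gb_bytes.decode("gb2312"))
print(hex(code_point))  # 0x674e

# Step 2: Unicode -> ISO8859-1 fails, because no such character exists there.
try:
    li.encode("iso8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True

# With errors="replace", the encoder substitutes '?' for the unmappable character.
replaced = li.encode("iso8859-1", errors="replace")
print(replaced)  # b'?'
```

In the encoding direction the substitute is '?', while in the decoding direction it is U+FFFD, which matches the two kinds of garbled output described in this page.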
character identified by the bytes does not exist in the source character set, the result is 0xfffd.
For example:
byte ba[] = {(byte) 0x81, (byte) 0x40, (byte) 0xb0, (byte) 0xa1}; new String(ba, "gb2312");
The result is "?啊", and the hex value is "\ufffd\u554a". 0x8140 is a GBK character; the GB2312 conversion table has no corresponding value, so \ufffd is used. (Note: When this Unicode is displayed, because there is no corresponding local character, the previous case i
in a range loop, if incorrectly encoded UTF-8 input is encountered, a special Unicode character '\uFFFD' is generated; in print it usually appears as a black hexagonal or diamond shape containing a white question mark (?). String and byte slices: four packages in the standard library are especially important for string handling: bytes, strings, strconv, and unicode. The strings package provides many functions such as querying, replacing, compa
incorrectly encoded UTF-8 input is encountered, a special Unicode character '\uFFFD' is generated; in print this symbol usually appears as a black hexagonal or diamond shape containing a white question mark (?).
String and byte slices
Four packages in the standard library are especially important for string handling: bytes, strings, strconv, and unicode.
The strings package provides many functions such as querying, replacing, comparing, tr
operation することです.
jp-char:あアいイうウえエおオ
kr:정규표현식은매우유용한도구텍스트를조작하는것입니다.
PUC:.?! 、,;:“ ”‘ '——......· -·《》〈〉! ¥%*#
'''
#let's look its raw representation under the hood:
print "the raw utf8 string is:\n", repr(sample)
print
#find the non-ascii chars:
findPart(r"[\x80-\xff]+", sample, "non-ascii")
#convert the utf8 to unicode
usample = unicode(sample, 'utf8')
#let's look its raw representation under the hood:
print "the raw unicode string is:\n", repr(usample)
print
#get each language parts:
findPa
invalidKr.
Puc :.?! ,;: "" ''--... · "
'''
# Let's look its raw representation under the hood:
print "the raw utf8 string is:\n", repr(sample)
print
# Find the non-ascii chars:
findPart(r"[\x80-\xff]+", sample, "non-ascii")
# Convert the utf8 to unicode
usample = unicode(sample, 'utf8')
# Let's look its raw representation under the hood:
print "the raw unicode string is:\n", repr(usample)
print
# Get each language parts:
findPart(u"[\u4e00-\u9fa5]+", usample, "unicode chinese")
findPart(u
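The definition of findPart is not included in the excerpt above. The following Python 3 sketch shows one way such a helper might look; the function `find_part`, the sample string, and the kana and Hangul ranges are all assumptions made for illustration, and only the Chinese range [\u4e00-\u9fa5] comes from the excerpt.

```python
import re

def find_part(pattern, text, label):
    """Hypothetical helper: print and return every run in text matched by pattern."""
    matches = re.findall(pattern, text)
    print(label, [ascii(m) for m in matches])
    return matches

# Made-up mixed-language sample (Python 3 strings are already Unicode).
sample = "regex 正则表达式 あアいイ 정규표현식"

# Range taken from the excerpt above.
chinese = find_part("[\u4e00-\u9fa5]+", sample, "unicode chinese")

# Assumed ranges: hiragana plus katakana, and precomposed Hangul syllables.
kana = find_part("[\u3040-\u30ff]+", sample, "unicode japanese kana")
hangul = find_part("[\uac00-\ud7a3]+", sample, "unicode korean")
```

Matching on decoded text rather than on raw UTF-8 bytes is what makes per-language ranges like these possible; the byte-level pattern [\x80-\xff]+ can only separate ASCII from non-ASCII.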