The relationship between Android, Java, and Unicode

Source: Internet
Author: User

Background
The goal was to use a regular expression to find emoji characters so that they could be filtered out.
1. According to
http://apps.timwhitlock.info/emoji/tables/unicode, the emoji code points lie in the range U+1F600 to U+1F6FF.
Readers who want a refresher on Unicode code points and on UTF-8, UTF-16, and UTF-32 can refer to this article:
http://www.ruanyifeng.com/blog/2014/12/unicode.html

2. Start matching with a regular expression
Here string is a String that contains emoji code points.
if (Pattern.compile("[\u1F600-\u1F6FF]",
        Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE).matcher(string).find()) {
    return "Contains emoji characters, cannot create (ucs-4)";
}

This match does not succeed.
However, using
if (Pattern.compile("[\ud83c\udc00-\ud83c\udfff]|[\ud83d\udc00-\ud83d\udfff]|[\u2600-\u27ff]",
        Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE).matcher(text).find()) {
    return res.getString(R.string.move_content_invalid_emoji);
}
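
One likely explanation for the difference, offered as a sketch rather than a confirmed diagnosis of the original app: in Java source a \u escape consumes exactly four hex digits, so "\u1F600" is the character U+1F60 followed by the digit 0, and the character class ends up covering the range from '0' to U+1F6F. The class name EmojiPatternDemo and the helper range are made up for illustration, and the flags from the snippets above are left out to keep it short.

```java
import java.util.regex.Pattern;

class EmojiPatternDemo {
    // Meant as U+1F600..U+1F6FF, but each escape stops after four hex digits,
    // so the class is really: U+1F60, the range '0'..U+1F6F, and the letter 'F'.
    static final Pattern BROKEN = Pattern.compile("[\u1F600-\u1F6FF]");

    // Builds a character class from two code points, using surrogate pairs
    // where needed. Assumes neither end is a regex metacharacter.
    static Pattern range(int from, int to) {
        String lo = new String(Character.toChars(from));
        String hi = new String(Character.toChars(to));
        return Pattern.compile("[" + lo + "-" + hi + "]");
    }

    public static void main(String[] args) {
        String wink = new String(Character.toChars(0x1F609));
        Pattern fixed = range(0x1F600, 0x1F6FF);

        System.out.println("broken vs emoji: " + BROKEN.matcher(wink).find());
        System.out.println("broken vs abc:   " + BROKEN.matcher("abc").find());
        System.out.println("fixed vs emoji:  " + fixed.matcher(wink).find());
        System.out.println("fixed vs abc:    " + fixed.matcher("abc").find());
    }
}
```

On a standard JVM the first pattern misses a lone emoji yet matches plain ASCII letters, which is consistent with the failed match described above.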

the match succeeds, which is confusing.
So I played my trump card: printing the byte stream and analyzing the data.
This is what I found:
UTF-8 \xf0\x9f\x98\x89\xf0\x9f\x98\x89\xf0\x9f\x98\x89\xf0\x9f\x98\x89\xf0\x9f\x98\x89
UTF-16BE \xd8\x3d\xde\x09\xd8\x3d\xde\x09\xd8\x3d\xde\x09\xd8\x3d\xde\x09\xd8\x3d\xde\x09
UTF-16LE \x3d\xd8\x09\xde\x3d\xd8\x09\xde\x3d\xd8\x09\xde\x3d\xd8\x09\xde\x3d\xd8\x09\xde
UTF-32BE \x00\x01\xf6\x09\x00\x01\xf6\x09\x00\x01\xf6\x09\x00\x01\xf6\x09\x00\x01\xf6\x09
UTF-32LE \x09\xf6\x01\x00\x09\xf6\x01\x00\x09\xf6\x01\x00\x09\xf6\x01\x00\x09\xf6\x01\x00
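
A dump like the one above can be produced with a few lines of Java. This is a minimal sketch: the class EncodingDump and the method hex are illustrative names, and it assumes the runtime provides the UTF-32BE and UTF-32LE charsets (mainstream JDKs do, but the Java specification does not guarantee them).

```java
import java.nio.charset.Charset;

class EncodingDump {
    // Returns the bytes of s in the named charset as lowercase hex.
    static String hex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F609, the code point that appears in the dump above.
        String wink = new String(Character.toChars(0x1F609));
        String[] charsets = {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"};
        for (String cs : charsets) {
            System.out.println(cs + " " + hex(wink, cs));
        }
    }
}
```

For a single U+1F609 this should reproduce one repetition of each line in the dump, for example f09f9889 for UTF-8 and d83dde09 for UTF-16BE.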

Those are the byte streams obtained from one string of emoji characters under different encodings.
The \u1f600 form we used at first evidently matches the UTF-32BE representation of Unicode,
while \ud83c\udc00 evidently matches the UTF-16BE representation.
At this point I was thoroughly confused.
I began to wonder: if Android uses UTF-8, why is it the UTF-16BE form that works at runtime?
I also began to wonder how Pattern actually interprets the regular expression I pass in.
I tried converting the string's encoding with new String(text.getBytes("UTF-8"), "UTF-32BE") (this is a serious mistake).
That did not work either,
and I was completely lost.

After reading the article at http://www.2cto.com/kf/201303/195387.html and talking it over with colleagues, I came to understand three concepts:
1. The encoding used inside Java
2. The JVM platform's default character set
3. The encoding of external resources

Java class files are encoded in UTF-8 (string constants are stored in a modified UTF-8 form),
the JVM uses UTF-16 at runtime,
and a Java String holds Unicode text.
In summary, a String object is always UTF-16 encoded, whether its content comes from a class file or from an external resource.
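
The UTF-16 nature of String is visible from ordinary code: length() counts 16-bit code units, not code points. The sketch below uses illustrative names (CodeUnitDemo, containsInRange) and also shows a regex-free way to test for code points in a range, which sidesteps the escape problem entirely.

```java
class CodeUnitDemo {
    // True if s contains any code point in [from, to]. Walks the string by
    // code point, so a surrogate pair is treated as one character.
    static boolean containsInRange(String s, int from, int to) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= from && cp <= to) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String wink = new String(Character.toChars(0x1F609));
        // One code point, but two UTF-16 code units.
        System.out.println("length = " + wink.length());
        System.out.println("code points = " + wink.codePointCount(0, wink.length()));
        System.out.println("has emoji = " + containsInRange("hi " + wink, 0x1F600, 0x1F6FF));
    }
}
```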

In other words, new String(text.getBytes("UTF-8"), "UTF-32BE") obtains the bytes of text encoded as UTF-8, then decodes those bytes as if they were UTF-32BE, and the resulting String is once again stored as UTF-16.
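
A small sketch of why that call cannot work. The names WrongDecodeDemo, reinterpret, and roundTrip are illustrative, and the UTF-32BE charset is assumed to be available on the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongDecodeDemo {
    // Encodes as UTF-8, then decodes the same bytes as UTF-32BE.
    // The bytes are not valid UTF-32BE for the original text, so the
    // result differs from the input.
    static String reinterpret(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("UTF-32BE"));
    }

    // Encodes and decodes with the same charset, which is lossless.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wink = new String(Character.toChars(0x1F609));
        System.out.println("mismatched charsets keep text: " + reinterpret(wink).equals(wink));
        System.out.println("matching charsets keep text:   " + roundTrip(wink).equals(wink));
    }
}
```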

The point is restated below with an example.
A common myth about string encoding:

Java code
new String(input.getBytes("ISO-8859-1"), "GB18030")
What does the code above mean? Some would say: "It converts the input string from ISO-8859-1 encoding to GB18030 encoding."
If that were true, how would we explain that Java strings, as just mentioned, are always Unicode encoded?

That reading is not merely flawed, it is simply wrong. Let us analyze what actually happens: the data should have been read and decoded into a string as GB18030, but it was decoded as ISO-8859-1 instead, which produced a garbled string. To recover, the string is turned back into the original byte array and then decoded again with the correct encoding, GB18030 (that is, the GB18030-encoded data is converted into a Unicode string). Note that the string itself is always Unicode encoded.
Encoding conversion is not as simple as two negatives making a positive, though. Here we can convert back correctly because ISO-8859-1 is a single-byte encoding, so every byte is turned into one character of the String as is. Although the first conversion was wrong, no information was lost, so we still have a chance to convert back!
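
The recovery described above can be sketched as follows. MojibakeRecovery, misread, and recover are illustrative names, and the GB18030 charset is assumed to be present on the runtime (it is in mainstream JDKs, but it is not one of the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRecovery {
    static final Charset GB18030 = Charset.forName("GB18030");

    // The mistake: GB18030 bytes decoded as ISO-8859-1.
    // Each byte becomes one char in U+0000..U+00FF, so nothing is lost.
    static String misread(byte[] gbBytes) {
        return new String(gbBytes, StandardCharsets.ISO_8859_1);
    }

    // The repair: get the original bytes back, then decode them correctly.
    static String recover(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, GB18030);
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // two Chinese characters
        String garbled = misread(text.getBytes(GB18030));
        System.out.println("garbled equals original:   " + garbled.equals(text));
        System.out.println("recovered equals original: " + recover(garbled).equals(text));
    }
}
```

The repair only works because the wrong charset here maps every byte to a character. If the wrong decoder had replaced invalid bytes with a substitute character, the original bytes could not be reconstructed.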
There seems to be another piece of evidence that the Dalvik VM uses UTF-16:
http://developer.android.com/reference/java/util/regex/Pattern.html
Escape sequences
\        Quote the following metacharacter (so \. matches a literal .).
\Q       Quote all following metacharacters until \E.
\E       Stop quoting metacharacters (started by \Q).
\\       A literal backslash.
\uhhhh   The Unicode character U+hhhh (in hex).
\xhh     The Unicode character U+00hh (in hex).
\cx      The ASCII control character ^x (so \cH would be ^H, U+0008).
\a       The ASCII bell character (U+0007).
\e       The ASCII ESC character (U+001b).
\f       The ASCII form feed character (U+000c).
\n       The ASCII newline character (U+000a).
\r       The ASCII carriage return character (U+000d).
\t       The ASCII tab character (U+0009).

The escape sequences in this table are clearly not UTF-32 units.
\uhhhh, however, is exactly what a UTF-16 code unit can hold, and with two \uhhhh escapes (a surrogate pair) every code point beyond U+FFFF can be represented as well.
As for UTF-8, I would expect at least a leading 0 before taking something for UTF-8.
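
How a code point beyond U+FFFF turns into two \uhhhh units can be worked out by hand. The sketch below (SurrogateDemo and toSurrogates are illustrative names) applies the standard UTF-16 arithmetic and compares the result with Character.toChars.

```java
class SurrogateDemo {
    // Manual UTF-16 encoding of a supplementary code point (above U+FFFF):
    // subtract 0x10000, then split the remaining 20 bits into two halves.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x1F609);
        char[] library = Character.toChars(0x1F609);
        System.out.println(String.format("manual:  %04X %04X", (int) manual[0], (int) manual[1]));
        System.out.println(String.format("library: %04X %04X", (int) library[0], (int) library[1]));
    }
}
```

For U+1F609 both lines print D83D DE09, the same pair that appears in the UTF-16BE byte dump earlier.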

A few open questions

    1. Does the Android Dalvik VM use UTF-16BE or UTF-16LE?
    2. Is the use of UTF-16 required by the JVM specification, or is it each virtual machine's own choice?
    3. Are Android classes stored as UTF-8 and converted to UTF-16 when they are loaded into the virtual machine?
    4. Android log output uses UTF-8, so doesn't the VM have to do a conversion? What efficiency considerations are behind this?

Corrections and criticism are welcome.
