When a Unicode string is written to a text file or some other storage, the Unicode scalars in that string are encoded in one of several encoding forms defined by Unicode. Each form encodes the string in small chunks known as code units. These include the UTF-8 encoding form (which encodes a string as 8-bit code units), the UTF-16 encoding form (which encodes a string as 16-bit code units), and the UTF-32 encoding form (which encodes a string as 32-bit code units).
Swift provides several different ways to access the Unicode representations of strings. You can iterate over a string with a for-in statement to access its individual Character values as Unicode extended grapheme clusters. This process is described in Working with Characters.
Alternatively, you can access a String value in one of three other Unicode-compliant representations:
A collection of UTF-8 code units (accessed with the string's utf8 property)
A collection of UTF-16 code units (accessed with the string's utf16 property)
A collection of 21-bit Unicode scalar values, equivalent to the string's UTF-32 encoding form (accessed with the string's unicodeScalars property)
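As a rough illustration of how these views differ, the sketch below counts the elements each one exposes for a short string. The constant name sample and the other names are hypothetical and used only for this illustration; the string is written with \u{...} escapes so that it survives any text encoding.

```swift
// Hypothetical sample string: D, o, g, U+203C, U+1F436.
let sample = "Dog\u{203C}\u{1F436}"

let characterCount = sample.count               // extended grapheme clusters
let utf8Count = sample.utf8.count               // 8-bit code units
let utf16Count = sample.utf16.count             // 16-bit code units
let scalarCount = sample.unicodeScalars.count   // 21-bit Unicode scalars

print(characterCount, utf8Count, utf16Count, scalarCount)
// Prints "5 10 6 5"
```

The same five characters take ten code units in UTF-8 but only six in UTF-16, which is why the count of a view should not be assumed to match the number of characters.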
Each example below shows a different representation of the following string, which is made up of the characters D, o, g, ‼ (DOUBLE EXCLAMATION MARK, or Unicode scalar U+203C), and 🐶 (DOG FACE, or Unicode scalar U+1F436):
let dogString = "Dog‼🐶"
UTF-8 Representation
You can access a UTF-8 representation of a String by iterating over its utf8 property. This property is of type String.UTF8View, which is a collection of unsigned 8-bit (UInt8) values, one for each byte in the string's UTF-8 representation:
Character   Unicode scalar   UTF-8 code units    Positions
D           U+0044           68                  0
o           U+006F           111                 1
g           U+0067           103                 2
‼           U+203C           226 128 188         3-5
🐶          U+1F436          240 159 144 182     6-9
for codeUnit in dogString.utf8 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 226 128 188 240 159 144 182 "
In the above example, the first three decimal codeUnit values (68, 111, 103) represent the characters D, o, and g, and their UTF-8 representation is the same as ASCII representation. The next three decimal codeUnit values (226, 128, 188) are the 3-byte UTF-8 representation of DOUBLE EXCLAMATION MARK. The last four codeUnit values (240, 159, 144, 182) are the 4-byte UTF-8 representation of DOG FACE.
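To see where the three bytes for DOUBLE EXCLAMATION MARK come from, the following sketch hand-encodes U+203C with the standard three-byte UTF-8 bit pattern and compares the result with what Swift reports. The names scalarValue, manualBytes, and swiftBytes are hypothetical, and the bit pattern shown applies only to scalars in the range U+0800 through U+FFFF.

```swift
// Hand-encode U+203C using the three-byte UTF-8 pattern
// 1110xxxx 10xxxxxx 10xxxxxx (used for scalars U+0800 through U+FFFF).
let scalarValue: UInt32 = 0x203C

let byte1 = UInt8(0xE0 | (scalarValue >> 12))           // top 4 bits
let byte2 = UInt8(0x80 | ((scalarValue >> 6) & 0x3F))   // middle 6 bits
let byte3 = UInt8(0x80 | (scalarValue & 0x3F))          // low 6 bits
let manualBytes = [byte1, byte2, byte3]

// Compare against the bytes Swift reports for the same character.
let swiftBytes = Array("\u{203C}".utf8)

print(manualBytes, manualBytes == swiftBytes)
// Prints "[226, 128, 188] true"
```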
UTF-16 Representation
You can access a UTF-16 representation of a String by iterating over its utf16 property. This property is of type String.UTF16View, which is a collection of unsigned 16-bit (UInt16) values, one for each 16-bit code unit in the string's UTF-16 representation:
Character   Unicode scalar   UTF-16 code units   Positions
D           U+0044           68                  0
o           U+006F           111                 1
g           U+0067           103                 2
‼           U+203C           8252                3
🐶          U+1F436          55357 56374         4-5
for codeUnit in dogString.utf16 {
    print("\(codeUnit) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 55357 56374 "
Again, the first three codeUnit values (68, 111, 103) represent the characters D, o, and g, whose UTF-16 code units have the same values as in the string's UTF-8 representation (because these Unicode scalars represent ASCII characters).
The fourth codeUnit value (8252) is the decimal equivalent of the hexadecimal value 203C, which is the Unicode scalar U+203C for the DOUBLE EXCLAMATION MARK character. This character can be represented as a single code unit in UTF-16.
The fifth and sixth codeUnit values (55357 and 56374) are the UTF-16 surrogate pair representation of the DOG FACE character. These values are a high-surrogate value of U+D83D (decimal value 55357) and a low-surrogate value of U+DC36 (decimal value 56374).
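The surrogate pair for DOG FACE can be derived by hand. The sketch below follows the standard UTF-16 rule for scalars above U+FFFF: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The names offset, highSurrogate, lowSurrogate, and swiftUnits are hypothetical and used only for illustration.

```swift
// Derive the UTF-16 surrogate pair for U+1F436 by hand.
let dogFaceValue: UInt32 = 0x1F436
let offset = dogFaceValue - 0x10000                     // 0xF436, a 20-bit value

let highSurrogate = UInt16(0xD800 + (offset >> 10))     // top 10 bits
let lowSurrogate = UInt16(0xDC00 + (offset & 0x3FF))    // low 10 bits

// Compare against the code units Swift reports for the same character.
let swiftUnits = Array("\u{1F436}".utf16)

print(highSurrogate, lowSurrogate, [highSurrogate, lowSurrogate] == swiftUnits)
// Prints "55357 56374 true"
```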
Unicode Scalar Representation
You can access a Unicode scalar representation of a String value by iterating over its unicodeScalars property. This property is of type UnicodeScalarView, which is a collection of values of type UnicodeScalar. Each UnicodeScalar holds a 21-bit Unicode scalar value.
Each UnicodeScalar has a value property that returns the scalar's 21-bit value, represented within a UInt32 value:
Character   Unicode scalar   Scalar value (decimal)   Position
D           U+0044           68                       0
o           U+006F           111                      1
g           U+0067           103                      2
‼           U+203C           8252                     3
🐶          U+1F436          128054                   4
for scalar in dogString.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}
print("")
// Prints "68 111 103 8252 128054 "
The value properties of the first three UnicodeScalar values (68, 111, 103) once again represent the characters D, o, and g. The fourth value (8252) is again the decimal equivalent of the hexadecimal value 203C, which is the Unicode scalar U+203C for the DOUBLE EXCLAMATION MARK character.
The value property of the fifth and final UnicodeScalar, 128054, is the decimal equivalent of the hexadecimal value 1F436, which is the Unicode scalar U+1F436 for the DOG FACE character.
As an alternative to querying their value properties, each UnicodeScalar value can also be used to construct a new String value, such as with string interpolation:
for scalar in dogString.unicodeScalars {
    print("\(scalar)")
}
// D
// o
// g
// ‼
// 🐶
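Going the other way, a scalar can be built from its numeric value. The sketch below assumes the failable Unicode.Scalar initializer that takes a UInt32. The names value, hex, and surrogate are hypothetical. It also shows that a surrogate code point such as U+D83D, which appears as a code unit in the UTF-16 view, is not itself a valid Unicode scalar.

```swift
// Build a scalar from its numeric value. The initializer is failable
// because not every UInt32 is a valid Unicode scalar value.
let value: UInt32 = 128054
let hex = String(value, radix: 16, uppercase: true)

if let dogFace = Unicode.Scalar(value) {
    print("U+\(hex) is \(dogFace)")
}
// Prints "U+1F436 is 🐶"

// Surrogate code points such as U+D83D are not scalar values.
let surrogate = Unicode.Scalar(0xD83D as UInt32)
print(surrogate == nil)
// Prints "true"
```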
Swift study notes-Strings and Characters-Unicode Representations of Strings