Chinese character storage format in SIM card

Last Update:2014-11-04 Source: Internet

Author: User

UCS2 format in SIM card

Text on the SIM card is stored in UCS2 format. UCS2 on the SIM and "Unicode" as the term is used in VC (Windows wide characters) differ only in byte order: Windows Unicode is little-endian, while UCS2 on the SIM is big-endian.

The WideCharToMultiByte and MultiByteToWideChar functions in VC can be used to convert between UCS2 and GB2312.
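As a rough illustration of the byte-order difference and of the GB2312 conversion, here is a minimal Python sketch. It uses Python's built-in codecs as a stand-in for the Windows functions, and it assumes that every character is in the Basic Multilingual Plane, where UTF-16 and UCS2 produce the same bytes. The helper names gb2312_to_ucs2_be and ucs2_be_to_gb2312 are made up for this example.

```python
# The Chinese word for "China" (U+4E2D U+56FD).
text = "\u4e2d\u56fd"

# Big-endian 16-bit code units: the byte order used on the SIM card.
print(text.encode("utf-16-be").hex())  # 4e2d56fd
# Little-endian 16-bit code units: the byte order of Windows wide characters.
print(text.encode("utf-16-le").hex())  # 2d4efd56
# GB2312 internal code of the same two characters.
print(text.encode("gb2312").hex())     # d6d0b9fa


def gb2312_to_ucs2_be(data: bytes) -> bytes:
    """Convert GB2312 bytes to big-endian UCS2 bytes (BMP characters assumed)."""
    return data.decode("gb2312").encode("utf-16-be")


def ucs2_be_to_gb2312(data: bytes) -> bytes:
    """Convert big-endian UCS2 bytes to GB2312 bytes.

    Raises UnicodeEncodeError for characters that GB2312 cannot represent.
    """
    return data.decode("utf-16-be").encode("gb2312")
```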

UCS2 on the SIM itself has three formats. The commonly used one is the 80 format: the field starts with 80 and every two bytes represent one character. There are also the 81 and 82 formats, which can represent a Chinese character in a single byte. The 80, 81, and 82 formats and GB2312 can be converted to one another under specific conditions.

Below is the relevant part of the specification, with brief explanations.

Annex B (normative):
Coding of Alpha fields in the SIM for UCS2

If 16 bit UCS2 characters as defined in ISO/IEC 10646 [31] are being used in an alpha field, the coding can take one of three forms. If the ME supports UCS2 coding of alpha fields in the SIM, the ME shall support all three coding schemes for character sets containing 128 characters or less; for character sets containing more than 128 characters, the ME shall at least support the first coding scheme. If the alpha field record contains GSM default alphabet characters only, then none of these schemes shall be used in that record. Within a record, only one coding scheme, either GSM default alphabet, or one of the three described below, shall be used.

1) If the first octet in the alpha string is '80', then the remaining octets are 16 bit UCS2 characters, with the more significant octet (MSO) of the UCS2 character coded in the lower numbered octet of the alpha field, and the less significant octet (LSO) of the UCS2 character coded in the higher numbered alpha field octet, i.e. octet 2 of the alpha field contains the more significant octet (MSO) of the first UCS2 character, and octet 3 of the alpha field contains the less significant octet (LSO) of the first UCS2 character (as shown below). Unused octets shall be set to 'FF', and if the alpha field is an even number of octets in length, then the last (unusable) octet shall be set to 'FF'.

Example 1

Octet 1	Octet 2	Octet 3	Octet 4	Octet 5	Octet 6	Octet 7	Octet 8	Octet 9
'80'	Ch1MSO	Ch1LSO	Ch2MSO	Ch2LSO	Ch3MSO	Ch3LSO	'FF'	'FF'

This means that data starting with 80 is in UCS2 format, with the high byte of each character first and the low byte second, and unused bytes are filled with FF.

For example, for the Chinese word 中国 ("China"):

The GB2312 internal code is D6D0 B9FA.

In the UCS2 80 scheme it is expressed as 4E2D 56FD (following the leading 80).
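The 80 scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: decode_80 is a hypothetical helper name, the field is assumed to be the raw alpha field bytes, and a pair of FF FF octets is treated as the start of the padding (U+FFFF is not a valid character, so this should not drop real text).

```python
def decode_80(field: bytes) -> str:
    """Decode an alpha field coded with the '80' scheme (big-endian UCS2)."""
    if not field or field[0] != 0x80:
        raise ValueError("not an '80' coded alpha field")
    body = field[1:]
    chars = []
    # Walk the body two octets at a time; a trailing single octet in an
    # even-length field is the unusable octet and is ignored.
    for i in range(0, len(body) - 1, 2):
        pair = body[i:i + 2]
        if pair == b"\xff\xff":  # assumption: FF FF marks unused space
            break
        chars.append(chr(int.from_bytes(pair, "big")))
    return "".join(chars)


# The "China" example, padded with FF to a 9-octet field.
print(decode_80(bytes.fromhex("804e2d56fdffffffff")))
```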

2) If the first octet of the alpha string is set to '81', then the second octet contains a value indicating the number of characters in the string, and the third octet contains an 8 bit number which defines bits 15 to 8 of a 16 bit base pointer, where bit 16 is set to zero, and bits 7 to 1 are also set to zero. These sixteen bits constitute a base pointer to a "half-page" in the UCS2 code space, to be used with some or all of the remaining octets in the string. The fourth and subsequent octets in the string contain codings as follows: if bit 8 of the octet is set to zero, the remaining 7 bits of the octet contain a GSM default alphabet character, whereas if bit 8 of the octet is set to one, then the remaining seven bits are an offset value added to the 16 bit base pointer defined earlier, and the resultant 16 bit value is a UCS2 code point, and completely defines a UCS2 character.

Example 2

Octet 1	Octet 2	Octet 3	Octet 4	Octet 5	Octet 6	Octet 7	Octet 8	Octet 9
'81'	'05'	'13'	'53'	'95'	'A6'	'XX'	'FF'	'FF'

In the above example:

-Octet 2 indicates there are 5 characters in the string.

-Octet 3 indicates bits 15 to 8 of the base pointer, and indicates a bit pattern of 0hhh hhhh h000 0000 as the 16 bit base pointer number. Bengali characters for example start at code position 0980 (0000 1001 1000 0000), which is indicated by the coding '13' in octet 3 (bits 15 to 8 of the pattern, 000 1001 1).

-Octet 4 indicates GSM default alphabet character '53', i.e. "S".

-Octet 5 indicates a UCS2 character offset to the base pointer of '15', expressed in binary as 001 0101, which, when added to the base pointer value, results in a sixteen bit value of 0000 1001 1001 0101, i.e. '0995', which is the Bengali letter KA.

-Octet 8 contains the value 'FF', but as the string length is 5, this is a valid character in the string, where the bit pattern 111 1111 is added to the base pointer, yielding a sixteen bit value of 0000 1001 1111 1111 for the UCS2 character (i.e. '09FF').
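The arithmetic in Example 2 can be checked with a short Python sketch. The function names base_pointer_81 and ucs2_code are invented for this illustration and are not part of the specification.

```python
def base_pointer_81(octet3: int) -> int:
    """Build the 16 bit base pointer 0hhh hhhh h000 0000 from octet 3."""
    return (octet3 & 0xFF) << 7


def ucs2_code(base: int, octet: int) -> int:
    """Add the 7 bit offset of a data octet (bit 8 set) to the base pointer."""
    if not octet & 0x80:
        raise ValueError("bit 8 is zero: this octet is a GSM default alphabet character")
    return base + (octet & 0x7F)


base = base_pointer_81(0x13)
print(f"{base:04X}")                   # 0980
print(f"{ucs2_code(base, 0x95):04X}")  # 0995
print(f"{ucs2_code(base, 0xFF):04X}")  # 09FF
```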

In other words, the 81 format carries a base address, and each UCS2 character is then represented by a single byte. To display the text, first compute the base address, then compute the 16-bit UCS2 code (as it would appear in the 80 format) for each byte.

Once you have the 80 format codes, the rest is easy.

In terms of layout, 81 is the identifier, followed by a one-byte length, followed by the one-byte base address, and finally the data. The base address byte must be shifted left by seven bits, which leaves the top bit and the low seven bits of the 16-bit base set to 0 (see the specification text above).

Because of these format restrictions, each data byte in the 81 format falls into one of two ranges: 00-7F is a GSM default alphabet character (English letters, digits, and so on), and 80-FF is an offset from the base address. A string can therefore draw on at most 128 different Chinese characters, and their UCS2 codes must all lie within the same block of 128 adjacent codes. A value of 30 and a value of 80 are thus processed differently: 30 directly means '0', while 80 is resolved through the base address (the same is true for the 82 format).

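For example, take the ten Chinese characters 一丁丂七丄丅丆万丈三, whose UCS2 codes are consecutive (U+4E00 to U+4E09). The identifier byte is omitted from the encodings below.

GB2312 internal code: d2bb b6a1 8140 c6df 8141 8142 8143 cdf2 d5c9 c8fd (8140 to 8143 are GBK extension codes)

80 format encoding: 4e00 4e01 4e02 4e03 4e04 4e05 4e06 4e07 4e08 4e09 (consecutive)

81 format encoding: 0a 9c 80 81 82 83 84 85 86 87 88 89 (consecutive)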
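The decoding procedure for the 81 format might look like the following Python sketch. It rests on two assumptions that should be stated loudly: decode_81 and gsm_to_char are hypothetical names, and gsm_to_char only handles the codes where the GSM default alphabet happens to coincide with ASCII (digits, Latin letters, common punctuation). A real decoder would need the full GSM default alphabet table.

```python
def gsm_to_char(code: int) -> str:
    """ASSUMPTION: map only GSM codes that coincide with ASCII; others become U+FFFD."""
    ch = chr(code)
    if ch.isalnum() or ch in " !\"#%&'()*+,-./:;<=>?":
        return ch
    return "\ufffd"


def decode_81(field: bytes) -> str:
    """Decode an alpha field coded with the '81' scheme."""
    if len(field) < 3 or field[0] != 0x81:
        raise ValueError("not an '81' coded alpha field")
    count = field[1]       # number of characters in the string
    base = field[2] << 7   # bits 15 to 8 of the base pointer
    out = []
    for octet in field[3:3 + count]:
        if octet & 0x80:   # bit 8 set: offset from the base pointer
            out.append(chr(base + (octet & 0x7F)))
        else:              # bit 8 clear: GSM default alphabet character
            out.append(gsm_to_char(octet))
    return "".join(out)


# The ten-character example, with the '81' identifier restored in front.
print(decode_81(bytes.fromhex("810a9c80818283848586878889")))
```

Because the character count comes from octet 2, a trailing FF inside the counted range is decoded as a character and not treated as padding, in line with the note on octet 8 in Example 2.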

3) If the first octet of the alpha string is set to '82', then the second octet contains a value indicating the number of characters in the string, and the third and fourth octets contain a 16 bit number which defines the complete 16 bit base pointer to a "half-page" in the UCS2 code space, for use with some or all of the remaining octets in the string. The fifth and subsequent octets in the string contain codings as follows: if bit 8 of the octet is set to zero, the remaining 7 bits of the octet contain a GSM default alphabet character, whereas if bit 8 of the octet is set to one, the remaining seven bits are an offset value added to the base pointer defined in octets three and four, and the resultant 16 bit value is a UCS2 code point, and defines a UCS2 character.

Example 3

Octet 1	Octet 2	Octet 3	Octet 4	Octet 5	Octet 6	Octet 7	Octet 8	Octet 9
'82'	'05'	'05'	'30'	'2D'	'82'	'D3'	'2D'	'31'

In the above example:

-Octet 2 indicates there are 5 characters in the string.

-Octets 3 and 4 contain a sixteen bit base pointer number of '0530', pointing to the first character of the Armenian character set.

-Octet 5 contains a GSM default alphabet character of '2D', which is a dash "-".

-Octet 6 contains a value '82', which indicates it is an offset of '02' added to the base pointer, resulting in a UCS2 character code of '0532', which represents Armenian character Capital BEN.

-Octet 7 contains a value 'D3', an offset of '53', which when added to the base pointer results in a UCS2 code point of '0583', representing Armenian character small PIWR.
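Example 3 can be reproduced with a similar sketch for the 82 format. As before, decode_82 and gsm_to_char are hypothetical names, and gsm_to_char is a deliberate simplification that only covers GSM codes identical to ASCII.

```python
def gsm_to_char(code: int) -> str:
    """ASSUMPTION: map only GSM codes that coincide with ASCII; others become U+FFFD."""
    ch = chr(code)
    if ch.isalnum() or ch in " !\"#%&'()*+,-./:;<=>?":
        return ch
    return "\ufffd"


def decode_82(field: bytes) -> str:
    """Decode an alpha field coded with the '82' scheme."""
    if len(field) < 4 or field[0] != 0x82:
        raise ValueError("not an '82' coded alpha field")
    count = field[1]                          # number of characters
    base = int.from_bytes(field[2:4], "big")  # complete 16 bit base pointer
    out = []
    for octet in field[4:4 + count]:
        if octet & 0x80:                      # offset from the base pointer
            out.append(chr(base + (octet & 0x7F)))
        else:                                 # GSM default alphabet character
            out.append(gsm_to_char(octet))
    return "".join(out)


decoded = decode_82(bytes.fromhex("82050530" "2d82d32d31"))
print([f"{ord(c):04X}" for c in decoded])
# ['002D', '0532', '0583', '002D', '0031']
```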

The encoding is similar to the 81 format. The difference is that 81 gives the base address in one byte (which must be shifted), while 82 gives the complete base address in two bytes.

For the same ten Chinese characters used in the 81 example (一丁丂七丄丅丆万丈三, identifier byte omitted):

GB2312 internal code: d2bb b6a1 8140 c6df 8141 8142 8143 cdf2 d5c9 c8fd

80 format encoding: 4e00 4e01 4e02 4e03 4e04 4e05 4e06 4e07 4e08 4e09 (consecutive)

81 format encoding: 0a 9c 80 81 82 83 84 85 86 87 88 89 (consecutive)

82 format encoding: 0a 4e00 80 81 82 83 84 85 86 87 88 89 (consecutive)
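The conditions under which a string fits each format can be sketched as three small Python encoders. This is an illustration, not a reference implementation: the function names are invented, only characters whose GSM code equals their ASCII code are stored as GSM bytes (an assumption that stands in for the full GSM default alphabet table), all characters are assumed to be in the Basic Multilingual Plane, and padding with FF to the record length is left out.

```python
from typing import List, Optional

# ASSUMPTION: characters whose GSM default alphabet code equals their ASCII code.
GSM_SAFE = set(" !\"#%&'()*+,-./:;<=>?0123456789"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")


def wide_points(text: str) -> List[int]:
    """Code points of the characters that need the base pointer."""
    return [ord(c) for c in text if c not in GSM_SAFE]


def pack(text: str, base: int) -> bytes:
    """One byte per character: GSM code, or 0x80 plus the offset from base."""
    return bytes(ord(c) if c in GSM_SAFE else 0x80 | (ord(c) - base) for c in text)


def encode_80(text: str) -> bytes:
    """Always possible for BMP text: two bytes per character."""
    return b"\x80" + text.encode("utf-16-be")


def encode_81(text: str) -> Optional[bytes]:
    """Possible only if all wide characters share one 128-aligned half-page below 8000."""
    wide = wide_points(text)
    base = (wide[0] & 0x7F80) if wide else 0
    if len(text) > 255 or any(cp > 0x7FFF or (cp & 0x7F80) != base for cp in wide):
        return None
    return bytes([0x81, len(text), base >> 7]) + pack(text, base)


def encode_82(text: str) -> Optional[bytes]:
    """Possible only if all wide characters lie within 128 codes of the lowest one."""
    wide = wide_points(text)
    base = min(wide) if wide else 0
    if len(text) > 255 or any(cp > 0xFFFF or cp - base > 0x7F for cp in wide):
        return None
    return bytes([0x82, len(text)]) + base.to_bytes(2, "big") + pack(text, base)


sample = "".join(chr(0x4E00 + i) for i in range(10))
print(encode_80(sample).hex())
print(encode_81(sample).hex())
print(encode_82(sample).hex())
print(encode_81("\u4e2d\u56fd"))  # None: the two characters are in different half-pages
```

For the ten consecutive characters all three formats work, and the 81 and 82 results take roughly half the space of the 80 result. For a word such as 中国, whose two codes are far apart, only the 80 format applies.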

Correction

An 81 or 82 string can draw on at most 128 different Chinese characters and anywhere from 0 to 128 English (GSM default alphabet) characters; it can also be entirely in English.

This article is an English version of an article originally written in Chinese on aliyun.com.
