IT Glossary: Compilation

Source: Internet
Author: User

32-bit

The bit width of a computer refers to the maximum number of bits that the CPU can process at one time. The CPU of a 32-bit computer can process up to 32 bits of data at a time; for example, its EAX register is 32 bits wide. A 32-bit computer can of course also process 16-bit and 8-bit data. When Intel moved from the 16-bit 286 to the 386, it offered the 386SX for compatibility with 16-bit systems: this CPU computes on 32 bits internally, but its external data bus is 16 bits wide. The 386DX, by contrast, is 32-bit both internally and externally.

Bit

In a binary system, each 0 or 1 is a bit, which is the smallest unit of data storage.

Byte

A byte is composed of eight bits and can represent one character: a letter (A ~ Z), a digit (0 ~ 9), or a symbol (, . ? ! % & + - * /). It is the basic unit for storing data in memory.

1 byte = 8 bits

1 KB = 1024 bytes = 2^10 bytes

1 MB = 1024 KB = 2^20 bytes

1 GB = 1024 MB = 2^30 bytes

Word

On x86, 1 word is 16 bits, which is 2 bytes
1 byte is 8 bits

ASCII code

Information exchange requires a unified encoding standard for text symbols, so that computers of different brands and models can share one standardized set of interchange codes. For this purpose ASCII (American Standard Code for Information Interchange) was drawn up by the American Standards Association, the body now known as ANSI, as the standard code for data transmission. Originally 7 bits were used to represent English letters, the digits 0 ~ 9, and symbols. Today 8 bits are commonly used, giving 256 different characters and symbols; this is the most common and widely used standard English code across computer systems. By comparison, the most widely used internal code on Traditional Chinese systems is Big5.

ANSI Encoding

Both Unicode and ANSI are character encodings.

To let computers support more languages, two bytes in the range 0x80 ~ 0xFF are commonly used to represent one character. For example, in a Chinese operating system the character 中 is stored as the bytes [0xD6, 0xD0]. Different countries and regions have developed different standards, resulting in GB2312, Big5, JIS, and other encoding standards. These extended encodings, which use two bytes to represent one character, are collectively called ANSI encoding. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese system, ANSI encoding means JIS.

Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two such languages cannot be stored in the same ANSI-encoded text.
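The incompatibility between ANSI encodings can be sketched in a few lines of Python. This is only an illustration: the codec names `gb2312` and `shift_jis` are Python's names for the Simplified Chinese and Japanese code pages, and the variable names are made up for the example.

```python
# Encode one Chinese character with a legacy "ANSI" code page and inspect the bytes.
text = "中"

gb_bytes = text.encode("gb2312")      # Simplified Chinese code page
print([hex(b) for b in gb_bytes])     # both bytes fall in 0x80 ~ 0xFF

# Reading the same two bytes under a different ANSI encoding yields different text.
as_japanese = gb_bytes.decode("shift_jis", errors="replace")
print(as_japanese == text)
```

The bytes themselves carry no label saying which code page produced them, which is why mixed-language text cannot be stored safely in one ANSI file.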

Unicode

Unicode makes it easier for machines to handle characters in any language. Unicode is maintained by the Unicode Consortium, which is responsible for its technical revisions. Technical standards such as Java, LDAP, and XML require Unicode support. Every Unicode character is assigned a code point, written as U+XXXX, where each X is a hexadecimal digit.
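As a small illustration of the code point notation, the following Python sketch formats a character's code point as "U+" followed by at least four hexadecimal digits. The helper name `code_point_label` is invented for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

print(code_point_label("A"))
print(code_point_label("中"))
```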

For English, ASCII codes 0-127 are sufficient for every character. For Chinese characters, two bytes must be used to represent one character, and the first byte must be greater than 127 (so the usual test in a program for Chinese text is whether a byte value is greater than 127). This use of two bytes per Chinese character is called DBCS (double-byte character set). By contrast, a one-byte English character code is called SBCS (single-byte character set).

Although DBCS is enough to handle mixed Chinese and English text, converting between the byte codes of different character systems, such as Chinese, English, Japanese, and Korean, is very troublesome. To solve this problem, the ISO/IEC JTC1/SC2/WG2 working group was established in April 1984 to unify the encoding of the text and symbols of different countries. In 1991, several U.S. multinational companies set up the Unicode Consortium, which reached an agreement with WG2 in October 1991 to use the same coded character set. At this stage Unicode uses a 16-bit encoding, and its character set is identical to the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed the DIS (Draft International Standard) stage. The current version, V2.0, was released in 1996. It contains 6811 symbols, 20902 Chinese characters, 11172 Korean Hangul syllables, 6400 private-use code points, and 20249 reserved code points, for a total of 65534.

With the rapid development of the Internet, the demand for data exchange keeps growing, and differing coding systems are increasingly an obstacle to information exchange. Documents in which several languages coexist are also becoming more common. These problems are hard to solve with code pages alone, and so Unicode came into being.

The name Unicode has two meanings. First, it refers to the international coding standard ISO/IEC 10646, also called the big character set, an important international standard promulgated by ISO in 1993 whose purpose is to unify the encoding of all the world's written languages. Second, it is the name of a consortium formed by large U.S. companies such as HP, Microsoft, IBM, and Apple, whose purpose is to promote a unified multilingual encoding.

The most significant difference between Unicode and the popular code pages is that Unicode is a full two-byte encoding: ASCII characters are also expressed in two bytes. A code page instead uses the value range of the high byte to decide whether it is an ASCII character or the first byte of a Chinese character, so if data is corrupted and some content is damaged, the Chinese characters that follow are misread. Because Unicode always uses two bytes per character, its most obvious advantage is that it simplifies the processing of Chinese characters.

Unicode uses planes to describe the encoding space. Each plane is divided into 256 rows of 256 columns, and the plane number occupies the bytes above this two-byte row and column code.

The first plane of Unicode is called the Basic Multilingual Plane, or BMP for short. Because a BMP character is represented in only two bytes, it is the most favored.
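The plane, row, and column of a character can be read directly off its code point. The sketch below is illustrative only; `locate` is a made-up helper, and it assumes code points within the Unicode range, where everything above the low two bytes is the plane number.

```python
def locate(ch: str) -> tuple[int, int, int]:
    """Split a character's code point into (plane, row, column)."""
    cp = ord(ch)
    plane = cp >> 16          # everything above the low two bytes
    row = (cp >> 8) & 0xFF    # high byte of the two-byte code
    column = cp & 0xFF        # low byte of the two-byte code
    return plane, row, column

print(locate("A"))            # a BMP character: plane 0
print(locate("中"))           # also in the BMP
print(locate("\U0001F600"))   # a character outside the BMP
```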

The initial objective of Unicode was to use a 16-bit encoding to provide a mapping for over 65000 characters. But this is not enough: it cannot cover all historical scripts, and it brings implementation headaches, especially for network-based applications. Unicode therefore defines three encoding forms, with some code points reserved to support them: UTF-8, UTF-16, and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, representing each character in one or several bytes. The biggest benefit of this approach is that UTF-8 keeps the ASCII encoding as a subset; for example, "A" is encoded as 0x41 in both UTF-8 and ASCII. UTF-16 and UTF-32 are the 16-bit and 32-bit Unicode encoding forms, respectively. Given this original objective, "Unicode" often refers to UTF-16.
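To make the three encoding forms concrete, here is a short Python sketch that encodes the same character in each of them. The helper `encodings` is hypothetical, and the big-endian codec variants are chosen only so that no byte-order mark is prepended.

```python
def encodings(ch: str) -> dict[str, bytes]:
    """Encode one character in the three Unicode encoding forms (no byte-order mark)."""
    return {
        "utf-8": ch.encode("utf-8"),
        "utf-16": ch.encode("utf-16-be"),
        "utf-32": ch.encode("utf-32-be"),
    }

for ch in ("A", "中"):
    print(ch, {name: data.hex() for name, data in encodings(ch).items()})
```

For "A" the UTF-8 form is the single byte 0x41, the same as ASCII, while a Chinese character takes several bytes in UTF-8.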

For many years the American Standard Code for Information Interchange (ASCII) has been used to represent characters: letters, digits, punctuation, and control characters. This encoding is no problem for English, but to represent other languages, such as Arabic, Chinese, and Japanese, it must be extended. In May 1987, Joe Becker and Lee Collins at the Xerox Palo Alto Research Center and Mark Davis of Apple set out to study a character encoding suitable for multilingual text processing. This encoding was quickly supported by many large companies, all of which sent representatives to the Unicode research group, and the work progressed rapidly. Because the group's members include the world's leading system and software manufacturers, Unicode soon became the de facto industry standard.

A Unicode-based system can use 65000 different characters, enough to cover the letters of all the world's languages plus thousands of symbols.

The General Scripts area contains 19 scripts, including ASCII, Latin1, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, and Georgian. Unicode also includes a large number of Chinese, Japanese, and Korean characters.

Unicode is a fixed-length, 2-byte multilingual character set encoding. It builds on the existing standards of the relevant countries and regions, including GB2312, CNS 11643, JIS 0208, and KSC 5601. Unicode can represent mixed-language text, and it stays consistent with ISO 10646.

Unicode features:

Characters of every country's script are expressed in two bytes. For example, "A" in Unicode is the pair of hexadecimal bytes 41 and 00, stored as 41 00, where the byte 41 is the ASCII code (65 = A). Windows NT/2000 represents text internally in Unicode. For example, SQL files generated by MS SQL Server can be saved either in Unicode or in the normal format; if you save them in Unicode, many programs on the 95/98 platform cannot read the format correctly.

Unicode is easily made compatible with ASCII: an ASCII code with a zero byte added as its high byte is the corresponding Unicode character.
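The zero-byte relationship between ASCII and two-byte Unicode can be checked with a short sketch. It assumes the low-byte-first (little-endian) layout used on Windows, and `ascii_to_utf16le` is a name invented for the example.

```python
def ascii_to_utf16le(ch: str) -> bytes:
    """Widen one ASCII character to two bytes by adding a zero high byte."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return bytes([code, 0x00])   # low byte first, then the zero high byte

widened = ascii_to_utf16le("A")
print(widened.hex())
print(widened == "A".encode("utf-16-le"))
```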

You may also notice that in the Windows 95/98 API definitions, many function names end with A, for example

WriteProfileStringA

The NT/2000 operating systems provide two versions of such an API; the other one here is WriteProfileStringW. The versions ending with W are intended for NT/2000. (On NT, calling the function ending with W is faster than calling the one ending with A, because no conversion between Unicode and DBCS/SBCS is needed.)

In this way, we often use the function to judge the string length. The execution results in NT and 95/98 are different, as follows: (the following code is suitable for VB and Asp)

Medium 95/98:

Len ("ABC China") returns 7 (because each Chinese character serves as two ASCII codes)

NT/2000:

Len ("ABC China") returns 5 (because each character is considered Unicode)

UTF-8

UTF-8 is a storage and transfer format. As mentioned above, each Unicode/UCS character is stored in 2 or 4 bytes. Consider the comparison below:

Take "I am Chinese" as an example:
Stored as ANSI: 12 bytes
Stored as Unicode/UCS-2: 24 bytes + 2 bytes (header)
Stored as UCS-4: 48 bytes + 4 bytes (header)

Take "我是中国人" as an example:
Stored as ANSI: 10 bytes
Stored as Unicode/UCS-2: 10 bytes + 2 bytes (header)
Stored as UCS-4: 20 bytes + 4 bytes (header)

As this shows, storing text in the raw Unicode/UCS form is very wasteful and is poorly suited to transmission over the Internet.
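The byte counts in this comparison can be recomputed with Python's codecs. This is a sketch under a few assumptions: `gbk` stands in for the ANSI code page, `utf-16-le` and `utf-32-le` stand in for UCS-2 and UCS-4 (equivalent for these BMP characters), and the "header" is taken to be the byte-order mark. A UTF-8 column is added for comparison.

```python
import codecs

def sizes(text: str) -> dict[str, int]:
    """Byte counts of the same text under several encodings, without headers."""
    return {
        "ANSI": len(text.encode("gbk")),
        "UCS-2": len(text.encode("utf-16-le")),
        "UCS-4": len(text.encode("utf-32-le")),
        "UTF-8": len(text.encode("utf-8")),
    }

for sample in ("I am Chinese", "我是中国人"):
    print(sample, sizes(sample))

# Header sizes: the byte-order marks written at the start of UCS-2 / UCS-4 files.
print(len(codecs.BOM_UTF16_LE), len(codecs.BOM_UTF32_LE))
```

For pure English text UTF-8 is as compact as ANSI, which is the waste the raw two-byte and four-byte forms cannot avoid.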

For this reason a compact transformation format of Unicode/UCS appeared: UTF-8. To quote the first sentence of the official website: "UTF-8 stands for Unicode Transformation Format-8. It is an octet (8-bit) lossless encoding of Unicode characters." Because UTF also applies to the encoding of UCS, it is also known as "UCS Transformation Format (UTF)".

UTF-8 encodes with 8 bits (1 byte) as its basic unit. There are also 16-bit and 32-bit forms, called UTF-16 and UTF-32 respectively, but they are used less for this purpose; UTF-8 is widely used in file storage and network transmission.

UCS

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character sets and guarantees round-trip compatibility with them: if you convert any text string to UCS and then convert it back to the original encoding, no information is lost.

In addition to assigning a code to each character, UCS also gives each one a formal name. A hexadecimal number denoting a UCS or Unicode value is usually preceded by "U+"; for example, U+0041 denotes the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to US-ASCII (ISO 646), and U+0000 to U+00FF are identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF, together with larger ranges outside the BMP, is reserved for private use.
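Several of these properties can be spot-checked with Python's standard `unicodedata` module. The sketch below is illustrative; the variable names are made up, and `gb2312` is just one example of a legacy character set used for the round trip.

```python
import unicodedata

# Formal character name for U+0041.
name = unicodedata.name("\u0041")

# Latin-1 bytes 0x00 ~ 0xFF map one-to-one onto U+0000 ~ U+00FF.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = [ord(c) for c in latin1_text] == list(range(256))

# Round trip: legacy encoding -> Unicode string -> legacy encoding loses nothing.
original = "中文".encode("gb2312")
restored = original.decode("gb2312").encode("gb2312")

# U+E000 lies in the private use area (general category "Co").
category = unicodedata.category("\ue000")

print(name, latin1_matches, restored == original, category)
```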

Segment

Assembly language works with several kinds of segments. A segment (such as the code segment) holds specific content, such as variables. Each such item has its own address: the segment address marks the start of the segment, and an offset address locates the item within it. The segment address is the segment's location in memory, and the offset is the distance of, say, a statement in the code segment from the segment address. Physical address = segment address * 10H + offset address (10H = 16D); that is, the segment address is shifted left by one hexadecimal digit (four binary bits) and the offset address is added to give the physical address. For example, the physical address of 3017:000A is 3017H * 10H + 000AH = 3017AH.
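The address formula can be written as a one-line function. This is a minimal sketch; the name `physical_address` is invented here, and masking the result to 20 bits is an added assumption that models an 8086-style 1 MB address space.

```python
def physical_address(segment: int, offset: int) -> int:
    """Real-mode address: segment * 10H + offset, kept to 20 bits (assumed 8086 wrap)."""
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x3017, 0x000A)))
```

Shifting left by four bits is the same as multiplying by 10H, which is why the segment value simply gains one trailing hexadecimal zero before the offset is added.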

The 8086 has an addressing range of 1 MB; the 80386 and later CPUs have an addressing range of 4 GB.
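These two figures follow from the number of address lines: 20 on the 8086 and 32 on the 80386. The short sketch below works through the arithmetic; the helper name is hypothetical.

```python
def addressable_bytes(address_lines: int) -> int:
    """Each extra address line doubles the number of addressable bytes."""
    return 2 ** address_lines

MB = 2 ** 20
GB = 2 ** 30

print(addressable_bytes(20) // MB, "MB")   # 8086: 20 address lines
print(addressable_bytes(32) // GB, "GB")   # 80386: 32 address lines
```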

B-binary
O-octal
D-decimal
H-hexadecimal
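These suffixes are how assembly-style literals such as 10H or 16D name their base. A small parser makes the convention explicit; this is only a sketch, and `parse_suffixed` and the `BASES` table are names made up for the example.

```python
BASES = {"B": 2, "O": 8, "D": 10, "H": 16}

def parse_suffixed(literal: str) -> int:
    """Parse a number whose last letter names its base, e.g. '3017H' or '1010B'."""
    literal = literal.strip().upper()
    base = BASES[literal[-1]]          # the final character selects the base
    return int(literal[:-1], base)     # the rest is the digits in that base

print(parse_suffixed("10H"), parse_suffixed("16D"))
```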
