Files under Windows are displayed in Fedora

Source: Internet
Author: User

Some preliminary knowledge:

On Chinese-language Windows, the default encoding is a local (regional) code page, such as GBK, GB2312, or GB18030.

GBK corresponds to Windows code page 936 (CP936).

Modern Windows (the NT family) uses Unicode (UTF-16) internally. The legacy ANSI code pages are still supported for older applications, but they are discouraged because the national and regional standards are not uniform, which makes conversion inconvenient.

Does the encoding used by Android change according to the region?

In the ASCII era, the character set and the encoding were not distinguished.

A character set (charset) is a collection containing a certain number of characters. Each character has a corresponding ID value, called a code point. What is actually stored is not necessarily the code point itself (for example, in order to save space); the code point is converted first. This conversion rule is the encoding.
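As a small illustration of the difference between a code point and its encoding, here is a minimal Python sketch. The sample character and the variable names are chosen only for this example.

```python
# One character, one code point, several possible stored forms.
ch = "中"

code_point = ord(ch)             # the character's ID in the Unicode character set
utf8_bytes = ch.encode("utf-8")  # the same character stored as UTF-8
gbk_bytes = ch.encode("gbk")     # the same character stored as GBK

print(hex(code_point))   # 0x4e2d
print(utf8_bytes.hex())  # e4b8ad
print(gbk_bytes.hex())   # d6d0
```

The code point is a single number, while the stored bytes depend on which encoding is applied.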

How many character sets are there?

Terminology:

Character encoding

Code Point/code position

Character map

The BOM (byte order mark) appears at the very beginning of a text file and identifies the encoding (and byte order) in which the file is stored.

The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) before the byte stream itself. If the recipient receives FE FF, the byte stream is big-endian; if it receives FF FE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
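The two byte orders of U+FEFF can be checked directly. This is a minimal sketch using Python's built-in codecs; the variable names are illustrative.

```python
import codecs

bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte order mark

big_endian = bom.encode("utf-16-be")     # high byte first
little_endian = bom.encode("utf-16-le")  # low byte first

print(big_endian.hex())     # feff
print(little_endian.hex())  # fffe

# The standard library exposes the same byte sequences as constants.
print(big_endian == codecs.BOM_UTF16_BE)     # True
print(little_endian == codecs.BOM_UTF16_LE)  # True
```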

ANSI (ANSI encoding): on Windows, the locale-specific default code page.

ASCII, GBK/GB2312, and Big5 are all kinds of ANSI encoding.

Different ANSI encodings are incompatible with each other.
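The incompatibility is easy to reproduce: bytes written under one ANSI encoding turn into different text when read under another. A minimal sketch, assuming a short sample string and using `errors="replace"` so that undecodable bytes do not raise:

```python
text = "中文"
raw = text.encode("gbk")  # bytes as a GBK system would store them

as_gbk = raw.decode("gbk")                       # read back with the right encoding
as_big5 = raw.decode("big5", errors="replace")   # read back with the wrong encoding

print(as_gbk == text)   # True
print(as_big5 == text)  # False: the same bytes mean something else in Big5
```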

UCS: Universal Coded Character Set (ISO/IEC 10646), the character set that Unicode shares.

UTF: UCS Transformation Format (also read as Unicode Transformation Format), as in UTF-8.

Encoding scheme

UTF-7, UTF-8, and UTF-16 are Unicode-based encoding schemes.

Unicode: its character set is kept in step with ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set.

Range: 0 to 0x10FFFF, i.e. 1,114,112 code points (17 x 65,536: 17 planes, each with 65,536 code points)
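The arithmetic above can be verified in a few lines. In this sketch the constant names and the `plane_of` helper are made up for illustration.

```python
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points per plane

total_code_points = MAX_CODE_POINT + 1
plane_count = total_code_points // PLANE_SIZE

def plane_of(ch: str) -> int:
    """Return the plane number (0..16) of a single character."""
    return ord(ch) >> 16

print(total_code_points)  # 1114112
print(plane_count)        # 17
print(plane_of("中"))      # 0 (Basic Multilingual Plane)
print(plane_of("😀"))      # 1 (a supplementary plane)
```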

Nearly 100,000 characters had been defined when this was written, of which more than 70,000 are Chinese characters.

  

Unicode is the character set; UTF-8 is an encoding of this character set, and UTF-16 is another encoding of the same character set.

MBCS (Multi-Byte Character Set) is a class of encodings, not the name of one particular encoding.

UTF-16, like UTF-8, is a variable-length encoding. The granularity of growth differs: UTF-8 grows in 8-bit units and UTF-16 in 16-bit units. UTF-32, by contrast, is fixed-length and always uses 32 bits per character.

Because every character takes at least two bytes, UTF-16 is not compatible with ASCII (one byte per character).
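The different unit sizes show up when the same characters are encoded in each form. A minimal sketch; the sample characters and the `sizes` dictionary are chosen for this example only.

```python
samples = ["A", "中", "😀"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Byte length of each sample character under each encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, row in sizes.items():
    print(ch, row)

# UTF-8 keeps ASCII bytes unchanged; UTF-16 does not.
print("A".encode("utf-8") == "A".encode("ascii"))      # True
print("A".encode("utf-16-le") == "A".encode("ascii"))  # False
```

UTF-8 takes 1, 3, and 4 bytes for these three characters, UTF-16 takes 2, 2, and 4, and UTF-32 always takes 4.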

Code point: a number that can be assigned to a character.

UCD: Unicode Character Database

Block, script
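Part of the Unicode Character Database can be queried from Python's standard `unicodedata` module. This is a small sketch of looking up a character's name and general category; the `describe` helper is hypothetical.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code point, UCD name, general category) for one character."""
    return (ord(ch), unicodedata.name(ch), unicodedata.category(ch))

print(describe("A"))   # (65, 'LATIN CAPITAL LETTER A', 'Lu')
print(describe("中"))   # (20013, 'CJK UNIFIED IDEOGRAPH-4E2D', 'Lo')
```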

Big Endian and Little Endian

  

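Does UTF-8 have a byte-order (endianness) problem?

No. UTF-8 is defined as a sequence of single bytes, so there is no byte order to choose. UTF-16 uses 16-bit units, so it does have this problem and needs a BOM (or an explicit BE/LE label).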
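A short sketch makes the byte-order difference visible for one character. The variable names are illustrative only.

```python
import codecs

ch = "中"  # U+4E2D

utf16_be = ch.encode("utf-16-be")  # 16-bit unit written high byte first
utf16_le = ch.encode("utf-16-le")  # same unit written low byte first
utf8 = ch.encode("utf-8")          # a fixed sequence of single bytes

print(utf16_be.hex())  # 4e2d
print(utf16_le.hex())  # 2d4e
print(utf8.hex())      # e4b8ad

# Python's plain "utf-16" codec prepends a BOM so a reader can tell the order.
with_bom = ch.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
```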

Commands:

locale

locale -a

Questions

Can one article be encoded in two ways (for example, the first part in UTF-8 and the rest in GB18030)? If that causes no conflict, how is it decoded? (Or, put more clearly: can two encodings be mixed in one file?)
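One way to explore this question is to build such a mixed byte string and try to decode it. The following is only a sketch under the assumption that the boundary between the two parts is known; a plain text file carries no such boundary information.

```python
first = "前半".encode("utf-8")     # first part stored as UTF-8
second = "后半".encode("gb18030")  # second part stored as GB18030
mixed = first + second

# Decoding the whole thing as UTF-8 fails on the GB18030 bytes.
try:
    mixed.decode("utf-8")
    whole_as_utf8_ok = True
except UnicodeDecodeError:
    whole_as_utf8_ok = False

# Decoding works only if each part is decoded with its own encoding.
recovered = mixed[:len(first)].decode("utf-8") + mixed[len(first):].decode("gb18030")

print(whole_as_utf8_ok)  # False
print(recovered)         # 前半后半
```

Nothing stops the bytes from being concatenated, but a decoder that assumes a single encoding for the whole file cannot read the result correctly.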

Some tools:

View the actual encoding of a character, i.e. the corresponding binary data.
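Where no hex viewer is at hand, the raw bytes of a file can be inspected with a few lines of Python. This sketch writes a temporary file and dumps its bytes; the file content is invented for the example.

```python
import os
import tempfile

text = "中文 abc"

# Write the text as UTF-8 to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(text.encode("utf-8"))
    path = f.name

# Read the file back in binary mode to see exactly what is stored.
with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

dump = raw.hex(" ")  # space-separated hex, requires Python 3.8+
print(dump)  # e4 b8 ad e6 96 87 20 61 62 63
```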

Tools under Linux:

Scope of the locale settings, and exceptions (using a specified locale for a single application or path)

Character set: does en_US.UTF-8 support Chinese?

en_US.UTF-8 and zh_CN.UTF-8 both use UTF-8 as the character encoding. The en_US or zh_CN prefix only tells the system which language you use and in which region.

UTF-8 is an encoding of the Unicode character set.
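The structure of a locale name can be shown by splitting it into its parts. This is a minimal sketch; `split_locale` is a hypothetical helper that handles only the simple `language_TERRITORY.codeset` form.

```python
def split_locale(name: str) -> tuple:
    """Split 'en_US.UTF-8' into (language, territory, codeset)."""
    lang_region, _, codeset = name.partition(".")
    language, _, territory = lang_region.partition("_")
    return language, territory, codeset

for name in ("en_US.UTF-8", "zh_CN.UTF-8"):
    language, territory, codeset = split_locale(name)
    # The codeset decides which characters can be encoded, not the language.
    encoded = "中文".encode(codeset)
    print(name, language, territory, codeset, encoded.hex())
```

Both locale names yield the same codeset, so Chinese text encodes to the same bytes under either one.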

What do the decoding and encoding processes look like?

UTF-16 comes in big-endian and little-endian variants; what about UTF-8?

UTF-8

Given a sequence of bytes, how do you determine whether it is GBK, Big5, UTF-16, or UTF-8?

Notepad's approach is to store a marker at the very start of the TXT file. If Notepad opens a file and finds this marker, it treats the file as Unicode. The marker is the BOM: 0xFF 0xFE means UTF-16LE, 0xFE 0xFF means UTF-16BE, and 0xEF 0xBB 0xBF means UTF-8. If none of the three is present, the file is treated as ANSI and interpreted using the default language encoding of the operating system. (Links: https://www.zhihu.com/question/20650946/answer/15751688, http://blog.csdn.net/xiongxiao/article/details/3741731)

How is an input method implemented? Open Notepad and type some characters: is the encoding only decided when you save? If so, what is the term for the content in Notepad before saving?
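The BOM check described for Notepad can be sketched in Python as follows. This is an illustration of the three rules listed above, not Notepad's actual implementation; `sniff_bom` and its return labels are hypothetical.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess the encoding from a leading BOM, following the three rules above."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    # No BOM: fall back to the system's default (ANSI) encoding.
    # Note: this sketch does not look for UTF-32 BOMs.
    return "ansi"

print(sniff_bom(b"\xef\xbb\xbfhello"))      # utf-8
print(sniff_bom(b"\xff\xfeh\x00"))          # utf-16-le
print(sniff_bom(b"\xfe\xff\x00h"))          # utf-16-be
print(sniff_bom("中文".encode("gbk")))        # ansi
```

A file without a BOM may still be UTF-8, so a real detector would need further checks beyond this one.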

Applications

How do you design an encoding?

Input method design

Encoding conversions:
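For the case in the title (a text file written on Chinese Windows and opened on Fedora), conversion amounts to decoding with the source encoding and encoding again with the target one. A minimal sketch, assuming the source file is GBK and the target is UTF-8; `convert_bytes` and `convert_file` are hypothetical helpers.

```python
from pathlib import Path

def convert_bytes(data: bytes, src: str = "gbk", dst: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(src).encode(dst)

def convert_file(src_path: str, dst_path: str, src: str = "gbk", dst: str = "utf-8") -> None:
    """Convert a whole file; raises UnicodeDecodeError if src is the wrong guess."""
    Path(dst_path).write_bytes(convert_bytes(Path(src_path).read_bytes(), src, dst))

windows_bytes = "中文文件".encode("gbk")  # what a GBK Windows system would store
linux_bytes = convert_bytes(windows_bytes)
print(linux_bytes.decode("utf-8"))  # 中文文件
```

The conversion is only correct if the source encoding is guessed correctly, which is why the encoding of the original file has to be known or detected first.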

  

============================================================================================================

The following models explain the relationships among a series of related concepts, such as character sets, and help in understanding them:

Four-layer model, five-layer model

(In fact, most code pages do not need the complete four-layer model. GB18030, for example, is defined in terms of bytes: it maps byte sequences directly to characters, skipping the second layer and not needing the fourth.)

"The complete model needed to describe character sets and encodings"

A character set encoding involves at least a set of characters, and a system for their encoded representation in the computer.

A more complete model was actually necessary, involving four different levels of representation: the abstract character repertoire, the coded character set, the character encoding form, and the character encoding scheme.

The four layers:

1. Abstract character repertoire: the range of characters to be encoded.

The level of abstract characters.

2. Coded character set: gives each character a numeric representation.

The character's number in the mathematical sense.

For example, a Unicode code point: the number a character has before it is converted to any specific stored content.

3. Character encoding form: represents characters with basic data types (code units).

The programmer's perspective: 8-bit, 16-bit, or 32-bit units.

Independent of byte order in actual storage.

4. Character encoding scheme: represents characters in a byte stream.

The actual binary representation: the bytes actually stored in the computer (BE/LE).

UTF-16, UTF-32, and similar encodings need this layer because of byte order.
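The four layers can be followed for a single character in UTF-16. This is a sketch for illustration; `utf16_code_units` is a hypothetical helper written out by hand to show layer 3, and it does not validate its input.

```python
import struct

ch = "😀"               # layer 1: an abstract character
code_point = ord(ch)   # layer 2: its number in the coded character set

def utf16_code_units(cp: int) -> list:
    """Layer 3: turn a code point into 16-bit code units (no byte order yet)."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]  # surrogate pair

units = utf16_code_units(code_point)

# Layer 4: serialize the code units as bytes, choosing a byte order.
be_bytes = struct.pack(f">{len(units)}H", *units)
le_bytes = struct.pack(f"<{len(units)}H", *units)

print(hex(code_point))          # 0x1f600
print([hex(u) for u in units])  # ['0xd83d', '0xde00']
print(be_bytes.hex())           # d83dde00
print(le_bytes.hex())           # 3dd800de
```

The code units are the same in both cases; only the final serialization step differs between big-endian and little-endian.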

(

Character Set Encoding Basics

Http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=iws-chapter03

Talking about text encoding and Unicode

Http://www.fmddlmyy.cn/text16.html

A tutorial on character code issues

https://www.cs.tut.fi/~jkorpela/chars.html)

Related concepts from computer organization:

  
