ANSI, UTF-8, UCS-2 (UTF-16), and carriage return/line feed [repost]

Last Update:2018-12-06 Source: Internet

Author: User

Tags: control characters, UltraEdit


Address: http://blog.csdn.net/ab6326795/article/details/7901915

I recently ran into a data loading failure on a Linux platform that was caused by an invisible character (0x1D). I would like to take this opportunity to sort out what I know about character encoding.

Carriage return/line feed:

========================

As the names imply, carriage return and line feed are two different control characters:

-Carriage return: \r, ASCII code 13 (0x0D). It moves the cursor to the start of the current line.

-Line feed: \n, ASCII code 10 (0x0A). It moves the cursor to the next line.

Different operating systems use different control characters by default to mark the end of a line:

-Windows: \r\n

-Linux/Unix: \n

-Mac: \r on classic Mac OS; Mac OS X and later use \n

As a result, a standard Windows text file shows an extra ^M control character at the end of each line on other platforms, while a file from Linux/Unix may appear as a single line on Windows. Linux provides the dos2unix/unix2dos commands to solve this line-ending problem.
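When dos2unix/unix2dos are not available, the same conversion can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for those tools; the function names to_unix and to_dos are made up for this example, and the data is handled as raw bytes so that no encoding has to be assumed.

```python
def to_unix(data: bytes) -> bytes:
    """Convert CRLF (Windows) and lone CR (classic Mac) line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert any line ending to CRLF; normalise to LF first to avoid doubling the CR."""
    return to_unix(data).replace(b"\n", b"\r\n")


windows_text = b"first\r\nsecond\r\n"
print(to_unix(windows_text))
print(to_dos(b"first\nsecond\n") == windows_text)
```

Working on bytes like this is safe for ASCII, ANSI, and UTF-8 files, but not for UTF-16 files, where each control character occupies two bytes.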

Character encoding:

========================

Text is human language, but a computer only understands 0 and 1, so storing characters in a computer requires a character encoding. Many articles and blogs already describe the various encodings, so I will not repeat them here. The key points as I understand them:

ASCII and ANSI

Early computers had limited storage. Characters were stored in one byte using ASCII, a 7-bit code (the highest bit is 0) that defines only 128 characters; even with extended ASCII (the highest bit is 1), a single byte can represent only 256 characters. Countries and regions therefore developed their own ASCII-compatible encodings, the various "ANSI" code pages, to represent the text of their languages. As a result, the same byte value can stand for different characters in different character sets (collation/charset/code page). This is why ANSI text must be bound to a specific code page to identify the one correct character. If the wrong code page is used, garbled characters appear (characters with codes below 128 are not affected).
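The code page dependence is easy to observe in Python. The sketch below decodes the same byte with two Windows code pages; cp1252 (Western European) and cp1251 (Cyrillic) are chosen here only as examples and are not taken from the original article.

```python
raw = bytes([0xE4])  # a single byte with the highest bit set

western = raw.decode("cp1252")   # Western European code page
cyrillic = raw.decode("cp1251")  # Cyrillic code page

# The same byte maps to two different Unicode code points.
print(f"cp1252: U+{ord(western):04X}, cp1251: U+{ord(cyrillic):04X}")

# Bytes below 128 decode identically in both code pages.
ascii_bytes = b"hello"
same = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1251")
print(same)
```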

Unicode

Also known as "Wanguo code" (universal code). Every character in every language is identified by a unique Unicode code point. Unicode can be stored using the following encodings:

-UTF-8: encodes in units of one byte (8 bits) with variable length. The original design allowed 1 to 6 bytes; the current standard (RFC 3629) limits a character to 1 to 4 bytes. It is fully compatible with ASCII, that is, the characters of standard ASCII are represented by the same byte values in UTF-8. UTF-8 uses the leading bits of the first byte to indicate the length of the sequence: 0 means one byte, 110 means two bytes, 1110 means three bytes, and 11110 means four bytes. Every subsequent byte starts with 10.

-UTF-16 (UCS-2): stores characters in 2-byte units (UCS-2 is fixed at 2 bytes per character; UTF-16 extends it with 4-byte surrogate pairs for characters beyond U+FFFF), so byte order matters: either the high byte comes first (big endian) or the low byte comes first (little endian). The native order depends on how the CPU handles bytes; little endian is the more common one, for example on x86 and Windows. For example, the code point 0x6C49 is stored as 6C 49 in big endian and as 49 6C in little endian.
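The two encodings above can be checked with a short Python sketch. The helper utf8_sequence_length is a hypothetical name introduced for this illustration; it reads only the leading bits of the first byte, as described above, and does not validate the rest of the sequence.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the UTF-8 sequence length implied by the leading bits of the first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


han = "\u6c49"  # the character with code point U+6C49 used in the example above

utf8 = han.encode("utf-8")
print(utf8.hex(), utf8_sequence_length(utf8[0]))  # e6b189 3
print(han.encode("utf-16-be").hex())              # 6c49
print(han.encode("utf-16-le").hex())              # 496c

# ASCII characters keep the same single byte in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```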

How to tell files with different encodings apart:

Open a text file with UltraEdit or a similar editor, switch to hex mode (Ctrl+H), and look at the start of the file:

-If there is no special header and the first bytes are already text content, it is an ANSI file (or a UTF-8 file saved without a BOM).

-Unicode files start with a BOM (byte order mark). The BOM is FE FF (UTF-16 big endian), FF FE (UTF-16 little endian), or EF BB BF (UTF-8).
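The BOM check described above can also be done directly on the raw bytes, which avoids any conversion an editor might apply. The following is a minimal sketch; sniff_bom is a hypothetical helper name, and it covers only the three BOMs listed above (UTF-32 is deliberately left out).

```python
import codecs

# The three byte order marks discussed above, with a label for each.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 big endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little endian"),
]


def sniff_bom(head: bytes) -> str:
    """Name the encoding indicated by a BOM at the start of `head`, if any."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return "no BOM"


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"hello"))
```

For a real file, the first few bytes are enough, for example `sniff_bom(open(path, "rb").read(4))`. A result of "no BOM" does not identify the encoding by itself: the file may be ANSI, plain ASCII, or UTF-8 without a BOM.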

Note that when a UTF-8 file saved with Notepad on Windows is viewed in UltraEdit, it may also appear to start with FF FE, because UltraEdit may automatically convert the UTF-8 file to UTF-16 for editing. Opening a file in UltraEdit and looking at the BOM is therefore not a safe way to determine the file format. The status bar at the bottom of UltraEdit shows the actual encoding of the opened file rather than the encoding currently used for editing: DOS or UNIX for a plain ASCII file, U8-DOS or U8-UNIX for a file containing UTF-8 encoded characters, and U-DOS or U-UNIX for a UTF-16 encoded file.

Because UTF-8 encodes ASCII characters exactly as ASCII does, if we delete all non-ASCII characters from a U8-DOS file, save it, and open it again, UltraEdit displays it as DOS (ASCII).

If we do not want UltraEdit to automatically convert a UTF-8 file to UTF-16 for editing when opening it, we can change the configuration, for example by making sure that automatic detection of UTF-8 files is not selected.

Note that with this option cleared, opening a file that contains UTF-8 encoded characters in UltraEdit shows garbled characters.

UltraEdit's File > Conversions menu converts between multiple encoding formats, and this affects the encoding of the saved file. After a conversion, the status bar changes accordingly. Some options are followed by "Unicode editing" or "ASCII editing"; this specifies the encoding used for editing and does not affect the encoding used when the file is saved.

The tool WinHex can also be used to view the hexadecimal content of a file.
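If no hex editor is at hand, a comparable view can be produced with a few lines of Python. This is a rough sketch rather than a substitute for WinHex; hex_dump is a hypothetical helper, and the layout (offset, hex bytes, printable ASCII) is just one common convention.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of offset, hex values, and printable ASCII."""
    pad = width * 3  # column width reserved for the hex part
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return "\n".join(rows)


# One ASCII letter, one three-byte UTF-8 character, and a Windows line ending.
print(hex_dump("a\u6c49\r\n".encode("utf-8")))
```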

Unicode and SQL Server:

======================

UTF-8 is currently the most widely used Unicode encoding, but SQL Server uses UTF-16 as its encoding for Unicode characters, that is, nchar/ntext/nvarchar store double-byte Unicode characters. A string prefixed with N (for example, N'hello') is a double-byte Unicode string. A UTF-8 string that contains only ASCII characters can be handled as an ANSI string; otherwise, the UTF-8 string must be converted to UTF-16 before it is handed to SQL Server. This behavior can cause two common problems:

-Many programs (including ASP.NET) use UTF-8 when writing files. If these files need to be processed correctly by SQL Server, they must be transcoded to UTF-16.

-SQL Server 2005 and later support native XML processing, but an XML document that is itself UTF-8 encoded cannot be handled directly. The following is an example:

Running the following code in SSMS returns error 9420:

DECLARE @inf xml
SET @inf = '<?xml version="1.0" encoding="UTF-8"?>
<root>
<names>
<name>Zhang San</name>
</names>
<names>
<name>Li Si</name>
</names>
</root>
'

SELECT x.value('name[1]', 'varchar(10)') AS name
FROM @inf.nodes('/root/names') AS t(x)

Forcing the value assigned to @inf to Unicode with the N prefix does not help on its own, because the XML document itself still declares UTF-8 encoding. The solution is to declare UTF-16 encoding in the XML document and also use the N prefix:

DECLARE @inf xml
SET @inf = N'<?xml version="1.0" encoding="UTF-16"?>
<root>
<names>
<name>Zhang San</name>
</names>
<names>
<name>Li Si</name>
</names>
</root>
'

SELECT x.value('name[1]', 'varchar(10)') AS name
FROM @inf.nodes('/root/names') AS t(x)
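For the first problem mentioned in this section (files written as UTF-8 that must be transcoded before SQL Server processes them), the transcoding step itself is small. The sketch below shows one way to do it in Python; utf8_to_utf16le is a hypothetical helper, and writing little endian with a leading BOM is an assumption about what the loading tool expects, not something stated in the article.

```python
import codecs


def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 little endian with a leading BOM."""
    text = data.decode("utf-8-sig")  # utf-8-sig drops a UTF-8 BOM if one is present
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


source = "name\n\u6c49\n".encode("utf-8")
converted = utf8_to_utf16le(source)
print(converted.hex())
```

For a file, read it in binary mode, pass the bytes through the function, and write the result in binary mode so that Python does not translate line endings along the way.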

About invisible characters 0x1d

================================

This character is the group separator; as the name suggests, its role is grouping. How this control character is used depends on the application. In MSSQL and SSIS, it has no special meaning; Netezza, however, treats it as a separator, which leads to splitting errors during data import. Unfortunately, the batch size cannot be controlled when importing through a Netezza external table. For control characters, you can add the ctrlchars option when referencing the external data source; if it is set to true, control characters are accepted as field content.
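Because characters such as 0x1D are invisible in most editors, it can help to scan a data file for them before loading it. The following is a minimal sketch of such a check; find_control_bytes and the sample record are made up for illustration, and tab, carriage return, and line feed are allowed by default on the assumption that they are legitimate in the file.

```python
def find_control_bytes(data: bytes, allowed: bytes = b"\t\r\n"):
    """Yield (offset, byte value) for each ASCII control byte not in `allowed`."""
    for offset, value in enumerate(data):
        if (value < 0x20 or value == 0x7F) and value not in allowed:
            yield offset, value


# Hypothetical pipe-delimited record with a group separator hidden in a field.
sample = b"id|name\n1|Zhang\x1dSan\n"
for offset, value in find_control_bytes(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

Scanning raw bytes this way works for ASCII, ANSI, and UTF-8 data, where control characters occupy a single byte; a UTF-16 file would have to be decoded first.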

Reprinted from: http://blog.sina.com.cn/s/blog_719ba7710100znys.html

This article is an English version of an article originally written in Chinese on aliyun.com.
