Delphi and Character Encoding (in Practice) (MultiByteToWideChar returns the length of the converted wide string)

Source: Internet
Author: User

Objectives of this article:

    • Understanding Delphi's string types
    • Detecting and converting character encodings
    • Converting between Simplified and Traditional Chinese

0. Introduction

Having read ".NET and Character Encoding (Theory)", we know that characters are the smallest unit of natural-language text and that three families of encodings are used to store and transmit them: ASCII, DBCS, and Unicode. Common DBCS encodings are GB2312, GBK, and BIG5, while UTF-8, UTF-16, and UTF-32 are the most commonly used Unicode encodings.

1. String types

There are two string types in Delphi: AnsiString and WideString. AnsiString is called a long string; WideString is called a wide string (a Unicode string) and is compatible with COM strings (BSTR). Both are allocated on the heap, and their memory is allocated and released automatically. Currently, on the Win32 platform, the string type is equivalent to AnsiString (this holds for Delphi 2007 and earlier; from Delphi 2009 onward, string is UnicodeString). An AnsiString can also be understood as a byte sequence, so it can hold single-byte (SBCS), multibyte (MBCS/DBCS), and UTF-8 encoded text, whereas WideString uses UTF-16 and fully supports Unicode.

To illustrate the difference between characters and bytes, let's look at an example that counts both.

Assuming the current system code page is CP936 (GBK 1.0):

procedure TestAnsiLength;
var
  str: string;
begin
  str := '汉字abc';
  Assert(Length(str) = 7);     // 7 bytes
  Assert(AnsiLength(str) = 5); // 5 characters
end;


Here are two implementations of AnsiLength:

uses SysUtils;

function AnsiLength(const s: string): Integer;
var
  p, q: PChar;
begin
  Result := 0;
  p := PChar(s);
  q := p + Length(s);
  while p < q do
  begin
    Inc(Result);
    if p^ in LeadBytes then // lead-byte set of the current system code page
      Inc(p, 2)
    else
      Inc(p);
  end;
end;


uses Windows;

function AnsiLength(const s: string): Integer;
begin
  // With a source length of -1 the returned count includes the terminator
  Result := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), -1, nil, 0);
  if Result > 0 then Dec(Result); // drop the terminator
end;


If you understood the encoding background covered in ".NET and Character Encoding (Theory)", the examples above are straightforward.
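Because MultiByteToWideChar returns the length of the converted wide string, the same call also supports the usual two-step conversion: ask for the required length first, then convert into a buffer of that size. The following is a minimal sketch, not code from the original article. The helper name AnsiToWide and the sample bytes are assumptions made for illustration, and the sketch presumes that code page 936 is installed on the machine.

```pascal
program AnsiToWideDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

const
  // '汉字abc' encoded in CP936: two bytes per Han character, one per ASCII letter
  GbkBytes: array[0..6] of Byte = ($BA, $BA, $D7, $D6, $61, $62, $63);

// Hypothetical helper: converts bytes in the given code page to UTF-16.
function AnsiToWide(const S: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then Exit;
  // First call: no output buffer, so the return value is the required length
  // in WideChars (no terminator, because an explicit source length is passed).
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then Exit; // conversion failed, e.g. code page not installed
  SetLength(Result, Len);
  // Second call: convert into the buffer that was just allocated.
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;

var
  Gbk: AnsiString;
  W: WideString;
begin
  SetString(Gbk, PAnsiChar(@GbkBytes[0]), Length(GbkBytes));
  W := AnsiToWide(Gbk, 936);
  Writeln(Length(Gbk)); // number of bytes
  Writeln(Length(W));   // number of UTF-16 code units
end.
```

Note that the result counts UTF-16 code units, so a character outside the Basic Multilingual Plane would count as two.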

2. Detection and conversion of character encoding

"A craftsman who wants to do good work must first sharpen his tools", so let me recommend some tools first:

    • JCL (JEDI Code Library)
    • Virtual TreeView
    • Tnt Controls or TMS Unicode Component Pack

Define the basic types:

type
  { Encoding type }
  TEncodingType = (
    etAnsi,      // ANSI format (SBCS/DBCS)
    etUTF8,      // UTF-8 format
    etUnicode,   // UTF-16 format, little endian
    etUnicodeBE, // UTF-16 format, big endian
    etUTF32,     // UTF-32 format, little endian
    etUTF32BE    // UTF-32 format, big endian
  );

  { Byte order mark }
  TByteOrderMask = array of Byte;


To obtain the BOM for a given encoding type (the CopyBytes helper it calls is not reproduced in this copy of the article):

function TryGetBOM(const EncodingType: TEncodingType; var BOM: TByteOrderMask): Boolean;
begin
  Result := True;
  case EncodingType of
    etUTF8:      CopyBytes(BOM_UTF8, BOM);
    etUnicode:   CopyBytes(BOM_UTF16_LSB, BOM);
    etUnicodeBE: CopyBytes(BOM_UTF16_MSB, BOM);
    etUTF32:     CopyBytes(BOM_UTF32_LSB, BOM);
    etUTF32BE:   CopyBytes(BOM_UTF32_MSB, BOM);
  else
    begin
      // ANSI text has no BOM
      SetLength(BOM, 0);
      Result := False;
    end;
  end;
end;
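TryGetBOM depends on a CopyBytes helper and on BOM constants whose definitions do not appear in the text. The following is a minimal sketch of what they might look like. The byte values are the standard BOM sequences, but the local declarations and the body of CopyBytes are assumptions made for illustration, not the author's code.

```pascal
program BomCopyDemo;
{$APPTYPE CONSOLE}

type
  TByteOrderMask = array of Byte;

const
  // Standard BOM byte sequences, named to match the calls in TryGetBOM.
  // Declared locally here as an assumption about where they come from.
  BOM_UTF8:      array[0..2] of Byte = ($EF, $BB, $BF);
  BOM_UTF16_LSB: array[0..1] of Byte = ($FF, $FE);
  BOM_UTF16_MSB: array[0..1] of Byte = ($FE, $FF);
  BOM_UTF32_LSB: array[0..3] of Byte = ($FF, $FE, $00, $00);
  BOM_UTF32_MSB: array[0..3] of Byte = ($00, $00, $FE, $FF);

// Hypothetical CopyBytes: copies a fixed byte array into a dynamic array.
procedure CopyBytes(const Source: array of Byte; var Dest: TByteOrderMask);
begin
  SetLength(Dest, Length(Source));
  if Length(Source) > 0 then
    Move(Source[0], Dest[0], Length(Source));
end;

var
  BOM: TByteOrderMask;
begin
  CopyBytes(BOM_UTF8, BOM);
  Writeln(Length(BOM)); // 3
  CopyBytes(BOM_UTF32_LSB, BOM);
  Writeln(Length(BOM)); // 4
end.
```

Taking the source as an open array parameter lets one procedure accept BOM constants of different lengths.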


Detect the character encoding type (the CompareBOM helper it calls is not reproduced in this copy of the article):

function DetectEncoding(Buffer: PAnsiChar): TEncodingType; overload;
begin
  // Test the UTF-32 BOMs before the UTF-16 ones: the UTF-32 LE BOM
  // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE) and would
  // otherwise never be detected.
  if CompareBOM(Buffer, BOM_UTF8) then
    Result := etUTF8
  else if CompareBOM(Buffer, BOM_UTF32_LSB) then
    Result := etUTF32
  else if CompareBOM(Buffer, BOM_UTF32_MSB) then
    Result := etUTF32BE
  else if CompareBOM(Buffer, BOM_UTF16_LSB) then
    Result := etUnicode
  else if CompareBOM(Buffer, BOM_UTF16_MSB) then
    Result := etUnicodeBE
  else
    Result := etAnsi;
end;

function DetectEncoding(Stream: TStream): TEncodingType; overload;
var
  Pos: Int64;
  Bytes: TByteOrderMask;
begin
  SetLength(Bytes, 6);
  ZeroMemory(@Bytes[0], Length(Bytes));
  Pos := Stream.Position; // remember the caller's position
  Stream.Position := 0;
  // Use Length, not SizeOf: SizeOf of a dynamic array is the pointer size
  Stream.Read(Bytes[0], Length(Bytes));
  Stream.Position := Pos; // restore it
  Result := DetectEncoding(PAnsiChar(@Bytes[0]));
end;
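DetectEncoding depends on a CompareBOM helper whose definition does not appear in the text. The following is a minimal sketch of one possible implementation. The signature and body are assumptions made for illustration, and the two BOM constants are redeclared locally so the example stands on its own.

```pascal
program BomCompareDemo;
{$APPTYPE CONSOLE}

const
  BOM_UTF16_LSB: array[0..1] of Byte = ($FF, $FE);
  BOM_UTF32_LSB: array[0..3] of Byte = ($FF, $FE, $00, $00);
  // First bytes of a UTF-32 LE file: the BOM followed by 'A' (U+00000041)
  Sample: array[0..7] of Byte = ($FF, $FE, $00, $00, $41, $00, $00, $00);

// Hypothetical CompareBOM: True when Buffer starts with the bytes in BOM.
// The caller must guarantee that Buffer holds at least Length(BOM) bytes.
function CompareBOM(Buffer: PAnsiChar; const BOM: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := False;
  if Buffer = nil then Exit;
  for I := 0 to High(BOM) do
    if Byte(Buffer[I]) <> BOM[I] then Exit;
  Result := True;
end;

begin
  Writeln(CompareBOM(PAnsiChar(@Sample[0]), BOM_UTF32_LSB)); // TRUE
  Writeln(CompareBOM(PAnsiChar(@Sample[0]), BOM_UTF16_LSB)); // TRUE as well
end.
```

Both comparisons succeed for the same buffer because the UTF-32 LE BOM starts with the UTF-16 LE BOM, which is why a detector should test the longer BOM first.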


The following procedure shows how to save text in different encodings:

procedure WriteText(Stream: TStream; const Buffer: WideString;
  const EncodingType: TEncodingType; WithBOM: Boolean = False);
var
  S: AnsiString;
  W: WideString;
  P: PAnsiChar;
  BOM: TByteOrderMask;
  Bytes: Integer;
begin
  P := nil;
  Bytes := Length(Buffer) * SizeOf(WideChar);
  if WithBOM and TryGetBOM(EncodingType, BOM) then
    Stream.Write(BOM[0], Length(BOM));
  case EncodingType of
    etAnsi:
      begin
        // Convert to the system ANSI code page; reinterpreting the
        // UTF-16 buffer as PAnsiChar would write the wrong bytes
        S := AnsiString(Buffer);
        P := PAnsiChar(S);
        Bytes := Length(S);
      end;
    etUTF8:
      begin
        S := UTF8Encode(Buffer);
        P := PAnsiChar(S);
        Bytes := Length(S);
      end;
    etUnicode:
      P := PAnsiChar(PWideChar(Buffer));
    etUnicodeBE:
      begin
        // StrSwapByteOrder (from JCL) swaps in place, so work on a copy
        // and leave the caller's const string untouched
        W := Buffer;
        StrSwapByteOrder(PWideChar(W));
        P := PAnsiChar(PWideChar(W));
      end;
  else // UTF-32: left to the reader to implement
    raise Exception.Create('Not implemented.');
  end;
  Stream.Write(P^, Bytes);
end;


Note that encapsulating these routines in a class would make the structure clearer.
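WriteText leaves the UTF-32 cases to the reader. The core step is turning UTF-16 code units into 32-bit code points, which means combining surrogate pairs. The following is a minimal sketch of that step only. The names TUCS4Array and WideToUTF32 are hypothetical, and unpaired surrogates are simply passed through unchanged, which a stricter implementation might reject.

```pascal
program Utf32Demo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TUCS4Array = array of LongWord; // one element per Unicode code point

// Hypothetical helper: decodes UTF-16 code units into code points.
function WideToUTF32(const W: WideString): TUCS4Array;
var
  I, N: Integer;
  Hi, Lo: Word;
begin
  SetLength(Result, Length(W)); // upper bound; trimmed at the end
  N := 0;
  I := 1;
  while I <= Length(W) do
  begin
    Hi := Ord(W[I]);
    if (Hi >= $D800) and (Hi <= $DBFF) and (I < Length(W)) then
    begin
      Lo := Ord(W[I + 1]);
      if (Lo >= $DC00) and (Lo <= $DFFF) then
      begin
        // Combine a high/low surrogate pair into one code point
        Result[N] := $10000 + ((LongWord(Hi) - $D800) shl 10)
          + (LongWord(Lo) - $DC00);
        Inc(N);
        Inc(I, 2);
        Continue;
      end;
    end;
    Result[N] := Hi; // BMP character (or unpaired surrogate, kept as-is)
    Inc(N);
    Inc(I);
  end;
  SetLength(Result, N);
end;

var
  W: WideString;
  CP: TUCS4Array;
begin
  SetLength(W, 3);
  W[1] := 'A';
  W[2] := WideChar($D83D); // high surrogate of U+1F600
  W[3] := WideChar($DE00); // low surrogate of U+1F600
  CP := WideToUTF32(W);
  Writeln(Length(CP));                   // 2
  Writeln(IntToHex(Integer(CP[1]), 5));  // 1F600
end.
```

On a little-endian machine the array's memory layout is already UTF-32 LE, so its bytes could be written to the stream directly; the big-endian variant would additionally need each element's four bytes reversed.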

3. Simplified and Traditional Chinese conversion

Conversion between Simplified and Traditional Chinese covers two cases: Simplified to Traditional, and Traditional to Simplified. The principle is to find the corresponding character by looking it up in a character mapping table. A unit available on the internet, "Using code tables for internal code conversion and Simplified/Traditional conversion", is written on this principle, so it is not covered in detail here.
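The table-lookup principle described above can be sketched as follows. This is a toy illustration, not the unit mentioned in the text: the type TCharPair, the table Pairs, and the function SimpToTrad are hypothetical names, and the four-entry table stands in for a real mapping with thousands of entries.

```pascal
program SimpTradDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TCharPair = record
    Simp, Trad: Word; // UTF-16 code units
  end;

const
  // Tiny illustrative table; a real one would hold thousands of pairs
  Pairs: array[0..3] of TCharPair = (
    (Simp: $4F53; Trad: $9AD4), // 体 -> 體
    (Simp: $56FD; Trad: $570B), // 国 -> 國
    (Simp: $6C49; Trad: $6F22), // 汉 -> 漢
    (Simp: $7B80; Trad: $7C21)  // 简 -> 簡
  );

// Replaces every character found in the table; others are left unchanged.
function SimpToTrad(const W: WideString): WideString;
var
  I, J: Integer;
begin
  Result := W;
  for I := 1 to Length(Result) do
    for J := Low(Pairs) to High(Pairs) do
      if Ord(Result[I]) = Pairs[J].Simp then
      begin
        Result[I] := WideChar(Pairs[J].Trad);
        Break;
      end;
end;

var
  W, R: WideString;
begin
  SetLength(W, 2);
  W[1] := WideChar($6C49); // 汉
  W[2] := WideChar($5B57); // 字 (same in both scripts, not in the table)
  R := SimpToTrad(W);
  Writeln(IntToHex(Ord(R[1]), 4)); // 6F22
  Writeln(IntToHex(Ord(R[2]), 4)); // 5B57
end.
```

Traditional to Simplified works the same way with the lookup direction reversed. A one-to-one table is a simplification, since some Simplified characters correspond to more than one Traditional character depending on context.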

{TODO: Encapsulate the character encoding module using OOP and provide it for download}
{TODO: Study Simplified/Traditional Chinese conversion}

Reference articles

      • Determining the actual length of a DBCS string

http://www.cnblogs.com/baoquan/articles/1027371.html
