Delphi and Character Encoding (Practice) (MultiByteToWideChar returns the length of the converted wide string)

Last updated: 2016-07-05. Source: Internet

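Goals of this article:

Understand Delphi's string types
Detect and convert character encodings
Convert between Simplified and Traditional Chinese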

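0. Introduction

After reading ".Net and Character Encoding (Theory)", we know that a character is the smallest unit of a natural language, and that three families of encodings are used to store and transmit characters: ASCII, DBCS, and Unicode. Common DBCS encodings are GB2312, GBK, and BIG5, while UTF-8, UTF-16, and UTF-32 are the most widely used Unicode encodings.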

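1. String types

Delphi has two heap-allocated string types: AnsiString and WideString. AnsiString is known as the "long string"; WideString is the "wide string" (Unicode string) and is compatible with the COM string type (BSTR). Both are allocated on the heap, and the program manages their allocation and release automatically. On Win32, in Delphi 2007 and earlier, the string type is equivalent to AnsiString (from Delphi 2009 onward string maps to UnicodeString instead). An AnsiString can also be treated as a byte sequence: it can hold single-byte (SBCS), multi-byte (MBCS/DBCS), and UTF-8 encoded text. WideString uses UTF-16 and can therefore represent any Unicode character.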
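
As a rough illustration of "byte sequence" versus UTF-16, the sketch below measures the same text both ways. Utf16Units and Utf8Bytes are names made up for this example, not RTL functions; the characters are meant to be written as #$xxxx escapes so the result does not depend on the encoding of the source file.

```pascal
// UTF8Encode lives in the System unit, so no uses clause is needed.

// Hypothetical helper: number of 16-bit code units in a WideString.
function Utf16Units(const w: WideString): Integer;
begin
  Result := Length(w);               // Length counts WideChar elements
end;

// Hypothetical helper: number of bytes the same text occupies as UTF-8.
function Utf8Bytes(const w: WideString): Integer;
var
  u: UTF8String;
begin
  u := UTF8Encode(w);                // convert UTF-16 to UTF-8
  Result := Length(u);               // Length counts bytes here
end;
```

For the text 漢字ABC (U+6F22, U+5B57, then three ASCII letters), written as #$6F22#$5B57'ABC', Utf16Units gives 5 and Utf8Bytes gives 9, because each of the two Han characters takes three bytes in UTF-8.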

To show the difference between characters and bytes, here is an example that counts characters:

// Assumes the current system code page is CP936 (GBK 1.0)
procedure TestAnsiLength;
var
  str: AnsiString;
begin
  str := '漢字ABC';
  Assert(Length(str) = 7);      // 7 bytes
  Assert(AnsiLength(str) = 5);  // 5 characters
end;

Here are two alternative implementations of AnsiLength:

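// uses SysUtils;
function AnsiLength(const s: AnsiString): Integer;
var
  p, q: PAnsiChar;
begin
  Result := 0;
  p := PAnsiChar(s);
  q := p + Length(s);
  while p < q do
  begin
    Inc(Result);
    if p^ in LeadBytes then // LeadBytes: set of lead bytes of the current system code page
      Inc(p, 2)             // a lead byte and its trail byte form one character
    else
      Inc(p);
  end;
end;

// uses Windows;
function AnsiLength(const s: AnsiString): Integer;
begin
  // With an output size of 0 nothing is converted; the call only returns the
  // number of wide characters required. Because the input length is -1, that
  // count includes the terminating null.
  Result := MultiByteToWideChar(CP_ACP, 0, PAnsiChar(s), -1, nil, 0);
  if Result > 0 then Dec(Result); // drop the terminator
end;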
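
Both versions above depend on the current system code page. As a variation, the sketch below takes the code page as a parameter and uses the Windows function IsDBCSLeadByteEx, so the result should not change with the system locale as long as the code page is installed. DbcsCharCount and SampleGbk are hypothetical names introduced for this example.

```pascal
// uses Windows;
const
  // "汉字ABC" encoded in code page 936: two double-byte characters
  // (BA BA, D7 D6) followed by three ASCII bytes.
  SampleGbk: array[0..6] of Byte = ($BA, $BA, $D7, $D6, $41, $42, $43);

// Hypothetical helper: counts the characters in size bytes of DBCS text
// encoded in the given code page.
function DbcsCharCount(codePage: Cardinal; data: PAnsiChar; size: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 0;
  while i < size do
  begin
    Inc(Result);
    if IsDBCSLeadByteEx(codePage, Byte(data[i])) then
      Inc(i, 2)   // skip the lead byte and its trail byte
    else
      Inc(i);     // single-byte character
  end;
end;
```

Called as DbcsCharCount(936, PAnsiChar(@SampleGbk[0]), Length(SampleGbk)), this counts 5 characters in 7 bytes.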

With the encoding background from ".Net and Character Encoding (Theory)", the examples above are straightforward.

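2. Detecting and converting character encodings

As the saying goes, "a craftsman who wants to do good work must first sharpen his tools", so here are some tools I recommend: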

JCL (JEDI Code Library)
Virtual TreeView
Tnt Controls or TMS Unicode Component Pack

Define the basic types:

  { Encoding types }
  TEncodingType = (
    etAnsi,       // ANSI   format (SBCS/DBCS)
    etUTF8,       // UTF-8  format
    etUnicode,    // UTF-16 format using little endian
    etUnicodeBE,  // UTF-16 format using big endian
    etUTF32,      // UTF-32 format using little endian
    etUTF32BE     // UTF-32 format using big endian
  );

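  { Byte order mark }
  TByteOrderMask = array of Byte;

Getting the BOM for each encoding type (the listing of the CopyBytes helper, which copies a BOM constant into the dynamic array, is collapsed in the source and not reproduced here):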

function TryGetBOM(const encodingType: TEncodingType; var bom: TByteOrderMask): Boolean;
begin
  Result := True;
  case encodingType of
    etUTF8:      CopyBytes(BOM_Utf8, bom);
    etUnicode:   CopyBytes(BOM_UTF16_LSB, bom);
    etUnicodeBE: CopyBytes(BOM_UTF16_MSB, bom);
    etUTF32:     CopyBytes(BOM_UTF32_LSB, bom);
    etUTF32BE:   CopyBytes(BOM_UTF32_MSB, bom);
    else
    begin
      SetLength(bom, 0);
      Result := False;
    end;
  end;
end;
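
Since the listing for CopyBytes is missing, here is a minimal sketch of what such a helper might look like. This is a guess at its behavior based on how TryGetBOM calls it, not the author's code; SampleBomUtf8 and SampleUtf8Bom are illustrative names declared locally so the sketch does not depend on where the BOM_* constants come from.

```pascal
type
  TByteOrderMask = array of Byte;

const
  // UTF-8 byte order mark, declared here only for the example.
  SampleBomUtf8: array[0..2] of Byte = ($EF, $BB, $BF);

// Assumed behavior: resize dest and copy every byte of source into it.
procedure CopyBytes(const source: array of Byte; var dest: TByteOrderMask);
begin
  SetLength(dest, Length(source));
  if Length(source) > 0 then
    Move(source[0], dest[0], Length(source));
end;

// Small wrapper that returns the copied UTF-8 mark.
function SampleUtf8Bom: TByteOrderMask;
begin
  CopyBytes(SampleBomUtf8, Result);
end;
```

An open array parameter accepts fixed-size arrays of any length, which is why one helper can serve the 2-, 3-, and 4-byte marks alike.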

Detecting the encoding type (the listing of the CompareBOM helper is collapsed in the source and not reproduced here):

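function DetectEncoding(buffer: PAnsiChar): TEncodingType; overload;
begin
  // buffer must point to at least 4 readable bytes.
  // The UTF-32 marks are tested before the UTF-16 ones because the UTF-16 LE
  // mark (FF FE) is a prefix of the UTF-32 LE mark (FF FE 00 00).
  if CompareBOM(buffer, BOM_UTF8) then
    Result := etUTF8
  else if CompareBOM(buffer, BOM_UTF32_LSB) then
    Result := etUTF32
  else if CompareBOM(buffer, BOM_UTF32_MSB) then
    Result := etUTF32BE
  else if CompareBOM(buffer, BOM_UTF16_LSB) then
    Result := etUnicode
  else if CompareBOM(buffer, BOM_UTF16_MSB) then
    Result := etUnicodeBE
  else
    Result := etAnsi;  // no BOM found: assume ANSI
end;

function DetectEncoding(stream: TStream): TEncodingType; overload;
var
  pos: Int64;
  bytes: TByteOrderMask;
begin
  SetLength(bytes, 6);
  // Pad with $01, a value that occurs in no BOM, so that a stream shorter
  // than a mark cannot produce a false match.
  FillChar(bytes[0], Length(bytes), $01);
  pos := stream.Position;                // remember the caller's position
  stream.Position := 0;
  stream.Read(bytes[0], Length(bytes));  // SizeOf(bytes) would be the pointer size, not the data size
  stream.Position := pos;                // restore the caller's position
  Result := DetectEncoding(PAnsiChar(@bytes[0]));
end;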
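
The CompareBOM listing is likewise missing from the text. A minimal sketch of a compatible helper follows; it is an assumption about what the function does (a byte-by-byte prefix comparison), and SampleHeader is a made-up buffer used only to exercise it.

```pascal
const
  // A buffer that starts with the UTF-32 LE mark FF FE 00 00.
  SampleHeader: array[0..5] of Byte = ($FF, $FE, $00, $00, $41, $00);

// Assumed behavior: True when buffer starts with every byte of bom.
// The caller must guarantee that buffer holds at least Length(bom) bytes.
function CompareBOM(buffer: PAnsiChar; const bom: array of Byte): Boolean;
var
  i: Integer;
begin
  Result := Length(bom) > 0;
  for i := 0 to High(bom) do
    if Byte(buffer[i]) <> bom[i] then
    begin
      Result := False;
      Exit;
    end;
end;
```

Note that SampleHeader matches both [$FF, $FE] and [$FF, $FE, $00, $00]: the UTF-16 LE mark is a prefix of the UTF-32 LE mark, so a detector built on a prefix comparison like this one has to test the longer mark first.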

The following procedure shows how to save text using different encoding types:

procedure WriteText(stream: TStream; const buffer: WideString;
  const encodingType: TEncodingType; withBom: Boolean = False);
var
  s: AnsiString;
  u: UTF8String;
  w: WideString;
  p: PAnsiChar;
  bom: TByteOrderMask;
  bytes: Integer;
begin
  p := nil;
  // Default for the UTF-16 cases: two bytes per code unit
  bytes := Length(buffer) * SizeOf(WideChar);
  case encodingType of
    etAnsi:
    begin
      // Convert to the system ANSI code page; characters that the code page
      // cannot represent are lost.
      s := AnsiString(buffer);
      p := PAnsiChar(s);
      bytes := Length(s);
    end;
    etUTF8:
    begin
      u := UTF8Encode(buffer);
      p := PAnsiChar(u);
      bytes := Length(u);
    end;
    etUnicode:
    begin
      p := PAnsiChar(PWideChar(buffer));
    end;
    etUnicodeBE:
    begin
      // StrSwapByteOrder swaps in place, so work on a copy of the const parameter
      w := buffer;
      StrSwapByteOrder(PWideChar(w));
      p := PAnsiChar(PWideChar(w));
    end;
    else  // UTF-32 is left as an exercise for the reader
    begin
      raise Exception.Create('Not Implemented.');
    end;
  end;
  // Write the BOM only once the encoding is known to be supported
  if withBom and TryGetBOM(encodingType, bom) then
  begin
    stream.Write(bom[0], Length(bom));
  end;
  stream.Write(p^, bytes);
end;
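
For the big-endian case, the byte swap can also be done without an external library. The sketch below is one possible way, using the RTL function Swap on each 16-bit code unit and returning a new string rather than modifying its argument; SwapUtf16ByteOrder is a hypothetical name, not a JCL or RTL routine.

```pascal
// Hypothetical helper: returns a copy of s in which the two bytes of every
// UTF-16 code unit are exchanged (little endian <-> big endian).
function SwapUtf16ByteOrder(const s: WideString): WideString;
var
  i: Integer;
begin
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    Result[i] := WideChar(Swap(Word(s[i])));  // e.g. $0041 becomes $4100
end;
```

Applying the function twice gives back the original text, which makes it usable for both writing and reading big-endian data.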

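Note that wrapping these routines in a class would give a cleaner structure.

3. Simplified/Traditional Chinese conversion

Conversion covers two directions: Simplified to Traditional and Traditional to Simplified. Both work by looking each character up in a character mapping table. A unit available online, described as "a unit that performs code-page conversion and Simplified/Traditional conversion using encoding lookup tables", is built on this principle, so it is not covered in detail here.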
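
To make the lookup idea concrete, here is a toy sketch with a three-entry table. It is not the unit mentioned above; TCharPair, SampleTable, and SimplifiedToTraditional are illustrative names, and a real table would hold thousands of pairs and use a faster search than a linear scan.

```pascal
type
  TCharPair = record
    Simplified, Traditional: WideChar;
  end;

const
  // Three illustrative entries only.
  SampleTable: array[0..2] of TCharPair = (
    (Simplified: #$6C49; Traditional: #$6F22),  // 汉 -> 漢
    (Simplified: #$4F53; Traditional: #$9AD4),  // 体 -> 體
    (Simplified: #$7B80; Traditional: #$7C21)   // 简 -> 簡
  );

// Replaces each character found in the table; all others are kept as-is.
function SimplifiedToTraditional(const s: WideString): WideString;
var
  i, j: Integer;
begin
  Result := s;
  for i := 1 to Length(Result) do
    for j := Low(SampleTable) to High(SampleTable) do
      if Result[i] = SampleTable[j].Simplified then
      begin
        Result[i] := SampleTable[j].Traditional;
        Break;
      end;
end;
```

A one-to-one table like this cannot handle Simplified characters that correspond to more than one Traditional character (发 can be 發 or 髮, for instance); choosing between them requires context, which a per-character lookup does not have.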

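{ TODO: wrap the character-encoding module in classes (OOP) and make it available for download }
{ TODO: look into Simplified/Traditional conversion }

References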

Determining the actual length of a DBCS string

http://www.cnblogs.com/baoquan/articles/1027371.html
