Unicode support for Erlang


Erlang added Unicode support in R13A. This article covers the two data types involved, list and binary, and the modules involved: unicode and io (in stdlib) and file (in kernel).

Binary

The bit syntax gains three UTF-related type specifiers: utf8, utf16, and utf32, which correspond to the UTF-8, UTF-16, and UTF-32 encodings respectively.

Binary construction

When a binary is constructed with a UTF type specifier, the integer value must fall in one of three ranges: 0..16#D7FF, 16#E000..16#FFFD, or 16#10000..16#10FFFF. Otherwise the construction fails with a badarg ("bad argument") error. The same value produces a different binary depending on which UTF type is specified.

With utf8, each integer generates 1 to 4 bytes; with utf16, each integer generates 2 or 4 bytes; with utf32, each integer generates 4 bytes.

For example, build a binary from the Unicode code point 1024 (U+0400):

Erlang Code 
    1> <<1024/utf8>>.
    <<208,128>>
    2> <<1024/utf16>>.
    <<4,0>>
    3> <<1024/utf32>>.
    <<0,0,4,0>>
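
The byte counts and the badarg behaviour described above can be checked with a small sketch. The module and function names (utf_size_demo, sizes/1, try_build/1) are hypothetical and chosen only for illustration.

```erlang
-module(utf_size_demo).
-export([sizes/1, try_build/1]).

%% Number of bytes each UTF type produces for a single code point.
sizes(CodePoint) ->
    [{utf8,  byte_size(<<CodePoint/utf8>>)},
     {utf16, byte_size(<<CodePoint/utf16>>)},
     {utf32, byte_size(<<CodePoint/utf32>>)}].

%% Constructing a binary from a value outside the valid ranges
%% (for example the surrogate 16#D800) raises badarg.
try_build(CodePoint) ->
    try <<CodePoint/utf8>> of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, badarg}
    end.
```

For code point 1024, sizes/1 reports 2 bytes for utf8, 2 for utf16, and 4 for utf32, matching the shell session above. A code point above 16#FFFF, such as 16#10000, takes 4 bytes in all three encodings.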


Binary matching

When a binary is matched with a UTF type specifier, a successful match binds the variable to an integer in one of the same three ranges: 0..16#D7FF, 16#E000..16#FFFD, or 16#10000..16#10FFFF.

The number of bytes consumed by the match depends on the UTF type:

utf8 matches 1 to 4 bytes (see RFC 2279)
utf16 matches 2 or 4 bytes (see RFC 2781)
utf32 matches 4 bytes

For example, continuing the example above:

Erlang Code 
    4> Bin = <<1024/utf8>>.
    <<208,128>>
    5> <<U/utf8>> = Bin.
    <<208,128>>
    6> U.
    1024



In this example, U matches 2 bytes.

A unit specifier cannot be given for the UTF-related types.
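
Matching one code point at a time is enough to decode a whole UTF-8 binary. The following is a minimal sketch of that idea; the module name utf8_decode_demo and the function codepoints/1 are hypothetical, and the return shape is an assumption of this example rather than an OTP convention.

```erlang
-module(utf8_decode_demo).
-export([codepoints/1]).

%% Decode a UTF-8 binary into a list of code points by repeatedly
%% matching a single /utf8 segment (1 to 4 bytes each time).
codepoints(Bin) when is_binary(Bin) ->
    codepoints(Bin, []).

codepoints(<<>>, Acc) ->
    {ok, lists:reverse(Acc)};
codepoints(<<CP/utf8, Rest/binary>>, Acc) ->
    codepoints(Rest, [CP | Acc]);
codepoints(Rest, Acc) ->
    %% The remaining bytes do not start with a valid UTF-8 sequence.
    {error, lists:reverse(Acc), Rest}.
```

For instance, codepoints(<<208,128,97>>) yields {ok, [1024, 97]}: the first match consumes 2 bytes and the second consumes 1.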

List

In a list, each Unicode character is represented by one integer, so unlike a Latin-1 list, a Unicode list can contain elements greater than 255.
The following is a valid Unicode list: [1024, 1025]

Lists can be converted to binaries with the unicode module.
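
As a short sketch of that conversion, the module below compares unicode:characters_to_binary/1 with the plain list_to_binary/1 BIF, which only accepts bytes. The module and function names are hypothetical.

```erlang
-module(unicode_list_demo).
-export([to_utf8/1, bytes_only/1]).

%% Encode a Unicode list as a UTF-8 binary.
to_utf8(List) ->
    unicode:characters_to_binary(List).

%% list_to_binary/1 expects byte values (0..255), so a list holding
%% larger code points makes it fail with badarg.
bytes_only(List) ->
    try list_to_binary(List) of
        Bin -> {ok, Bin}
    catch
        error:badarg -> {error, badarg}
    end.
```

With the list [1024, 1025], to_utf8/1 returns <<208,128,208,129>>, while bytes_only/1 returns {error, badarg}.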

Unicode module

First, see the following type definitions:

unicode_binary() = binary() with characters encoded in the UTF-8 coding standard
unicode_char() = integer() representing a valid Unicode code point
chardata() = charlist() | unicode_binary()
charlist() = [unicode_char() | unicode_binary() | charlist()]
A unicode_binary() is allowed as the tail of the list

external_unicode_binary() = binary() with characters coded in a user-specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
external_chardata() = external_charlist() | external_unicode_binary()
external_charlist() = [unicode_char() | external_unicode_binary() | external_charlist()]
An external_unicode_binary() is allowed as the tail of the list

latin1_binary() = binary() with characters coded in ISO Latin-1
latin1_char() = integer() representing a valid Latin-1 character (0-255)
latin1_chardata() = latin1_charlist() | latin1_binary()
latin1_charlist() = [latin1_char() | latin1_binary() | latin1_charlist()]
A latin1_binary() is allowed as the tail of the list
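
To make the chardata() definition concrete, here is a small sketch that flattens a nested value mixing code points, UTF-8 binaries, and a sublist with a binary tail. The module name chardata_demo and the sample data are made up for illustration.

```erlang
-module(chardata_demo).
-export([sample/0, flatten/1]).

%% A chardata() value: an integer code point, a UTF-8 binary, and a
%% nested list whose tail is a binary (an improper list, which the
%% charlist() definition explicitly allows).
sample() ->
    [1024, <<208,129>>, [97 | <<"b">>]].

%% Convert any chardata() to a flat list of code points.
flatten(CharData) ->
    unicode:characters_to_list(CharData).
```

Calling flatten(sample()) gives [1024, 1025, 97, 98]: the binary <<208,129>> is decoded as UTF-8 into the single code point 1025.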


We can call unicode:characters_to_list/1,2 to convert chardata(), latin1_chardata(), or external_chardata() to a Unicode list.

If the input encoding is latin1, the Data parameter corresponds to iodata(). The result is a list in which every element is an integer. unicode:characters_to_list/1 is equivalent to unicode:characters_to_list(Data, unicode).

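If the chardata is in another encoding, we can pass the InEncoding argument. On success the function returns the resulting list. On failure it returns {error, List, RestData}, where List is the part that was converted successfully and RestData is the rest of the input, starting at the point where the error occurred. If the input ends in the middle of a character, it returns {incomplete, List, Binary} instead.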
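
The different return shapes can be handled with a single case expression. This is only a sketch: the module name convert_result_demo, the function describe/2, and the tags ok, invalid, and truncated are assumptions of this example, not part of the unicode module.

```erlang
-module(convert_result_demo).
-export([describe/2]).

%% Classify the result of unicode:characters_to_list/2.
describe(Data, InEncoding) ->
    case unicode:characters_to_list(Data, InEncoding) of
        {error, Converted, Rest} ->
            %% Rest starts at the offending input.
            {invalid, Converted, Rest};
        {incomplete, Converted, Rest} ->
            %% The input stopped in the middle of a character.
            {truncated, Converted, Rest};
        List when is_list(List) ->
            {ok, List}
    end.
```

For example, the byte 255 can never occur in UTF-8, so describe(<<97,255,98>>, utf8) reports invalid input after converting "a", whereas describe(<<97,255,98>>, latin1) succeeds with [97, 255, 98].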

We can also call unicode:characters_to_binary/1,2,3 to convert chardata(), latin1_chardata(), or external_chardata() to a binary. This function works like unicode:characters_to_list, except that the result is a binary.

If Data is latin1_chardata() and both the input and output encodings are latin1, unicode:characters_to_binary/3 gives the same result as erlang:iolist_to_binary/1. With the default UTF-8 output, characters above 127 are encoded as two bytes, so the results differ.
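
The relationship with erlang:iolist_to_binary/1 can be seen by converting the same Latin-1 data three ways. This is a minimal sketch; the module name latin1_bin_demo and the function compare/1 are hypothetical.

```erlang
-module(latin1_bin_demo).
-export([compare/1]).

%% Convert Latin-1 iodata three ways and return the three binaries.
compare(Data) ->
    {iolist_to_binary(Data),
     %% Latin-1 in, Latin-1 out: the bytes are kept unchanged.
     unicode:characters_to_binary(Data, latin1, latin1),
     %% Latin-1 in, UTF-8 out: characters above 127 become two bytes.
     unicode:characters_to_binary(Data, latin1, utf8)}.
```

With the single character 233 (U+00E9), compare([233]) returns {<<233>>, <<233>>, <<195,169>>}. For pure 7-bit ASCII input, all three binaries are identical.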

The unicode module also has two BOM-related functions: bom_to_encoding/1 returns the encoding identified by a byte order mark (BOM), and encoding_to_bom/1 generates the BOM for a given encoding. They are often used when reading and saving files.
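
A possible way to use the two BOM functions together is sketched below: one helper prepends a BOM when encoding, the other detects and strips it when decoding. The module name bom_demo and both helper names are hypothetical, and the sketch assumes the input is valid so that the conversions do not return an error tuple.

```erlang
-module(bom_demo).
-export([with_bom/2, decode/1]).

%% Encode Chars in the given encoding and prepend the matching BOM.
%% Assumes Chars is valid chardata().
with_bom(Chars, Encoding) ->
    Bom = unicode:encoding_to_bom(Encoding),
    Body = unicode:characters_to_binary(Chars, unicode, Encoding),
    <<Bom/binary, Body/binary>>.

%% Detect the encoding from the BOM, skip the BOM bytes, and decode
%% the rest. Without a BOM, bom_to_encoding/1 returns {latin1, 0}.
decode(Bin) ->
    {Encoding, BomLength} = unicode:bom_to_encoding(Bin),
    <<_Bom:BomLength/binary, Body/binary>> = Bin,
    unicode:characters_to_list(Body, Encoding).
```

For example, with_bom([1024], utf8) produces <<239,187,191,208,128>>, and decode/1 turns that binary back into [1024].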

Examples

1. Read a file saved as UTF-8
The file (test.txt in the code below) has the following contents. The desc string is Chinese for "This is a test file":
[
{desc, "这是一个测试文件"},
{author, "Litaocheng"}
].

The file holds Erlang terms and is saved with UTF-8 encoding.
The code is as follows:


Erlang Code 
    %% Read content from the file
    test1() ->
        {ok, [Terms]} = file:consult("test.txt"),
        Desc = proplists:get_value(desc, Terms),
        _Author = proplists:get_value(author, Terms),
        %% Output Desc as a binary and as a list
        DescUniBin = iolist_to_binary(Desc),
        DescUniList = unicode:characters_to_list(DescUniBin),
        io:format("desc bin: ~ts~ndesc bin: ~p~n", [DescUniBin, DescUniBin]),
        io:format("desc list: ~ts~ndesc list: ~p~n", [DescUniList, DescUniList]).



Results:
desc bin: 这是一个测试文件
desc bin: <<232,191,153,230,152,175,228,184,128,228,184,170,230,181,139,232,
            175,149,230,150,135,228,187,182>>
desc list: 这是一个测试文件
desc list: [36825,26159,19968,20010,27979,35797,25991,20214]

First the string read from the file is converted from a list to a binary; DescUniBin is the corresponding Unicode binary. unicode:characters_to_list/1 then converts it to a Unicode list for the final output.
Every element of the Unicode list is an integer (one code point per character), while the Unicode binary holds the same string encoded as UTF-8.
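
The difference between the format directives used above can be isolated in a small sketch built on io_lib:format/2. The module name ts_format_demo and the function render/2 are hypothetical.

```erlang
-module(ts_format_demo).
-export([render/2]).

%% Format a single argument and return the flat list of characters.
render(Format, Data) ->
    lists:flatten(io_lib:format(Format, [Data])).
```

With the UTF-8 binary <<208,128>>, render("~ts", <<208,128>>) returns [1024] because ~ts decodes the binary as UTF-8, whereas render("~s", <<208,128>>) returns [208,128] because ~s treats each byte as one Latin-1 character.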

2. Save data in UTF-8 format

Erlang Code 
    %% Save the binary in UTF-8 format
    test2() ->
        %% In a module file the literal below is a list of UTF-8 bytes
        %% (see the update at the end of the article).
        [DescList] = io_lib:format("~ts", ["这是一个测试文件"]),
        DescBin = erlang:iolist_to_binary(DescList),
        DescList2 = unicode:characters_to_list(DescBin),
        List = lists:concat(["[{desc,\"", DescList2, "\"}, {author, \"Litaocheng\"}]."]),
        Bin = unicode:characters_to_binary(List),
        io:format("bin is:~ts~n", [Bin]),
        file:write_file("test_out.txt", Bin).
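
The same save-and-reload cycle can be reduced to a round trip through a file. The sketch below assumes the input is already a list of code points; the module name utf8_file_demo, the function roundtrip/2, and the file path used when calling it are hypothetical.

```erlang
-module(utf8_file_demo).
-export([roundtrip/2]).

%% Write Chars (a list of code points) to Path as UTF-8, read the
%% bytes back, and decode them into code points again.
roundtrip(Path, Chars) ->
    Bin = unicode:characters_to_binary(Chars),  % UTF-8 by default
    ok = file:write_file(Path, Bin),
    {ok, Read} = file:read_file(Path),
    unicode:characters_to_list(Read).
```

If the round trip works, the returned list equals the list that was passed in, for example [36825, 26159] for the first two characters of the test string.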




Update (2008.5.4):
[DescList] = io_lib:format("~ts", ["这是一个测试文件"])
In the Erlang shell, DescList is: [36825,26159,19968,20010,27979,35797,25991,20214]
In a module file, DescList is:
[232,191,153,230,152,175,228,184,128,228,184,170,230,181,139,232,175,
 149,230,150,135,228,187,182]
