[Erlang 00124] Erlang Unicode Addendum

Source: Internet
Author: User
Tags: printable characters


I recently watched the video of "Bring Unicode to Erlang", the talk Patrik gave at Erlang User Conference 2013. The talk gives a thorough overview of Unicode-related issues in Erlang and essentially walks through the "Using Unicode in Erlang" document. I learned a few new things, wrote them up here, and filled in some content that was missing from [Erlang 0062] Erlang Unicode.

 

The video is here: http://www.youtube.com/watch?v=M6hPLCA0F-Y

PDF here: http://www.erlang-factory.com/upload/presentations/847/PatrikEUC2013.pdf

 

Evolution

 

Erlang's Unicode support depends on environment variables and on the OTP version. Most of the time, what is affected is how Unicode data is displayed. The original post illustrated the evolution of Unicode support across OTP versions with a figure at this point (figure not preserved).

Problem domain

 

Let's look at the specific parts of the Erlang Unicode problem domain:

Erlang shell
    Can I enter Chinese characters in the Erlang shell?
    Can the Erlang shell display Chinese characters?
Erlang code
    Code file encoding method
Files
    How to resolve file names

 

 

 

Let's work through them one by one:

 

Can I enter Chinese Characters in Erlang shell?

 

The LANG and LC_CTYPE environment variables affect the Erlang shell: they tell the terminal program whether to handle Unicode. (These variables also affect how file names are interpreted; OTP uses +fna by default, which is discussed later.) You can check the resulting setting with io:getopts(), and you can inspect the variables themselves with echo $LANG and echo $LC_CTYPE. For example, on my CentOS machine the environment variable is:

# echo $LANG
en_US.UTF-8

Next, set the variable to Latin-1 when starting erl:

LC_CTYPE=en_US.ISO-8859-1 /usr/local/bin/erl

  

Now try entering Chinese characters and watch how strangely the shell behaves. How can this be fixed? With io:setopts([{encoding,unicode}]). The test below starts erl with the Latin-1 setting. Entering the Chinese word "我们" produces a list with the right UTF-8 bytes, [230,136,145,228,187,172], but the characters are echoed incorrectly. Setting the encoding to unicode with io:setopts solves the problem.

# LC_CTYPE=en_US.ISO-8859-1 /usr/local/bin/erl
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.3  (abort with ^G)
1> io:getopts().
[{expand_fun,#Fun<group.0.565o74>},{echo,true},{binary,false},{encoding,latin1}]
2> "210 \ 221 [C.
[230,136,145,228,187,172]
3> io:setopts([{encoding,unicode}]).
ok
4> "我们".
[25105,20204]
5>

  

When Erlang is started and uses-oldshell or-noshell, Latin1 is used by default (bytewise encoding represents one character in a single byte). When the interactive shell is started, the encoding method is selected according to the environment variable configuration. after Erlang is started, you can use IO: setopts to modify the global encoding mode, No matter what parameters are used at the beginning of the system startup. see the following test:
Erlang/OTP 17 [erts-6.0] [Source] [64-bit] [SMP: 4: 4] [async-threads: 10] [hipe] [kernel-Poll: false] eshell v6.0 (abort with ^ g) 1> "we ". "We" 2> IO: Format ("~ TP ", [V (1)])." We "ok4> IO: Format ("~ TS ", [V (1)]). We ok5> IO: Format ("~ TS ", [lists: seq (20204,20220)]). We recommend that you raise your opinion about the price of your goods ("~ TS ", [lists: seq (20204,20290)]). the prices of these documents are as follow: some of the best guys will have an umbrella. Wei Chuan, Wei Chuan, Wei, Yu, Yu, Lu, Yu, Wei, Wei i/O: getopts (). [{expand_fun, # Fun <group.0.100149429 >}, {echo, true}, {binary, false}, {encoding, Unicode}] 8> IO: setopts ([{encoding, latin1}]). ok9> IO: Format ("~ TS ", [lists: seq (20204,20290)]). \ x {4eec} \ x {4eed} \ x {4eee} \ x {4eef} \ x {4ef0} \ x {4ef1} \ x {4ef2} \ x {4ef3} \ x {4ef4} \ x {4ef5} \ x {4ef6} \ x {4ef7} \ x {4ef8} \ x {4ef9} \ x {4efa} \ x {4efb} \ x {4efc} \ x {4efd} \ x {4efe} \ x {4eff} \ x {4f00} \ x {4f01} \ x {4f02} \ x {4f03} \ x {4f04} \ x {4f05} \ x {4f06} \ x {4f07} \ x {4f08} \ x {4f09} \ x {4f0a} \ x {4f0b} \ x {4f0c} \ x {4f0d} \ x {4f0e} \ x {4f0f} \ x {4f10} \ x {4f11} \ x {4f12} \ x {4f13} \ x {4f14} \ x {4f15} \ x {4f16} \ x {4f17} \ x {4f18} \ x {4f19} \ x {4f1a} \ x {4f1b} \ x {4f1c} \ x {4f1d} \ x {4f1e} \ x {4f1f} \ x {4f20} \ x {4f21} \ x {4f22} \ x {4f23} \ x {4f24} \ x {4f25} \ x {4f26} \ x {4f27} \ x {4f28} \ x {4f29} \ x {4f2a} \ x {4f2b} \ x {4f2c} \ x {4f2d} \ x {4f2e} \ x {4f2f} \ x {4f30} \ x {4f31} \ x {4f32} \ x {4f33} \ x {4f34} \ x {4f35} \ x {4f36} \ x {4f37} \ x {4f38} \ x {4f39} \ x {4f3a} \ x {4f3b} \ x {4f3c} \ x {4f3d} \ x {4f3e} \ x {4f3f} \ x {4f40} \ x {4f41} \ x {4f42} ok10> IO: format ("~ TS ", [V (1)]). \ x {6211} \ x {4eec} ok11> IO: setopts ([{encoding, Unicode}]). ok12> IO: Format ("~ TS ", [V (1)]). We ok13> IO: Format ("~ TS ", [lists: seq (20204,20290)]). the prices of these documents are as follow: some of the best guys will have an umbrella. 
Wei Chuan, Wei Chuan, Wei, Yu, Yu, Lu, Yu, Wei, Wei too many rows too many ok14>
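As a rough sketch of the getopts/setopts pattern used in the session above, the following module reads the encoding of standard I/O and switches it temporarily. The module name io_encoding_demo and both function names are made up for illustration, and the code assumes io:getopts() succeeds and returns a list containing an encoding entry, as it does in the sessions shown.

```erlang
-module(io_encoding_demo).
-export([current_encoding/0, with_encoding/2]).

%% Return the encoding currently set on standard I/O (latin1 or unicode).
%% Assumption: io:getopts() returns an option list, not {error, Reason}.
current_encoding() ->
    proplists:get_value(encoding, io:getopts()).

%% Run Fun with standard I/O switched to Encoding, then restore the
%% previous setting even if Fun raises an exception.
with_encoding(Encoding, Fun) when is_function(Fun, 0) ->
    Old = current_encoding(),
    ok = io:setopts([{encoding, Encoding}]),
    try
        Fun()
    after
        io:setopts([{encoding, Old}])
    end.
```

A call such as io_encoding_demo:with_encoding(unicode, fun() -> io:format("~ts~n", [[25105,20204]]) end) would then print the two characters regardless of the encoding selected at startup, provided the terminal itself can render them.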

 

Can the Erlang shell display Chinese characters?

 

The odd ways in which the Erlang shell displayed text literals above actually come from its heuristic string detection mechanism. In short, the shell checks whether the data in a list or binary could consist of printable characters. For example, the binary <<230,136,145,228,187,172,229,173,166,228,185,160>> is considered printable, and the output is <<"我们学习"/utf8>>.

Do you remember the trick for output where the shell "helpfully" prints data as characters? To see the raw values, append a 0 to the end of the data: [25105] is printed as "我", while [25105,0] is printed as is. Appending 0 simply sidesteps heuristic string detection. Note that in the following experiment erl is started with:

erl +pc unicode

  

+pc selects the shell's printable character range; it can be erl +pc latin1 or erl +pc unicode. With +pc unicode, [25105] is displayed as "我". By default erl starts with latin1, and in that case [25105] is not interpreted as "我", as the second session below shows.

 

# ERL + PC unicodeerlang/OTP 17 [erts-6.0] [Source] [64-bit] [SMP: 2: 2] [async-threads: 10] [hipe] [kernel-Poll: false] eshell v6.0 (abort with ^ g) 1 ><< 230,136,145,228,187,172,229,173,166,228,185,160>. <"Learning"/utf8> 2> <230,136,145,228,187,172,229,173,166,228,185,160, 69,114,108, 97,110,103>. <"we learn Erlang"/utf8> 3> $ me. 251054> <230,136,145,228,187,172,229,173,166,228,185,160, 69,114,108, 97,110,103, 0>. <230,136,145,228,187,172,229,173,166,228,185,160, 69,114,108, 110,103, 0> 5> [25105]. "I" 6> [25105, 0]. [25105, 0]

 

 

# Erlerlang/OTP 17 [erts-6.0] [Source] [64-bit] [SMP: 2: 2] [async-threads: 10] [hipe] [kernel-Poll: false] eshell v6.0 (abort with ^ g) 1> $ me. 251052> [25105]. [25105] 3>

 

The functions io:printable_range/0 and io_lib:printable_list/1 let us check the printable character range of the current shell and determine whether a list is printable. See the following example:

 

# erl +pc unicode
Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.0  (abort with ^G)
1> io_lib:printable_list([25105,20204]).
true
2> [25105,20204].
"我们"
3> io:printable_range().
unicode
4>

# erl
Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.0  (abort with ^G)
1> io_lib:printable_list([25105,20204]).
false
2> [25105,20204].
[25105,20204]
3> io:printable_range().
latin1
4>
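To make the detection rule a little more concrete, here is a minimal sketch that classifies a list without depending on the +pc setting, using io_lib:printable_latin1_list/1 and io_lib:printable_unicode_list/1 from STDLIB. The module name printable_demo and the helpers classify/1 and raw/1 are hypothetical names chosen for this illustration.

```erlang
-module(printable_demo).
-export([classify/1, raw/1]).

%% Classify a list of integers independently of the +pc startup flag.
classify(L) ->
    case {io_lib:printable_latin1_list(L), io_lib:printable_unicode_list(L)} of
        {true, _}      -> latin1_printable;   % shown as a string in any shell
        {false, true}  -> unicode_printable;  % shown as a string only with +pc unicode
        {false, false} -> not_printable       % always shown as a list of integers
    end.

%% The "append 0" trick: 0 is not a printable character, so the
%% resulting list can never pass string detection.
raw(L) when is_list(L) ->
    L ++ [0].
```

With this sketch, printable_demo:classify([25105,20204]) gives unicode_printable, and classifying the result of printable_demo:raw([25105]) gives not_printable, which matches the behaviour of [25105] and [25105,0] in the sessions above.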

 

This heuristic is also used by io:format/2 and io_lib:format/2: ~tp is affected by the +pc flag, while ~ts is not.

 

# ERL + PC Unicode 7> IO: Format ("~ TS ", [[2, 25105]). I ok8> IO: Format ("~ TP ", [[25105])." I "ok9> # ERL + PC Latin1 3> IO: Format ("~ TS ", [[2, 25105]). I ok4> IO: Format ("~ TP ", [[25105]). [25105] ok5>
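When the output must not depend on +pc at all, one option is to avoid ~tp and choose between ~ts (always text) and ~w (always the raw term). The sketch below uses io_lib:format/2 so the result can be inspected as data; the module name format_demo and its two functions are illustrative names only.

```erlang
-module(format_demo).
-export([as_text/1, as_term/1]).

%% ~ts always treats its argument as Unicode character data,
%% so the code points come back unchanged.
as_text(Chars) ->
    unicode:characters_to_list(io_lib:format("~ts", [Chars])).

%% ~w always writes the raw term and never applies string detection.
as_term(Term) ->
    lists:flatten(io_lib:format("~w", [Term])).
```

For example, format_demo:as_term([25105,20204]) returns the string "[25105,20204]" under either +pc setting, while format_demo:as_text([25105,20204]) returns the two code points unchanged.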

 

 

Code file encoding method

 

When the Erlang source code is compiled, it is easy to add a comment header to the file. If not, the file will be parsed according to the default encoding method. Epp: default_encoding/0 returns the default encoding method used by the current OTP version. r16b is Latin1 and 17.0 is utf8.
-Module (coding).-compile (export_all). A ()-> "we learn Erlang ".

  

 
Erlang/OTP 17 [erts-6.0] [Source] [64-bit] [SMP: 2: 2] [async-threads: 10] [hipe] [kernel-Poll: false] eshell v6.0 (abort with ^ g) 1> coding: (). [69,114,108, 97,110,103] 2> IO: Format ("~ TS ", [V (1)]). Let's learn erlangok3> q (). ok4>

  

The default encoding of source files in R16B is latin1, so the same code produces the following output in R16B:

Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.3  (abort with ^G)
1> coding:a().
[230,136,145,228,187,172,229,173,166,228,185,160,69,114,108,97,110,103]
2> io:format("~ts",[v(1)]).
æ??ä»¬å­¦ä¹ Erlangok
3>

  

On the machine where R16 is located, modify the code and add the annotation header for declaring the file encoding.
%-*-Coding: UTF-8-* -- module (coding).-compile (export_all). A ()-> "we learn Erlang ".

  

If you want to state explicitly that a file is Latin-1 encoded, add the comment header %% -*- coding: latin-1 -*-. For details, refer to [LINK].
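The header lookup can be tried out without compiling anything. The sketch below relies on epp:read_encoding_from_binary/1 and epp:default_encoding/0 from STDLIB; the module name source_encoding_demo and the functions declared/1 and effective/1 are names invented for this example.

```erlang
-module(source_encoding_demo).
-export([declared/1, effective/1]).

%% Return the encoding declared in the comment header of the given
%% source text (utf8 or latin1), or none if there is no declaration.
declared(Source) when is_binary(Source) ->
    epp:read_encoding_from_binary(Source).

%% The encoding that would apply to the source: the declared one if
%% present, otherwise the default of the running OTP version.
effective(Source) ->
    case declared(Source) of
        none -> epp:default_encoding();
        Enc  -> Enc
    end.
```

Calling source_encoding_demo:effective/1 on source text without a header should give latin1 on R16B and utf8 on 17.0, in line with the defaults described above.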

 

Next we will add a method to the previous test code file to return a binary sequence:

B ()-> <"let's learn Erlang">.

  

 

# /usr/local/bin/erl
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.3  (abort with ^G)
1> coding:b().
<<17,236,102,96,69,114,108,97,110,103>>
2> io:format("~ts",[v(1)]).
^Qìf`Erlangok
3> q().
ok

  

You might guess that a missing +pc unicode is the cause. I know it is not, but let's try it anyway:

# /usr/local/bin/erl +pc unicode
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.3  (abort with ^G)
1> coding:b().
<<17,236,102,96,69,114,108,97,110,103>>
2> io:format("~ts",[v(1)]).
^Qìf`Erlangok
3> q().
ok

  

Where is the problem? Utf8 Descriptor

 

B ()-> <"let's learn Erlang"/utf8>.
#/Usr/local/bin/erlerlang r16b02 (erts-5.10.3) [Source] [64-bit] [SMP: 2: 2] [async-threads: 10] [hipe] [kernel-Poll: false] eshell v5.10.3 (abort with ^ g) 1> coding: B (). <230,136,145,228,187,172,229,173,166,228,185,160, 69,114,108, 110,103> 2> IO: Format ("~ TS ", [V (1)]). Learn erlangok3>

  

A simple experiment shows the difference between the two:

# erl
Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V6.0  (abort with ^G)
1> "αβ".
[945,946]
2> <<"αβ">>.
<<"±²">>
3> <<"αβ"/utf8>>.
<<206,177,206,178>>
5> <<177,178>>.
<<"±²">>

  

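What does expression 2 in the test above actually produce? Look at expression 5: the output is the same as for <<177,178>>. In other words, without the utf8 specifier each code point is truncated to its low 8 bits when the binary is built (945 becomes 177 and 946 becomes 178), so the data is already damaged before it is displayed.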
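The truncation can be reproduced with plain code points, which keeps the example independent of the source file encoding. This is only a sketch: the module name binary_demo and the functions truncated/1 and utf8/1 are hypothetical.

```erlang
-module(binary_demo).
-export([truncated/1, utf8/1]).

%% Default segment type: every integer becomes one 8-bit segment, so a
%% code point above 255 keeps only its low byte.
truncated(CodePoints) ->
    << <<C>> || C <- CodePoints >>.

%% With the utf8 type specifier each code point is encoded as its full
%% UTF-8 byte sequence.
utf8(CodePoints) ->
    << <<C/utf8>> || C <- CodePoints >>.
```

Here binary_demo:truncated([945,946]) yields <<177,178>> and binary_demo:utf8([945,946]) yields <<206,177,206,178>>, the same bytes that unicode:characters_to_binary([945,946]) produces. Applying truncated/1 to [25105,20204,23398,20064] gives <<17,236,102,96>>, which is exactly the damaged prefix seen in the R16B session for b/0.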

 

How to resolve file names

 

As for whether file names contain Unicode: unless the file names come from external resources you cannot control, this problem can be avoided through project conventions, and there is no need to solve it with code or other technical means.

Different flags at erl startup control how file names are interpreted:

+fnl    interpret file names as Latin-1
+fnu    interpret file names as Unicode
+fna    choose automatically based on environment variables; this is also the current default

You can check the setting with file:native_name_encoding/0.

Eshell V5.10.3  (abort with ^G)
1> file:native_name_encoding().
latin1
2>

Eshell V6.0  (abort with ^G)
1> file:native_name_encoding().
utf8
2>
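If file names from an uncontrolled source do have to be handled, one cautious approach is to check up front whether a name can be represented in the encoding the VM uses for file names. The sketch below combines file:native_name_encoding/0 with unicode:characters_to_binary/3; the module name filename_demo, the function to_native/1, and the not_representable error atom are assumptions made for this illustration.

```erlang
-module(filename_demo).
-export([to_native/1]).

%% Try to convert a file name (Unicode character data) to the byte
%% representation of the VM's native file name encoding.
to_native(Name) ->
    case unicode:characters_to_binary(Name, unicode, file:native_name_encoding()) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _, _}           -> {error, not_representable};
        {incomplete, _, _}      -> {error, not_representable}
    end.
```

A pure ASCII name such as "notes.txt" converts under either setting, whereas a name containing 25105 converts only when file:native_name_encoding() returns utf8.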

 

Finally

The unicode, io, file, group, user, re, wx, and string modules deserve special attention when working with Unicode. Many readers will probably care most about regular expressions (re). The "Using Unicode in Erlang" document contains a great deal of information and ends with solutions and code for some common problems. If you are interested, try them out. That's all for today.

 
