Perl Unicode conversion (mostly collected from the Internet)

Source: Internet
Author: User

Perl Unicode conversion overview:

1. Determine the encoding of the input source.
2. Convert the input to Perl's internal form.
A. If the input source is UTF-8 encoded: Encode::_utf8_on($str); turns on the utf8 flag. (The Encode documentation marks _utf8_on as an internal function to be used only when the data is known to be valid UTF-8; decode('UTF-8', $octets) is the safer route.)
B. If the input source is not UTF-8 encoded: $string = decode(ENCODING, $octets [, CHECK]); converts the input from ENCODING to utf8 and turns on the utf8 flag.
3. Output
$str = Encode::encode(ENCODING, $str); encodes the string from utf8 to the specified encoding and turns off the utf8 flag.
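The three steps above can be sketched as one round trip. This is only an illustration: convert_octets is a hypothetical helper name, and the GB2312 input bytes are an assumed example, not something from the original article.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert octets from one encoding to another
# by passing through Perl's internal character form.
sub convert_octets {
    my ($from, $to, $octets) = @_;
    my $chars = decode($from, $octets);   # step 2: octets -> characters
    return encode($to, $chars);           # step 3: characters -> octets
}

my $gb_bytes   = "\xD6\xD0\xB9\xFA";      # "中国" encoded as GB2312 (assumed input)
my $utf8_bytes = convert_octets('gb2312', 'UTF-8', $gb_bytes);

printf "%d bytes in, %d bytes out\n", length $gb_bytes, length $utf8_bytes;
# prints "4 bytes in, 6 bytes out"
```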

---------- The following content is taken from the Internet ----------

Perl internal form

Perl has only two kinds of strings: character strings (Perl strings) and bytes, also known as octets. A character string is held in Perl's internal utf8 form; a byte string is a raw byte stream in whatever encoding it arrived in (ASCII, GB2312, UTF-8 and so on).

The utf8 flag
How does Perl tell whether a string holds octets or utf8 characters? It relies entirely on the string's utf8 flag. Internally, a Perl string consists of two parts: the data and the utf8 flag.

For example, the string "中国" ("China") is stored in Perl as follows:
utf8 flag    data
on           中国
If the utf8 flag is on, Perl treats "中国" as a utf8 character string. If the flag is off, Perl treats it as octets. Note, however, that the utf8 flag only records how Perl interprets the data; it cannot be used to determine whether the bytes really are UTF-8 encoded.
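To look at the flag without changing it, utf8::is_utf8 can be used. The sketch below assumes the six UTF-8 bytes of "中国" as sample data and compares the byte string with its decoded form.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE4\xB8\xAD\xE5\x9B\xBD";   # UTF-8 bytes for "中国" (sample data)
my $chars  = decode('UTF-8', $octets);     # the same text as a character string

# utf8::is_utf8 reports the flag only; it does not say which encoding
# a byte string happens to be in.
printf "octets: flag %s, length %d\n",
    (utf8::is_utf8($octets) ? 'on' : 'off'), length $octets;
printf "chars:  flag %s, length %d\n",
    (utf8::is_utf8($chars) ? 'on' : 'off'), length $chars;
# prints "octets: flag off, length 6" and "chars:  flag on, length 2"
```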

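Example 1:
use Encode;
use strict;

# Source file saved as UTF-8, without "use utf8".
my $str = "中国";               # "China"; starts out as octets
Encode::_utf8_on($str);         # force the utf8 flag on
print length($str) . "\n";      # length in characters
Encode::_utf8_off($str);        # turn the flag off again
print length($str) . "\n";      # length in bytes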

The running result is:
Malformed UTF-8 character (unexpected end of string) at Unicode.pl line 30
2
6

The example uses the _utf8_on and _utf8_off functions of the Encode module to switch the utf8 flag of the string "中国" ("China"). When the flag is on, the string is treated as a utf8 character string and its length is 2. When the flag is off, it is treated as octets (a byte array) and the length is 6. My editor saves the file as UTF-8; if your editor uses GB2312, the length with the flag off is 4. In that case the bytes are GB2312, not UTF-8, yet turning the flag on makes Perl interpret them as utf8, which produces the error: Malformed UTF-8 character (unexpected end of string).
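One way to avoid forcing the flag onto bytes that are not UTF-8 is to let Perl validate them first. The sketch below uses the built-in utf8::decode, which returns false when its argument is not valid UTF-8; utf8_char_length is a hypothetical helper name and both byte strings are assumed sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: the length in characters if the octets are valid
# UTF-8, otherwise undef. The caller's string is not modified.
sub utf8_char_length {
    my ($octets) = @_;
    my $copy = $octets;                    # utf8::decode works in place
    return utf8::decode($copy) ? length $copy : undef;
}

my %samples = (
    'UTF-8'  => "\xE4\xB8\xAD\xE5\x9B\xBD",   # "中国" as UTF-8
    'GB2312' => "\xD6\xD0\xB9\xFA",           # "中国" as GB2312
);

for my $label (sort keys %samples) {
    my $len = utf8_char_length($samples{$label});
    printf "%s bytes: %s\n", $label,
        defined $len ? "$len characters" : "not valid UTF-8";
}
# prints "GB2312 bytes: not valid UTF-8" and "UTF-8 bytes: 2 characters"
```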

 

String Source

To apply the principles above, we first need to know, for each string, what encoding it is in and whether its utf8 flag is on or off. There are several cases.

1) Command-line parameters and standard input. The encoding of a string that comes from a command-line parameter or from standard input (STDIN) depends on the locale. If your locale is zh_CN or zh_CN.gb2312, the incoming string is GB2312 encoded. If your locale is zh_CN.gbk, it is GBK encoded. If your locale is zh_CN.utf8, it is UTF-8 encoded. Whatever the encoding, the utf8 flag of the incoming string is off.
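A possible way to decode command-line parameters according to the locale is sketched below. It assumes the core module I18N::Langinfo is available on the platform; decode_with_codeset is a hypothetical helper, and the fallback to UTF-8 is an assumption made for illustration, not a rule from the article.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode octets using the named codeset, falling
# back to UTF-8 when Encode does not recognise the name.
sub decode_with_codeset {
    my ($codeset, $octets) = @_;
    my $enc = find_encoding($codeset) || find_encoding('UTF-8');
    return decode($enc, $octets);
}

# The value depends on the current locale, e.g. "UTF-8" or "GB2312".
my $codeset = langinfo(CODESET);
my @args    = map { decode_with_codeset($codeset, $_) } @ARGV;

printf "locale codeset: %s, %d argument(s) decoded\n", $codeset, scalar @args;
```

After this step the elements of @args are character strings, so length and regular expressions operate on characters rather than bytes.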

2) Your source code. The encoding of string literals depends on the encoding the source file is saved in. In EditPlus, you can view and change the encoding through "File" -> "Save". On Linux, you can cat the source file: if the Chinese characters display correctly, the source encoding matches that of the locale. The utf8 flag of strings in the source code is also off.

If your source code contains Chinese characters, it is best to follow two rules: 1) save the source file as UTF-8; 2) add the "use utf8;" statement at the beginning of the file. All string literals in your source code are then utf8 character strings with the utf8 flag on.
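A minimal sketch of the two rules, assuming this file itself is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;                        # the source file is assumed to be UTF-8
use Encode qw(encode);

my $str   = "中国";              # a character string because of "use utf8"
my $bytes = encode('UTF-8', $str);

printf "characters: %d, UTF-8 bytes: %d\n", length $str, length $bytes;
# prints "characters: 2, UTF-8 bytes: 6"
```

Without the "use utf8;" line, length $str would report 6, because the literal would be read as raw bytes.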

3) Reading from a file. The data you read is in whatever encoding the file was saved in. After reading, the utf8 flag is off.
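The difference between reading raw bytes and reading through a decoding I/O layer can be sketched as follows. The temporary file and the slurp helper are hypothetical and exist only for this illustration; the :encoding(UTF-8) layer is standard Perl and is one form of the "special processing" that changes the default behaviour.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "中国" to a temporary file (sample data).
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "\xE4\xB8\xAD\xE5\x9B\xBD";
close $fh or die "close: $!";

# Hypothetical helper: read a whole file through the given I/O layer.
sub slurp {
    my ($file, $layer) = @_;
    open my $in, "<$layer", $file or die "open $file: $!";
    local $/;                    # read everything in one go
    my $data = <$in>;
    close $in;
    return $data;
}

my $raw     = slurp($path, ':raw');              # octets, flag off
my $decoded = slurp($path, ':encoding(UTF-8)');  # characters, flag on

printf "raw length: %d, decoded length: %d\n", length $raw, length $decoded;
# prints "raw length: 6, decoded length: 2"
```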

Summary: without special processing, the utf8 flag of a string is off and Perl treats the string as octets. We then use $string = decode(ENCODING, $octets) to decode the byte stream. decode interprets the byte stream according to the encoding you name, converts it from that encoding to utf8, and turns on the utf8 flag. However, if the string contains only ASCII (or EBCDIC) data, the utf8 flag may not be turned on. Note: I do not know how to determine whether a string is ASCII or EBCDIC encoded.
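When the encoding of the octets is only a guess, the CHECK argument of decode can help. The sketch below passes Encode::FB_CROAK so that decode dies on bytes that are invalid for the named encoding, and Encode::LEAVE_SRC so that the input is left untouched. decode_first_match is a hypothetical helper, and a successful decode does not prove the guess was right, only that the bytes are valid in that encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try each candidate encoding in order and return
# the name and decoded text of the first one that decodes without error.
sub decode_first_match {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $chars = eval {
            decode($name, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        return ($name, $chars) if defined $chars;
    }
    return;                      # nothing matched
}

my $gb_bytes = "\xD6\xD0\xB9\xFA";   # "中国" as GB2312 (sample data)
my ($name, $chars) = decode_first_match($gb_bytes, 'UTF-8', 'gb2312');

printf "decoded as %s, %d characters\n", $name, length $chars;
# prints "decoded as gb2312, 2 characters"
```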

Output

After the program has processed a string correctly, the string has to be shown to the user. For that we need to convert it from Perl's internal form into a form the user can accept. Put simply, the string is converted from utf8 to the output encoding, that is, the encoding of the presentation interface. We use $str = Encode::encode('charset', $str); to convert the string from utf8 to the specified encoding and turn off the utf8 flag.
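As a rough illustration, the same character string can be encoded into different output encodings. The two charset names below are just examples:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $str = "\x{4E2D}\x{56FD}";    # "中国" as a two-character string

for my $charset ('UTF-8', 'gb2312') {
    my $octets = encode($charset, $str);
    printf "%-6s %d bytes, utf8 flag %s\n",
        $charset, length $octets, (utf8::is_utf8($octets) ? 'on' : 'off');
}
# prints "UTF-8  6 bytes, utf8 flag off" and "gb2312 4 bytes, utf8 flag off"
```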

Output can also be divided into several cases:
1) Standard output. The encoding of standard output is the same as that of the locale. On output, the utf8 flag should be off; otherwise a warning like this one appears:
Wide character in print at Unicode.pl line 10.

2) ....
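Besides calling encode by hand, an I/O layer can do the encoding at print time, which also avoids the "Wide character" warning. The sketch below assumes the terminal expects UTF-8; the in-memory filehandle is used only to make the produced bytes visible.

```perl
use strict;
use warnings;

my $str = "\x{4E2D}\x{56FD}";    # "中国" as a character string

# Print through an encoding layer into an in-memory buffer to see
# which bytes the layer produces.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $str;
close $out or die "close: $!";
printf "%d bytes written\n", length $buffer;   # prints "6 bytes written"

# The same layer on STDOUT (UTF-8 terminal assumed): no warning is issued.
binmode(STDOUT, ':encoding(UTF-8)') or die "binmode: $!";
print $str, "\n";
```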

 
