Perl has used UTF-8 internally to represent characters since version 5.6. In other words, processing Chinese and other non-English text should pose no problem. We only need to make good use of the Encode module to take full advantage of Perl's UTF-8 characters.
The following describes how to process Chinese text. For example, if you want to split a Chinese string into single characters, you can write it like this:
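use strict;
use warnings;
use Encode;

# Chinese for "test text"; this script file is assumed to be saved as GB2312,
# so the literal below is a GB2312 byte stream.
my $dat = "测试文本";
my $str = decode("gb2312", $dat);    # byte stream -> Perl string
my @chars = split //, $str;          # one element per character
foreach my $char (@chars) {
    print encode("gb2312", $char), "\n";    # Perl string -> GB2312 bytes
}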
Run it and look at the result: each character should be printed on its own line.
The decode and encode functions of the Encode module are used here. To understand what these two functions do, we first need to understand a few concepts:
1. A Perl string is UTF-8 encoded and consists of Unicode characters rather than individual bytes. Each UTF-8 encoded Unicode character occupies 1 to 4 bytes (variable length).
2. When data enters or leaves the Perl processing environment (for example, output to the screen, or reading and saving files), the Perl string is not used directly. It has to be converted into a byte stream, and the encoding used for the conversion is up to you (or up to Perl). Once a Perl string has been encoded into a byte stream, the concept of a character no longer exists and the data is just a sequence of bytes. How to interpret that sequence is your own job.
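The difference between characters and bytes in the two points above can be checked with length. The following is only a minimal sketch: it assumes the Encode module with its Chinese encodings is installed, and it writes the GB2312 bytes as hex escapes so that the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# GB2312 bytes for the two Chinese characters U+4E2D and U+6587.
my $octets = "\xD6\xD0\xCE\xC4";

my $string = decode("gb2312", $octets);    # byte stream -> Perl string

my $byte_count = length $octets;                    # counts bytes
my $char_count = length $string;                    # counts characters
my $utf8_count = length encode("UTF-8", $string);   # bytes after encoding as UTF-8

print "bytes in GB2312: $byte_count\n";
print "characters:      $char_count\n";
print "bytes in UTF-8:  $utf8_count\n";
```

The same two characters take 4 bytes in GB2312 and 6 bytes in UTF-8, while the decoded Perl string has a length of 2.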
We can see that if you want Perl to treat text according to our notion of characters, the text data must always be held as Perl strings. However, the characters we write are generally stored as plain bytes (this includes strings written as plain text in the program itself), that is, as a byte stream. This is where the encode and decode functions come in.
The encode function, as its name implies, encodes a Perl string. It encodes the characters of a Perl string in the specified encoding and turns them into a byte stream. It is therefore needed whenever you deal with things outside the Perl processing environment. The format is simple:
$octets = encode(encoding, $string [, check])
$string: the Perl string
encoding: the name of the encoding to use
$octets: the resulting encoded byte stream
check: how to handle malformed characters during conversion (that is, characters that Perl cannot convert). Generally you do not need it.
The encodings available vary greatly with the language environment. By default, Perl recognizes utf8, ascii, ascii-ctrl, iso-8859-1 and so on.
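To make the check argument and the list of encodings more concrete, here is a small sketch. It assumes the standard Encode module; Encode::FB_CROAK is one of the check values it provides, and Encode->encodings() lists the encodings loaded in the current process. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "A\x{4E2D}";    # "A" followed by the Chinese character U+4E2D

# Without check: a character the target encoding cannot represent
# is replaced by a substitution character ("?" for ascii).
my $lenient = encode("ascii", $string);

# With Encode::FB_CROAK the same conversion dies instead; eval catches it.
# A copy is passed because encode may modify its argument when check is set.
my $copy   = $string;
my $strict = eval { encode("ascii", $copy, Encode::FB_CROAK) };
my $failed = defined $strict ? 0 : 1;

print "lenient result: $lenient\n";
print "strict conversion failed: $failed\n";

# Encodings that are loaded right now.
my @loaded = Encode->encodings();
print "loaded: @loaded\n";
```

Passing ":all" to Encode->encodings() should also list the encodings that can be loaded on demand, such as the Chinese ones.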
The decode function decodes a byte stream. It interprets the given byte stream according to the encoding you specify and converts it into a UTF-8 encoded Perl string. In general, text data obtained from a terminal or a file should be converted into a Perl string with decode. The format is as follows:
$string = decode(encoding, $octets [, check])
$string, encoding, $octets, and check have the same meanings as above.
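The effect of decode can be seen by applying the same string functions before and after decoding. This is a sketch under the same assumptions as before (Encode with GB2312 support available, bytes given as hex escapes); the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xD6\xD0\xCE\xC4";           # GB2312 bytes for U+4E2D U+6587
my $string = decode("gb2312", $octets);    # byte stream -> Perl string

my $first_char = substr($string, 0, 1);    # one whole character
my $first_byte = substr($octets, 0, 1);    # only half of a two-byte character

printf "first character: U+%04X\n", ord $first_char;
printf "first byte:      0x%02X\n", ord $first_byte;

# Reversing the decoded string swaps characters, not bytes,
# so encoding the result again still gives valid GB2312.
my $reversed = encode("gb2312", scalar reverse $string);
my $hex      = uc unpack("H*", $reversed);
print "reversed bytes:  $hex\n";
```

On the decoded string, substr and reverse work on whole characters; on the raw byte stream, substr cuts a Chinese character in half.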
Now the program written above is easy to understand. Because the string is written as plain text and stored as a byte stream, it has lost its meaning as characters. Therefore, you must first use the decode function to convert it into a Perl string. Because Chinese characters are generally encoded in gb2312, decode is given the gb2312 encoding here. After the conversion, Perl treats the characters the same way we do. Functions that operate on strings can basically process the characters correctly, except for functions that by nature treat a string as a pile of bytes (such as vec, pack, and unpack). So split can cut the string into single characters. Finally, because a UTF-8 encoded Perl string cannot be used directly for output, you also need the encode function to encode each character back into a gb2312 byte stream before printing it.
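The whole decode, split, encode sequence can also be wrapped in a small function. The sketch below is only one way to do it: split_chars is a hypothetical helper name, not part of Encode, and the input bytes are again given as hex escapes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# split_chars (hypothetical helper): takes a byte stream and the name of its
# encoding, and returns one encoded byte string per character.
sub split_chars {
    my ($octets, $encoding) = @_;
    my $string = decode($encoding, $octets);
    return map { encode($encoding, $_) } split //, $string;
}

my $octets = "\xD6\xD0\xCE\xC4";              # GB2312 bytes for U+4E2D U+6587

my @chars = split_chars($octets, "gb2312");   # two two-byte pieces
my @bytes = split //, $octets;                # without decode: four single bytes

print scalar(@chars), " characters, ", scalar(@bytes), " bytes\n";
print join(" ", map { uc unpack("H*", $_) } @chars), "\n";
```

Splitting the undecoded byte stream gives four one-byte pieces, which is why the decode step has to come first.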