Perl Chinese Text Processing Tips

Source: Internet
Author: User
Since version 5.6, Perl has used UTF-8 internally to represent characters, which means that processing Chinese and other non-ASCII text should pose no real problems. We just need the Encode module to take full advantage of Perl's UTF-8 support.

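The following is an example of processing Chinese text. Suppose we have a Chinese string such as "测试文本" ("test text") and want to split it into individual characters. It can be written like this:

use strict;
use warnings;
use Encode;

# The script file is assumed to be saved as GB2312, so $dat holds GB2312 bytes.
my $dat   = "测试文本";
my $str   = decode("gb2312", $dat);    # bytes -> Perl character string
my @chars = split //, $str;            # one element per character
foreach my $char (@chars) {
    print encode("gb2312", $char), "\n";    # characters -> GB2312 bytes for output
}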

Try it and see the result for yourself: each character should be printed on its own line.

The program relies mainly on the decode and encode functions of the Encode module. To understand what these two functions do, we need to be clear about a few concepts:

1. A Perl string consists of Unicode characters rather than single bytes. Internally Perl stores such strings as UTF-8, in which each Unicode character takes 1 to 4 bytes (variable length).

2. When data enters or leaves the Perl processing environment (output to the screen, reading and writing files, and so on), the Perl string is not used directly. It has to be converted to a byte stream, and which encoding is used for that conversion is entirely up to you (or to Perl's defaults). Once a Perl string has been encoded into a byte stream, the notion of a character no longer exists. It becomes a plain sequence of bytes, and interpreting those bytes is your job.
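The two points above can be illustrated with a minimal sketch. It assumes the two characters U+4E2D and U+6587 ("中文") as sample data, written as code-point escapes so the result does not depend on how the script file is saved. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode;

# Two Chinese characters given as code points (U+4E2D U+6587),
# so the example is independent of the source file's encoding.
my $str = "\x{4e2d}\x{6587}";

# Encoding turns the character string into a byte stream.
my $utf8_bytes = encode("UTF-8",  $str);    # 3 bytes per character here
my $gb_bytes   = encode("gb2312", $str);    # 2 bytes per character here

printf "characters:   %d\n", length $str;           # 2
printf "UTF-8 bytes:  %d\n", length $utf8_bytes;    # 6
printf "GB2312 bytes: %d\n", length $gb_bytes;      # 4
```

The same two characters become byte streams of different lengths depending on the encoding chosen, which is why a byte stream alone does not tell you where one character ends and the next begins.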

So if we want Perl to treat text according to our notion of characters, the text data must be held as a Perl string. But the text we write is generally stored as plain bytes (this includes string literals written in the program), that is, as a byte stream. This is where the encode and decode functions come in.

The encode function, as its name suggests, encodes a Perl string. It converts the characters of the Perl string to the specified encoding and produces a byte stream, so it is what you normally use when passing data to anything outside the Perl processing environment. Its format is simple:
$octets = encode(ENCODING, $string [, CHECK])

$string: the Perl string to encode
ENCODING: the name of the encoding to use
$octets: the byte stream produced by the encoding
CHECK: how to handle malformed characters (that is, characters that cannot be converted) during conversion. It can usually be omitted
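As a small sketch of what CHECK changes, the following encodes a string containing one Chinese character to ASCII, first with the default behavior and then with Encode::FB_CROAK. The sample string and variable names are assumptions for illustration; FB_CROAK and LEAVE_SRC are constants provided by the Encode module.

```perl
use strict;
use warnings;
use Encode;

my $str = "a\x{4e2d}b";    # "a", U+4E2D, "b"

# Default CHECK: a character that ASCII cannot represent is replaced
# by the encoding's substitution character.
my $lossy = encode("ascii", $str);
print "default: $lossy\n";    # default: a?b

# FB_CROAK makes encode die instead; LEAVE_SRC keeps $str unmodified.
my $strict = eval { encode("ascii", $str, Encode::FB_CROAK | Encode::LEAVE_SRC) };
print "strict:  ", (defined $strict ? $strict : "failed, not representable in ASCII"), "\n";
```

Leaving CHECK out is convenient, but the silent substitution can hide data loss; asking encode to die is one way to notice that the chosen encoding does not cover the text.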

The encodings available vary considerably with the language environment. Those recognized by default include utf8, ascii, ascii-ctrl, iso-8859-1, and so on.
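To see which encodings a particular installation knows about, something like the following sketch can be used. It relies on Encode->encodings and find_encoding from the Encode module; the variable names are illustrative, and the exact lists printed depend on the Perl installation.

```perl
use strict;
use warnings;
use Encode;

# Encodings that are loaded right now (the built-in ones).
my @loaded = Encode->encodings();
print "loaded: @loaded\n";

# Every encoding this installation can load on demand.
my @all = Encode->encodings(":all");
print "available: ", scalar(@all), "\n";

# Resolve a name or alias to an encoding object; undef if unknown.
my $enc = find_encoding("gb2312");
print "gb2312 resolves to: ", ($enc ? $enc->name : "(not available)"), "\n";
```

Checking a name with find_encoding before calling encode or decode avoids a fatal "Unknown encoding" error at the point of conversion.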

The decode function decodes a byte stream. It interprets the given bytes according to the encoding you specify and converts them to a Perl string. In general, text data obtained from a terminal or a file should be converted to a Perl string with decode. Its format is:

$string = decode(ENCODING, $octets [, CHECK])
$string, ENCODING, $octets, and CHECK have the same meanings as above.
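Used together, decode and encode convert text from one encoding to another. The sketch below assumes a GB2312 byte stream for the characters U+4E2D U+6587 ("中文"), written as byte escapes, and re-encodes it as UTF-8; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode;

# GB2312 bytes for U+4E2D U+6587, written as escapes so the example
# does not depend on the encoding of this file.
my $gb_bytes = "\xd6\xd0\xce\xc4";

my $str        = decode("gb2312", $gb_bytes);    # bytes -> characters
my $utf8_bytes = encode("UTF-8",  $str);         # characters -> bytes

printf "%d bytes in, %d characters, %d bytes out\n",
    length $gb_bytes, length $str, length $utf8_bytes;
```

The Perl string in the middle is the only form in which the data is treated as characters; both ends are plain bytes in a specific encoding.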

Now the program above is easy to understand. The string is written as a literal in the source, so it is already stored as a byte stream and has lost its meaning as characters. The first step is therefore to convert it to a Perl string with decode. Chinese text is commonly encoded as GB2312, so decode is given the gb2312 encoding here. After the conversion, Perl treats characters the way we do, and the usual string functions operate on characters, except for those that treat a string as a sequence of bytes (such as vec, pack, and unpack). That is why split can cut the string into individual characters. Finally, the Perl string cannot be written out directly, so encode is used to turn each character back into a GB2312 byte stream before it is printed.
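To see why the decode step matters, the following sketch compares split and reverse on the raw bytes with the same operations on the decoded string. It assumes the GB2312 bytes for U+4E2D U+6587 ("中文") as input; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode;

my $gb_bytes = "\xd6\xd0\xce\xc4";    # GB2312 bytes for U+4E2D U+6587

# Without decode, split sees four separate bytes.
my @byte_parts = split //, $gb_bytes;

# After decode, split sees two characters.
my $str        = decode("gb2312", $gb_bytes);
my @char_parts = split //, $str;

printf "split on bytes: %d parts, split on characters: %d parts\n",
    scalar(@byte_parts), scalar(@char_parts);

# Reversing characters keeps each character intact;
# reversing raw bytes scrambles the two-byte sequences.
my $reversed_chars = encode("gb2312", scalar reverse $str);
my $reversed_bytes = scalar reverse $gb_bytes;
print "reversed results differ\n" if $reversed_chars ne $reversed_bytes;
```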