Perl Chinese Processing Tips - Application Tips

Last Update:2017-01-18 Source: Internet

Author: User

Since version 5.6, Perl has used UTF-8 internally to represent characters, which means it can in principle process Chinese and other non-ASCII text without problems. We only need the Encode module to take full advantage of Perl's UTF-8 support.

The following example processes Chinese text. Suppose we have a Chinese string such as "测试文本" ("Test text") and want to split it into individual characters. It can be written like this:

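use strict;
use warnings;
use Encode;

# Assumes this script is saved in GB2312, so the literal below holds GB2312 bytes.
my $dat   = "测试文本";                    # a Chinese string meaning "Test text"
my $str   = decode("gb2312", $dat);        # byte stream -> Perl string
my @chars = split //, $str;                # one element per character
foreach my $char (@chars) {
    print encode("gb2312", $char), "\n";   # Perl string -> GB2312 bytes
}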

Try it and see: the result should be satisfactory, with each character printed on its own line.

The example mainly uses the decode and encode functions of the Encode module. To understand what these two functions do, we need to be clear about a few concepts:

1. A Perl string is encoded internally in UTF-8. It consists of Unicode characters rather than individual bytes, and each UTF-8 encoded Unicode character occupies 1 to 4 bytes (variable length).

2. When data enters or leaves the Perl processing environment (output to the screen, reading and saving files, and so on), the Perl string is not used directly; it has to be converted to a byte stream. Which encoding that conversion uses is entirely up to you (or is chosen by Perl). Once a Perl string has been encoded into a byte stream, the concept of a character no longer exists. It becomes a plain sequence of bytes, and interpreting those bytes is your job.
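
The difference between characters and bytes described in these two points can be seen in a minimal sketch. The string is written with \x{...} escapes, an assumption made here so that the example does not depend on how the script file itself is saved; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two Unicode characters, U+6D4B and U+8BD5 (the Chinese word for "test"),
# written as escapes so the source file encoding does not matter.
my $string = "\x{6D4B}\x{8BD5}";

# Leaving Perl: the same characters become different byte streams
# depending on the encoding that is chosen.
my $utf8_octets = encode("UTF-8",  $string);
my $gb_octets   = encode("gb2312", $string);

printf "characters:   %d\n", length($string);       # 2
printf "UTF-8 bytes:  %d\n", length($utf8_octets);  # 6
printf "GB2312 bytes: %d\n", length($gb_octets);    # 4
```

The character count stays the same, while the byte count depends on the encoding used at the boundary.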

So if we want Perl to treat text according to our notion of characters, the text data must be held in the form of a Perl string. But the text we write, including string literals in a program, is generally stored as plain bytes, that is, as a byte stream. This is where the encode and decode functions are needed.

The encode function, as its name suggests, encodes a Perl string. It converts the characters of the Perl string to the specified encoding and produces a byte stream, so it is typically needed when handing data to something outside the Perl processing environment. Its format is simple:
$octets = encode(ENCODING, $string [, CHECK])

$string: the Perl string
ENCODING: the encoding to use
$octets: the byte stream produced by the encoding
CHECK: how to handle malformed characters (that is, characters that Perl does not recognize) during conversion. It is generally not needed.
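
As a rough illustration of the CHECK argument, the sketch below encodes a string containing a character that ASCII cannot represent. The sample string and variable names are made up for this example; Encode::FB_CROAK and Encode::LEAVE_SRC are constants provided by the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{E9}";    # "cafe" with an acute accent on the e (U+00E9)

# Without CHECK, a character the target encoding cannot represent
# is replaced by a substitution character ("?" for ascii).
my $lossy = encode("ascii", $string);
print "$lossy\n";            # caf?

# With FB_CROAK, encode dies instead of substituting. LEAVE_SRC keeps
# encode from modifying $string while it works.
my $strict = eval {
    encode("ascii", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "encode failed: $@" unless defined $strict;
```

Leaving CHECK out is the simplest choice when a little data loss is acceptable; passing FB_CROAK is one way to make such losses visible.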

The available encodings vary greatly with the language environment. Those recognized by default include utf8, ascii, ascii-ctrl, iso-8859-1, and so on.
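
To check which encoding names are recognized on a given installation, Encode offers the class method Encode->encodings(). A small sketch, with illustrative variable names:

```perl
use strict;
use warnings;
use Encode;

# Encodings that are already loaded and can be used right away.
my @loaded = Encode->encodings();
print join(", ", @loaded), "\n";

# With ":all", encodings that Encode can load on demand are listed too,
# which normally includes the Chinese ones supplied by Encode::CN.
my @all = Encode->encodings(":all");
print scalar(@all), " encodings available\n";
```

The exact lists depend on the Perl installation, so no particular output is shown here.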

The decode function decodes a byte stream. It interprets the given byte stream according to the encoding you specify and converts it to a UTF-8 encoded Perl string. In general, text data obtained from a terminal or a file should be converted to a Perl string with decode. Its format is:

$string = decode(ENCODING, $octets [, CHECK])
$string, ENCODING, $octets, and CHECK have the same meanings as above.
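
A minimal sketch of decode follows. The byte values are the GB2312 encoding of the four characters U+6D4B, U+8BD5, U+6587, U+672C, written as escapes on the assumption that this is how the bytes would arrive from a GB2312 file or terminal; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Eight GB2312 bytes, as they might arrive from a file or terminal.
my $octets = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";

# Interpret the bytes as GB2312 and get a Perl string of characters back.
my $string = decode("gb2312", $octets);

printf "%d bytes -> %d characters\n", length($octets), length($string);

# Show the Unicode code point of each character.
printf "U+%04X\n", ord($_) for split //, $string;
```

After decoding, length and split count characters rather than bytes.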

Now the program above is easy to understand. Because the string is written literally in the source, it is stored as a byte stream and has lost its meaning as characters, so the first step is to convert it to a Perl string with decode. Chinese characters are generally encoded in GB2312, so decode uses the gb2312 encoding here. After the conversion, Perl treats characters the way we do: the usual string functions operate on characters, except for those that treat a string as a sequence of bytes (such as vec, pack, and unpack). That is why split can cut the string into individual characters. Finally, because a UTF-8 encoded Perl string cannot be used directly for output, encode converts each character back to a GB2312 byte stream before it is printed.
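
The effect of the decode step can be checked with a short sketch that splits the same data with and without decoding. The bytes are again written as escapes (an assumption, so that the result does not depend on the encoding of the script file), and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# GB2312 bytes for a four-character Chinese string.
my $dat = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";

# Splitting the raw byte stream yields single bytes, not characters.
my @bytes = split //, $dat;

# Splitting the decoded Perl string yields whole characters.
my @chars = split //, decode("gb2312", $dat);

printf "pieces without decode: %d\n", scalar(@bytes);   # 8
printf "pieces with decode:    %d\n", scalar(@chars);   # 4

# Each character is encoded back to GB2312 before it is printed.
my @encoded = map { encode("gb2312", $_) } @chars;
print "$_\n" for @encoded;
```

Without decode, split would cut every two-byte character in half, which is the problem the original program avoids.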

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
