Perl Chinese processing skills

Last Update:2018-12-08 Source: Internet

Author: User


Since version 5.6, Perl has used UTF-8 internally to represent characters, so processing Chinese and other non-English text should pose no problem. We only need to make good use of the Encode module to take full advantage of Perl's UTF-8 character strings.

The following describes how to process Chinese text. For example, to split a Chinese string into single characters, you can write:

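use Encode;
# The script is assumed to be saved in GB2312; the literal stands in for a Chinese string.
my $dat = "test text";
# Convert the raw bytes into a Perl character string.
my $str = decode("gb2312", $dat);
# An empty pattern splits the string into individual characters.
my @chars = split //, $str;
foreach my $char (@chars) {
    # Encode each character back to GB2312 bytes before printing.
    print encode("gb2312", $char), "\n";
}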

Try it and check the result: each character is printed on its own line, which should be what you expect.

The decode and encode functions of the Encode module are used here. To understand what these two functions do, we first need to clarify a few concepts:
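The effect of decode is easiest to see by running split on both forms of the same data. The following is a minimal sketch, assuming the sample text is the two-character string 中文; its GB2312 bytes are written as hex escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# GB2312 bytes of the two characters U+4E2D U+6587 (assumed sample data).
my $dat = "\xD6\xD0\xCE\xC4";

# Without decode, split sees four separate bytes.
my @bytes = split //, $dat;

# After decode, split sees two characters.
my @chars = split //, decode("gb2312", $dat);

printf "%d pieces before decode, %d after\n", scalar @bytes, scalar @chars;
```

Splitting the undecoded data cuts every Chinese character in half, which is why the decode step comes first.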

1. A Perl string is encoded in UTF-8 and consists of Unicode characters rather than individual bytes. Each UTF-8 encoded Unicode character occupies 1 to 4 bytes (Perl's extended form allows longer sequences).

2. When data enters or leaves the Perl processing environment (for example, when printing to the screen or reading and writing files), the Perl string is not used directly. It must be converted to a byte stream, and the encoding used for the conversion is chosen by you (or by Perl). Once a Perl string has been encoded into a byte stream, the concept of a character no longer exists; the data is just a sequence of bytes, and how to interpret it is up to you.
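The two points above can be illustrated with a short sketch. It assumes a sample two-character string built from code points and shows that the same Perl string becomes byte streams of different lengths depending on the encoding chosen.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A two-character Perl string built from code points (assumed sample data).
my $string = "\x{4E2D}\x{6587}";

my $gb   = encode("gb2312", $string);   # byte stream in GB2312
my $utf8 = encode("UTF-8",  $string);   # byte stream in UTF-8

printf "characters: %d, gb2312 bytes: %d, UTF-8 bytes: %d\n",
    length($string), length($gb), length($utf8);
```

The character count stays the same, while the byte counts depend entirely on the encoding picked at the boundary.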

So if we want Perl to treat text according to our notion of characters, the text data must always be held as Perl strings. However, the text we write is generally saved as plain bytes (including string literals written in program source), that is, as a byte stream. This is where the encode and decode functions come in.

The encode function, as its name implies, encodes a Perl string. It encodes the characters of a Perl string in the specified encoding and produces a byte stream, so it is what you need whenever data leaves the Perl processing environment. Its format is simple:
$octets = encode(ENCODING, $string [, CHECK])

$string: the Perl string
ENCODING: the encoding to use
$octets: the resulting byte stream
CHECK: how to handle malformed characters during conversion (that is, characters Perl cannot map). Usually it can be omitted.
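As a hedged illustration of the CHECK argument, the sketch below encodes a sample string containing a Chinese character into ascii, which cannot represent it. The sample string and variable names are assumptions for the example; FB_CROAK and LEAVE_SRC are constants provided by the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "A\x{4E2D}";   # "A" followed by the Chinese character U+4E2D

# Default CHECK: the unmappable character is replaced by a substitute.
my $lossy = encode("ascii", $string);

# FB_CROAK: die instead of substituting. LEAVE_SRC keeps $string unmodified.
my $strict = eval {
    encode("ascii", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $failed = defined $strict ? 0 : 1;

print "lossy result is ", length($lossy), " bytes long\n";
print "strict encode ", ($failed ? "failed" : "succeeded"), "\n";
```

Omitting CHECK is convenient, but it silently loses data when the target encoding cannot represent a character, so a strict value may be preferable when correctness matters.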

The available encodings vary with the language environment. By default, Perl recognizes utf8, ascii, ascii-ctrl, iso-8859-1, and so on.
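To check which encodings are available on a given installation, the Encode module can list them. This is a small sketch; the exact lists depend on the Perl version and installed modules, so no particular output is assumed.

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();         # encodings currently loaded
my @all    = Encode->encodings(":all");   # every encoding Encode can load

# Resolve the name used in this article to Encode's canonical name.
my $name = Encode::find_encoding("gb2312")->name;

print "loaded: @loaded\n";
printf "available: %d encodings\n", scalar @all;
print "gb2312 resolves to: $name\n";
```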

The decode function decodes a byte stream. It interprets the given bytes according to the encoding you specify and converts them into a Perl string in UTF-8. In general, text data obtained from a terminal or a file should be converted to a Perl string with decode. Its format is:

$string = decode(ENCODING, $octets [, CHECK])
$string, ENCODING, $octets, and CHECK have the same meanings as above.
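Because a byte stream carries no record of its encoding, decode only works when the encoding you name matches the data. The sketch below, using assumed sample bytes, decodes GB2312 data correctly and then shows a strict CHECK rejecting the same bytes when they are wrongly treated as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xD6\xD0\xCE\xC4";   # GB2312 bytes (assumed sample data)

# Correct encoding: two characters come back.
my $string = decode("gb2312", $octets);

# Wrong encoding with a strict CHECK: decode dies instead of guessing.
# LEAVE_SRC keeps $octets from being modified in place.
my $ok = eval {
    decode("UTF-8", $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
my $error = $ok ? "" : $@;

print "decoded ", length($string), " characters\n";
print "UTF-8 attempt failed: $error" if $error;
```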

Now the program above is easy to understand. Because the string is written in plain text and stored as a byte stream, its meaning as characters is lost, so it must first be converted to a Perl string with decode. Chinese text is commonly encoded in gb2312, so decode is given the gb2312 encoding here. After the conversion, Perl sees characters the same way we do. Functions that operate on strings can generally process the characters correctly, except for those that inherently treat a string as a run of bytes (such as vec, pack, and unpack). That is why split can cut the string into single characters. Finally, because a Perl string should not be written out directly, each character is encoded back into a gb2312 byte stream with encode before it is printed.
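The same decode, process, encode pattern applies to other string functions besides split. As a sketch with assumed sample data, reversing a decoded string swaps whole characters, whereas reversing the raw bytes would scramble them.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $dat = "\xD6\xD0\xCE\xC4";            # GB2312 bytes for U+4E2D U+6587

my $str      = decode("gb2312", $dat);   # bytes -> characters
my $flipped  = scalar reverse $str;      # reverse by character
my $reversed = encode("gb2312", $flipped);   # characters -> bytes

my $scrambled = scalar reverse $dat;     # reverse by byte: no longer valid text

printf "character reverse: %s\n", join " ", map { sprintf "%02X", ord } split //, $reversed;
printf "byte reverse:      %s\n", join " ", map { sprintf "%02X", ord } split //, $scrambled;
```

The output is printed as hex so that it reads the same regardless of the terminal's encoding.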

This article is an English version of an article originally published in Chinese on aliyun.com.
