Processing Chinese text in Perl (encode, decode)

Source: Internet
Author: User

Reposted from the forum:

Http://club.topsage.com/thread-2468696-1-1.html

Since version 5.6, Perl has used UTF-8 internally to represent characters, so in principle it can handle Chinese and other non-ASCII text without trouble. To take full advantage of this, we only need to make good use of the Encode module.
The following describes how to process Chinese text. For example, to split a Chinese string into individual characters, you can write:

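  use strict;
  use warnings;
  use Encode;

  # Stands in for a Chinese string literal; the script is assumed to be saved as GB2312
  my $dat = "test text";
  my $str = decode("gb2312", $dat);    # GB2312 bytes -> Perl character string
  my @chars = split //, $str;          # one element per character
  foreach my $char (@chars) {
      print encode("gb2312", $char), "\n";    # back to GB2312 bytes for output
  }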


Run it and see. The result should be satisfactory: each character is printed on its own line.
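The listing above only works as intended if the script file itself is saved as GB2312. A minimal self-contained sketch that avoids this dependency writes the input as byte escapes. The sample bytes are an assumption for illustration (the GB2312 encoding of the four characters 测试文本, "test text"), and code points are printed instead of raw GB2312 bytes so that the output is readable on any terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample data: GB2312 bytes for the four characters 测试文本 ("test text").
my $dat = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";

my $str   = decode("gb2312", $dat);   # bytes -> Perl character string
my @chars = split //, $str;           # one list element per character

printf "%d bytes, %d characters\n", length($dat), scalar @chars;

# Show each character as a Unicode code point.
printf "U+%04X\n", ord($_) for @chars;
```

The first line of output reports 8 bytes and 4 characters, which shows that split worked on characters rather than on bytes.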

The program uses the decode and encode functions of the Encode module. To understand what these two functions do, a few concepts need to be clear:
1. A Perl string is encoded in UTF-8 internally and consists of Unicode characters rather than individual bytes. Each UTF-8 encoded Unicode character occupies 1 to 4 bytes (variable length).
2. When data enters or leaves the Perl processing environment (for example, output to the screen, or reading and saving files), Perl strings are not used directly. A Perl string has to be converted into a byte stream, and the encoding used for that conversion is up to you (or Perl). Once a Perl string has been encoded into a byte stream, the notion of characters no longer exists; it is just a sequence of bytes, and how to interpret them is your own job.
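The difference between characters and bytes can be made concrete with a short sketch. It assumes a two-character sample string (中文, written as code points so the script's own encoding does not matter) and encodes it in two different ways; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters given as Unicode code points: 中 (U+4E2D) and 文 (U+6587).
my $string = "\x{4E2D}\x{6587}";

my $gb   = encode("gb2312", $string);   # byte stream in GB2312
my $utf8 = encode("UTF-8",  $string);   # byte stream in UTF-8

printf "characters:   %d\n", length $string;   # 2
printf "GB2312 bytes: %d\n", length $gb;       # 4
printf "UTF-8 bytes:  %d\n", length $utf8;     # 6
```

The same two characters become byte streams of different lengths depending on the encoding chosen, which is why a byte stream cannot be interpreted without knowing its encoding.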

So if we want Perl to treat text according to our notion of characters, text data must always be kept in Perl string form. However, the text we write is generally stored as a plain byte stream (this includes string literals written directly in the program), and this is where the encode and decode functions come in.

The encode function, as its name implies, encodes a Perl string. It encodes the characters of a Perl string in the specified encoding and turns them into a byte stream, so it is needed whenever data is handed to something outside the Perl processing environment. The syntax is simple:

  $octets = encode(ENCODING, $string [, CHECK])


$string: the Perl string
ENCODING: the name of the encoding to use
$octets: the resulting byte stream
CHECK: how to handle malformed characters during conversion (that is, characters Perl cannot map). It is generally not needed.
The encodings available vary considerably with the environment; by default utf8, ascii, ascii-ctrl, iso-8859-1 and a few others are recognized.
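As a rough illustration of the encoding list and the CHECK argument, the sketch below lists the encodings known to the local installation and then encodes a string containing a character that ASCII cannot represent. The sample string is an assumption for illustration; Encode::FB_CROAK and Encode::LEAVE_SRC are constants provided by the Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encodings loaded by default, and everything installed on this system.
my @loaded = Encode->encodings();
my @all    = Encode->encodings(":all");
printf "%d loaded by default, %d available\n", scalar @loaded, scalar @all;

my $string = "A\x{4E2D}";    # "A" followed by one Chinese character

# Default CHECK: a character the target encoding lacks is replaced
# by a substitution character.
my $lossy = encode("ascii", $string);

# FB_CROAK makes the same situation fatal; LEAVE_SRC keeps $string untouched.
my $strict = eval {
    encode("ascii", $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};

print "lossy result: $lossy\n";
print "strict encode failed: $@" unless defined $strict;
```

Leaving CHECK out is convenient, but it hides data loss; passing FB_CROAK is one way to find out early that the chosen encoding cannot represent the text.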

The decode function decodes a byte stream. It interprets the given byte stream according to the encoding you specify and converts it into a Perl string. In general, text data read from a terminal or a file should be converted into a Perl string with decode. The syntax is:

  $string = decode(ENCODING, $octets [, CHECK])


$string, ENCODING, $octets, and CHECK have the same meanings as above.
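Input from a terminal or file may not actually be in the encoding you expect. The following sketch wraps decode in a hypothetical helper, try_decode (not part of Encode), that returns undef instead of silently substituting characters when the bytes are invalid; the sample byte strings are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded Perl string, or undef if the
# bytes are not valid in the named encoding.
sub try_decode {
    my ($encoding, $octets) = @_;
    my $string = eval {
        decode($encoding, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $string;
}

my $good = "\xE4\xB8\xAD\xE6\x96\x87";   # UTF-8 bytes for 中文
my $bad  = "\xFF";                       # a byte that never occurs in valid UTF-8

my $decoded = try_decode("UTF-8", $good);
printf "good input: %d characters\n", length $decoded;
print  "bad input rejected\n" unless defined try_decode("UTF-8", $bad);
```

Without a CHECK argument, decode would accept the bad input and put a replacement character in the result, so the strict form is mainly useful when invalid data should be reported rather than passed along.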

Now the program above is easy to understand. Because the string literal is written as plain text, it is stored as a byte stream and has lost its meaning as characters, so it must first be converted into a Perl string with decode. Chinese text of this kind is commonly encoded in GB2312, so decode is given the gb2312 encoding here. After the conversion, Perl treats characters the same way we do. Functions that operate on strings can generally handle the characters correctly, except for those that inherently treat a string as a pile of bytes (such as vec, pack, and unpack). That is why split can cut the string into individual characters. Finally, because a UTF-8 encoded Perl string cannot be used directly for output, each character is encoded back into a GB2312 byte stream with encode and then output with print.
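split is not the only function that becomes character-aware after decoding. As a small sketch, assuming the same sample bytes as before (GB2312 for 测试文本), substr, reverse, and regular expressions also work on whole characters:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xB2\xE2\xCA\xD4\xCE\xC4\xB1\xBE";   # assumed GB2312 bytes for 测试文本
my $str   = decode("gb2312", $bytes);

my $first_two = substr($str, 0, 2);          # first two characters, not two bytes
my $reversed  = reverse $str;                # scalar context: character order reversed
my $han_count = () = $str =~ /\p{Han}/g;     # number of Han characters matched

# Encode again before comparing with, or writing out, GB2312 data.
print "prefix is the first two characters\n"
    if encode("gb2312", $first_two) eq "\xB2\xE2\xCA\xD4";
print "reverse keeps characters intact\n"
    if encode("gb2312", $reversed) eq "\xB1\xBE\xCE\xC4\xCA\xD4\xB2\xE2";
print "Han characters: $han_count\n";
```

Applied to the undecoded $bytes instead, substr($bytes, 0, 2) would return only the first character, and reverse would reorder individual bytes and produce invalid GB2312.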
