Matching Chinese Characters in Perl

Source: Internet
Author: User



Regular-expression character classes are often used to extract Chinese characters from a string or to exclude them. However, this is tedious and the results are rarely satisfactory. In fact, Perl has represented characters internally as utf8 since version 5.6, so in principle it has no trouble processing Chinese or any other language. The real obstacle is that the editors and file formats in common use do not support utf8 well, so this capability goes to waste. All we need is to make good use of the Encode module to take full advantage of Perl's utf8 characters.

The following uses Chinese text processing as an example. (Note: do not edit the following program in an editor set to utf8 encoding; the source file has to be saved in gb2312.) Suppose we have the Chinese string "测试文本" ("test text") and want to split it into individual characters. The program can be written as follows:
use Encode;
use Encode::CN;   # optional
$dat = "测试文本";   # "test text"; stored as gb2312 bytes
$str = decode("gb2312", $dat);
@chars = split //, $str;
foreach $char (@chars) {
    print encode("gb2312", $char), "\n";
}
Try it and look at the result; it should be satisfactory.

The program uses the decode and encode functions of the Encode module. To understand what these two functions do, a few concepts are needed:
1. A Perl string is encoded internally as utf8 and consists of Unicode characters rather than individual bytes. Each Unicode character encoded as UTF-8 occupies 1 to 4 bytes (variable length).
2. When data enters or leaves the Perl processing environment (for example, output to the screen, or reading and saving files), Perl strings are not used directly. The Perl string has to be converted to a byte stream, and the encoding used for the conversion is up to you (or up to Perl). Once a Perl string has been encoded into a byte stream, the notion of a character no longer exists; it is just a sequence of bytes, and how to interpret them is your own job.
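
The difference between a byte stream and a Perl string can be seen with a minimal sketch. It assumes the two characters 中文 as sample data, written as hex escapes so that the result does not depend on the encoding of the source file; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The gb2312 bytes for the two characters 中文 (sample data, an assumption
# of this sketch), written as hex escapes.
my $octets = "\xD6\xD0\xCE\xC4";

# Outside Perl the text is just four bytes.
print "bytes: ", length($octets), "\n";        # bytes: 4

# After decoding it is a Perl string of two characters.
my $string = decode('gb2312', $octets);
print "characters: ", length($string), "\n";   # characters: 2

# Encoding goes the other way and gives back the same four bytes.
my $again = encode('gb2312', $string);
print "round trip ok\n" if $again eq $octets;
```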

So if we want Perl to treat text according to our notion of characters, text data must always be held in Perl string form. However, the characters we write are normally stored as plain bytes (including string literals written directly in a program), that is, as a byte stream. This is where the encode and decode functions come in.

The encode function, as its name implies, encodes a Perl string. It encodes the characters of a Perl string in the specified encoding and produces a byte stream, so it is needed whenever data is handed to something outside the Perl processing environment. Its format is simple:
$octets = encode(ENCODING, $string [, CHECK])
$string is a Perl string, ENCODING is the name of the encoding to use, and $octets is the resulting byte stream. CHECK specifies how to handle malformed characters during the conversion (that is, characters Perl does not recognize). CHECK can usually be omitted, in which case Perl applies its default rules.
The available encodings depend on the language environment. By default Encode recognizes utf8, ascii, ascii-ctrl, iso-8859-1 and so on. The Chinese module (CN) adds euc-cn (equivalent to gb2312), cp936 (equivalent to GBK), hz and others; there are also Japanese (JP) and Korean (KR) modules, which are not all listed here.
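
As a rough illustration of the CHECK argument and of the encoding names, the sketch below encodes a string that ASCII cannot represent, first with the default rules and then with Encode::FB_CROAK, and looks up the canonical names behind gb2312 and gbk with find_encoding. The sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# A Perl string holding 中 (U+4E2D) followed by the letter A (sample data).
my $string = "\x{4E2D}A";

# Default CHECK: a character the target encoding lacks becomes "?".
my $lossy = encode('ascii', $string);
print "$lossy\n";                             # ?A

# With FB_CROAK the same conversion dies instead of substituting;
# LEAVE_SRC keeps $string itself unmodified.
my $strict = eval {
    encode('ascii', $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "strict encode failed\n" unless defined $strict;

# Aliases resolve to the canonical encoding names.
print find_encoding('gb2312')->name, "\n";    # euc-cn
print find_encoding('gbk')->name, "\n";       # cp936
```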

The decode function decodes a byte stream. It interprets the given byte stream according to the encoding you specify and converts it into a Perl string encoded as utf8. In general, text data obtained from a terminal or a file should be converted to a Perl string with decode. Its format is:
$string = decode(ENCODING, $octets [, CHECK])
$string, ENCODING, $octets and CHECK have the same meanings as above.
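
Putting decode and encode together converts text from one encoding to another. The following is a small sketch under the assumption that the input is the gb2312 byte sequence for 中文; it decodes the bytes to a Perl string and re-encodes them as UTF-8, then prints the resulting bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $gb_octets = "\xD6\xD0\xCE\xC4";            # 中文 in gb2312 (sample data)

my $string      = decode('gb2312', $gb_octets);   # bytes  -> characters
my $utf8_octets = encode('UTF-8', $string);       # characters -> bytes

# Show the UTF-8 bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_octets), "\n";
# E4 B8 AD E6 96 87
```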

Now the program above is easy to understand. The string literal is written directly in the source and is stored as a byte stream, so it has lost its meaning as characters. It must first be converted to a Perl string with decode; because Chinese text is usually encoded in gb2312, decode is told to use gb2312 here. After the conversion Perl sees characters the same way we do, and most string functions handle them correctly, except for functions that inherently treat a string as a pile of bytes (such as vec, pack and unpack). This is why split can cut the string into individual characters. Finally, because a utf8-encoded string cannot be used directly for output, each character is encoded back into a gb2312 byte stream with encode before it is printed.
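
Once the text is a Perl string, regular expressions also work on characters, which makes extracting or excluding Chinese characters straightforward. The sketch below is one possible approach rather than the article's own method: it uses the Unicode script property \p{Han} on a decoded string. The sample input is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Perl 中文 5.8" as gb2312 bytes (sample data).
my $octets = "Perl \xD6\xD0\xCE\xC4 5.8";
my $string = decode('gb2312', $octets);

# Extract every run of Chinese characters.
my @han = $string =~ /(\p{Han}+)/g;

# Exclude the Chinese characters and keep the rest.
(my $rest = $string) =~ s/\p{Han}+//g;

print scalar(@han), " run(s) of Chinese characters\n";
print length($han[0]), " characters in the first run\n";
print "remaining text: [$rest]\n";
```

Running the same pattern on the undecoded bytes would not work, because each Chinese character is then seen as two unrelated bytes.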

This is only an introduction to the Encode module; for details, see the module documentation. In fact, if the program is written in UltraEdit or another editor that can save files in utf8, the Encode module is largely unnecessary: just add a use utf8 statement at the beginning of the program. Perl then treats all characters in the program as Unicode characters. You can use any character in the Unicode range in the program, and even use non-English characters in identifiers, although the Encode module may still be needed for output. For example, edit the following program in UltraEdit (UE) with utf8 encoding:
use utf8;
$单价 = 10;   # unit price
$数量 = 100;   # quantity
$总价 = $单价 * $数量;   # total
print "$总价\n";
It runs normally on Perl 5.6 and later and prints the result :) The biggest advantage of this approach is that a single string can mix text in several languages, even Chinese, Japanese, English and Arabic all together. With the Encode module, by contrast, one fixed encoding has to be chosen: Chinese, Japanese and English can be used together because GBK contains all of those characters, but some non-Asian characters cannot be handled. Unicode is therefore the way forward.
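
The limitation of a fixed encoding can be checked directly. In this sketch (sample characters and variable names are assumptions), a string of Chinese, Japanese and English characters encodes to GBK without trouble, while adding one Arabic letter makes a strict GBK encode fail; UTF-8 accepts the full string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 中 (Chinese), あ (Japanese), A (English), written as code points.
my $cjk   = "\x{4E2D}\x{3042}A";
# The same string with the Arabic letter ain (U+0639) appended.
my $mixed = $cjk . "\x{0639}";

# Die on unmappable characters, and leave the source strings untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $gbk_cjk   = eval { encode('gbk',   $cjk,   $check) };
my $gbk_mixed = eval { encode('gbk',   $mixed, $check) };
my $utf8      = eval { encode('UTF-8', $mixed, $check) };

print "GBK, Chinese + Japanese + English: ", (defined $gbk_cjk   ? "ok" : "failed"), "\n";
print "GBK, with Arabic added:            ", (defined $gbk_mixed ? "ok" : "failed"), "\n";
print "UTF-8, with Arabic added:          ", (defined $utf8      ? "ok" : "failed"), "\n";
```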

I hope this will be helpful to you.

Posted by: chaoslawful (skeleton warrior), board: Perl
Subject: A supplement on Perl's handling of Chinese characters
Posted from: BBS Shuimu Tsinghua station (Thu Oct 30 09:45:23 2003)

Starting with Perl 5.8, besides converting Chinese text to utf8 characters with the Encode module, you can also use the use encoding pragma, which is more convenient. (The pragma was later deprecated, in Perl 5.18.) For example:

use encoding 'gbk';
$str = "你好";   # "hello"
print length($str) . "\n";
@chars = split //, $str;
print "@chars\n";

You can also specify different encodings for input and output in order to convert between them:
use encoding 'gbk', STDOUT => 'utf8';   # input is GBK, output is utf8
use encoding 'gbk', STDIN => 'utf8';    # input is utf8, output is GBK
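
The same kind of conversion between input and output encodings can also be done without the pragma, by attaching :encoding(...) layers to filehandles. This is a different technique from the one shown above, given here only as a sketch; in-memory filehandles stand in for real files, and the data and names are assumptions.

```perl
use strict;
use warnings;

# GBK bytes for 中文 plus a newline, standing in for the contents of a file.
my $gbk_input   = "\xD6\xD0\xCE\xC4\n";
my $utf8_output = '';

# Decode GBK while reading, encode UTF-8 while writing.
open my $in,  '<:encoding(gbk)',   \$gbk_input   or die "open input: $!";
open my $out, '>:encoding(UTF-8)', \$utf8_output or die "open output: $!";

while (my $line = <$in>) {
    # $line is a Perl string of characters at this point.
    print {$out} $line;
}

close $in;
close $out;

print length($utf8_output), " bytes written\n";
```

With real files, the scalar references would simply be replaced by file names.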

If you want to use Chinese characters in variable names, as the use utf8 pragma allows, you can write:

use encoding 'gbk', Filter => 1;
$变量 = 10;   # "variable"
print "$变量\n";

However, this reduces compatibility, so it is not recommended. For details, see perldoc encoding.

Sometimes the character encoding itself has to be taken into account. For example, suppose the text shows up as garbled characters, and two particular garbled strings stand for "non-medical personnel" and "medical personnel" respectively. The regular expression can then be written against the garbled string itself:

$_ =~ /…/;   # the pattern is the garbled form of "non-medical personnel"; matches documents containing it

In this case the default encoding of the environment is GBK, while my text file is UTF-8 without a BOM.

You can use .* in the pattern and print the match directly to check it.
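
An alternative to matching the garbled form is to decode the file contents as UTF-8 first and then match the real characters. The sketch below assumes that this source file is saved as UTF-8 (hence use utf8) and uses two made-up lines in place of the document being searched.

```perl
use strict;
use warnings;
use utf8;                          # this source file is saved as UTF-8
use Encode qw(decode encode);

# Stand-ins for lines read as raw bytes from a UTF-8 text file (made-up data).
my @raw_lines = map { encode('UTF-8', $_) }
    ("医务人员 12 名", "非医务人员 3 名");

my @hits;
for my $raw (@raw_lines) {
    my $line = decode('UTF-8', $raw);           # bytes -> characters
    push @hits, $line if $line =~ /非医务人员/;  # "non-medical personnel"
}

print scalar(@hits), " matching line(s)\n";
```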


This article comes from the CSDN blog; please credit the source when reposting:
