How to convert GB2312 to UTF-8:
use Encode;
my $str = "Chinese";    # placeholder for GB2312-encoded Chinese text
my $str_cnsoftware = encode("UTF-8", decode("gb2312", $str));
UTF-8 to GB2312:
use Encode;
my $str = "utf8 Chinese";    # placeholder for UTF-8-encoded Chinese text
my $str_cnsoftware = encode("gb2312", decode("UTF-8", $str));
Or use from_to, which converts the octets in place:
use Encode;
Encode::from_to($octets, "iso-8859-1", "cp1250");
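As a rough, self-contained sketch of the two approaches above, the following script converts the same GB2312 bytes to UTF-8 both ways. The variable names and the sample bytes (the GB2312 encoding of U+4E2D U+6587) are my own choices for illustration; byte escapes are used so that the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# GB2312 bytes for the two characters U+4E2D U+6587 (sample data, an assumption).
my $gb_bytes = "\xD6\xD0\xCE\xC4";

# Two-step conversion: bytes -> Perl characters -> bytes in the target encoding.
my $utf8_bytes = encode("UTF-8", decode("gb2312", $gb_bytes));

# One-step conversion with from_to, which rewrites its first argument in place
# and returns the new length in bytes (undef on failure).
my $copy = $gb_bytes;
my $len  = from_to($copy, "gb2312", "UTF-8");

# Show the resulting bytes in hexadecimal.
print join(" ", map { sprintf "%02X", ord } split //, $utf8_bytes), "\n";
print "from_to produced $len bytes\n";
```

Both paths should yield the same six bytes, E4 B8 AD E6 96 87. Note that from_to modifies its argument, which is why the sketch works on a copy.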
This article focuses on converting between different encodings in Perl. Perl has used UTF-8 internally to represent characters since version 5.6, so Chinese and other non-ASCII text can be processed correctly once it has been decoded into Perl strings.
Transformation between different Perl encodings
Since version 5.6, Perl has used UTF-8 internally to represent characters, so in principle there should be no problem processing Chinese or other non-ASCII characters. We only need to make good use of the Encode module to take full advantage of Perl's UTF-8 strings.
The following shows how to process Chinese text. For example, to split a Chinese string into individual characters, you can write:
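use Encode;
my $dat = "test text";                # placeholder for GB2312-encoded Chinese text
my $str = decode("gb2312", $dat);     # bytes -> Perl string
my @chars = split //, $str;           # one element per character
foreach my $char (@chars) {
    print encode("gb2312", $char), "\n";    # Perl string -> bytes for output
}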
Try it and you will see the result: each character is printed on its own line, as intended.
The example uses the decode and encode functions of the Encode module. To understand what these two functions do, we need a few concepts:
1. A Perl string is stored as UTF-8 and consists of Unicode characters rather than individual bytes. Each UTF-8 encoded Unicode character occupies 1 to 4 bytes (variable length).
2. When data enters or leaves the Perl processing environment (for example, when it is printed to the screen or read from and written to files), the Perl string is not used directly. It has to be converted to a byte stream, and the encoding used for that conversion is up to you (or up to Perl). Once a Perl string has been encoded into a byte stream, the notion of a character no longer exists; what remains is a plain sequence of bytes, and how to interpret it is your own responsibility.
It follows that if we want Perl to treat text according to our notion of characters, the text must be held as Perl strings throughout. However, the text we normally deal with (including string literals written directly in the program) arrives as plain bytes, that is, as a byte stream. This is where the encode and decode functions come in.
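The difference between bytes and characters can be seen with a small sketch like the one below. The sample bytes and variable names are assumptions chosen for illustration; the same four GB2312 bytes are counted before and after decoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Four GB2312 bytes that represent two characters (U+4E2D U+6587).
my $octets = "\xD6\xD0\xCE\xC4";

# After decoding, Perl sees characters instead of bytes.
my $string = decode("gb2312", $octets);

my $byte_count = length $octets;                     # counts bytes
my $char_count = length $string;                     # counts characters
my $utf8_count = length encode("UTF-8", $string);    # bytes again, now as UTF-8

print "$byte_count bytes, $char_count characters, $utf8_count bytes as UTF-8\n";
```

This should print "4 bytes, 2 characters, 6 bytes as UTF-8": the character count only makes sense for the decoded string, while the byte count depends on which encoding is used.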
◆The encode function, as its name implies, encodes a Perl string.
It encodes the characters of a Perl string in the specified encoding and produces a byte stream, so it is needed whenever data leaves the Perl processing environment. The format is simple:
$octets = encode(ENCODING, $string [, CHECK])
$string: the Perl string
ENCODING: the encoding to use
$octets: the resulting byte stream
CHECK: how to handle malformed characters during conversion (that is, characters Perl cannot map to the target encoding). It can usually be omitted.
The set of available encodings varies considerably with the environment; by default utf8, ascii, ascii-ctrl, iso-8859-1 and a few others are recognized.
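To make the CHECK argument and the list of encodings more concrete, here is a minimal sketch. It assumes the FB_CROAK and LEAVE_SRC constants exported on request by Encode; the sample string and variable names are mine. The sketch encodes a string containing a character that ASCII cannot represent, first with the default behaviour and then with FB_CROAK.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# "A" followed by U+4E2D, which has no ASCII representation.
my $string = "A\x{4E2D}";

# Default CHECK: the unmappable character is replaced by a substitution
# character ("?" for ascii).
my $lossy = encode("ascii", $string);

# FB_CROAK makes encode die instead; LEAVE_SRC keeps $string unmodified.
my $strict = eval { encode("ascii", $string, FB_CROAK | LEAVE_SRC) };
my $failed = defined $strict ? 0 : 1;

# Encodings loaded so far; Encode->encodings(":all") lists everything available.
my @loaded = Encode->encodings();

print "lossy result: $lossy\n";
print "strict encode ", ($failed ? "failed" : "succeeded"), "\n";
print "loaded encodings: @loaded\n";
```

The exact list printed at the end depends on the installation, which is the point made above about the environment.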
◆The decode function decodes a byte stream.
It interprets the given byte stream according to the encoding you specify and converts it into a Perl string (stored internally as UTF-8). In general, text data obtained from a terminal or a file should be converted to a Perl string with decode. The format is:
$string = decode(ENCODING, $octets [, CHECK])
$string, ENCODING, $octets and CHECK have the same meanings as above.
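The CHECK argument works the same way for decode. As a hedged sketch (the input bytes and variable names are assumptions), the following decodes a byte stream that is not valid UTF-8, once leniently and once with FB_CROAK:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# "abc" followed by a byte that can never appear in valid UTF-8.
my $bad = "abc\xFF";

# Default CHECK: the malformed byte becomes U+FFFD, the replacement character.
my $lenient = decode("UTF-8", $bad);

# FB_CROAK: decode dies on the malformed byte; eval catches the error.
my $ok = eval { decode("UTF-8", $bad, FB_CROAK | LEAVE_SRC); 1 };

printf "lenient result has %d characters, last is U+%04X\n",
    length $lenient, ord substr($lenient, -1);
print "strict decode ", ($ok ? "succeeded" : "failed"), "\n";
```

Strict checking is useful when you would rather stop than silently carry replacement characters into the output.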
Now the program above is easy to understand. The string is written as plain text and stored as a byte stream, so Perl does not know which bytes form which characters. It must first be converted to a Perl string with decode, and since Chinese text is commonly encoded in GB2312, decode is told to use gb2312. After the conversion Perl treats the characters the same way we do: functions that operate on strings generally handle the characters correctly, except for those that inherently treat a string as a sequence of bytes (such as vec, pack and unpack). That is why split can cut the string into individual characters. Finally, because a Perl string should not be written out directly, encode is used to turn each character back into a GB2312 byte stream before it is printed.
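The same point applies to other string functions. The sketch below (sample bytes and names are my own, not from the program above) applies substr and reverse to the decoded string and to the raw bytes, and prints the results in hexadecimal for comparison:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# GB2312 bytes for U+4E2D U+6587 (two characters, four bytes).
my $octets = "\xD6\xD0\xCE\xC4";
my $string = decode("gb2312", $octets);

# On the decoded string, substr and reverse operate on whole characters.
my $first_char    = encode("gb2312", substr($string, 0, 1));
my $reversed_text = encode("gb2312", scalar reverse($string));

# On the raw bytes, the same calls cut through the middle of a character.
my $first_byte     = substr($octets, 0, 1);
my $reversed_bytes = reverse $octets;

# Helper for display only: render a byte string as hexadecimal.
sub hex_of { return join " ", map { sprintf "%02X", ord } split //, $_[0] }

print "first character: ", hex_of($first_char),     "\n";    # D6 D0
print "reversed text:   ", hex_of($reversed_text),  "\n";    # CE C4 D6 D0
print "first byte:      ", hex_of($first_byte),     "\n";    # D6
print "reversed bytes:  ", hex_of($reversed_bytes), "\n";    # C4 CE D0 D6
```

Only the first two results are still valid GB2312 text; the last two are fragments that no longer correspond to the original characters.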
In addition, you can use the following method to "guess" the encoding of a string. I have tried it, though, and it does not always work.
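use Encode qw/from_to/;
use Encode::Guess qw/euc-jp shiftjis/;
open(INFILE, "...") or die "Cannot open input file: $!";    # the input file is Shift-JIS encoded; only the first line is converted here as a test
my $to  = "utf8";    # target encoding; an assumption, the original does not define $to
my $str = <INFILE>;
my $enc = guess_encoding($str);
my $from;
if (ref $enc) {
    $from = $enc->name;     # the guess succeeded: $enc is an encoding object
} else {
    $from = "shiftjis";     # the guess failed: fall back to Shift-JIS
}

from_to($str, $from, $to);
print STDOUT "The test string is: $str";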
When a line of the input file begins with a Japanese character, the encoding is correctly identified as Shift-JIS. However, when the data begins with a comma or other such characters, the guess fails. The reason is not yet clear.
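One way to investigate such failures is to look at what guess_encoding actually returns: an encoding object when the guess succeeds, and otherwise a plain string describing the problem. The sketch below is only a diagnostic aid under that assumption. The helper describe_guess and both sample byte strings are hypothetical, and it does not claim to explain the comma case above.

```perl
use strict;
use warnings;
use Encode::Guess qw/euc-jp shiftjis/;

# describe_guess is a hypothetical helper: it returns the encoding name when
# the guess succeeds, or the diagnostic text when it does not.
sub describe_guess {
    my ($octets) = @_;
    my $enc = guess_encoding($octets);
    return ref $enc ? $enc->name : "guess failed: $enc";
}

# Shift-JIS bytes for U+65E5 U+672C U+8A9E; they are not valid EUC-JP or UTF-8.
my $sjis_sample = "\x93\xFA\x96\x7B\x8C\xEA";

# Two half-width katakana bytes in Shift-JIS. As far as I can tell the same
# bytes can also be read as one EUC-JP character, so the guess may be ambiguous.
my $unclear_sample = "\xB1\xB2";

print describe_guess($sjis_sample),    "\n";
print describe_guess($unclear_sample), "\n";
```

Printing the diagnostic text for the lines that fail in the real file should at least show whether the guess was ambiguous between several candidates or whether no candidate matched at all.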