In Perl, a string takes one of only two forms:
(1) Byte-stream string: the string is treated as a plain sequence of single bytes, regardless of its content or encoding.
(2) Character-stream string: the bytes are parsed from left to right, according to the UTF-8 encoding scheme, into a continuous sequence of characters.
How does Perl determine whether a string is a byte stream string or a character stream string encoded with UTF8?
Internally, every Perl string carries a UTF8 flag that is either on or off. If the flag is on, the string is a character-stream string; otherwise it is a byte-stream string. By default the flag is off, which means the string is handled as a byte-stream string.
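As a minimal sketch of the flag described above (the variable names are illustrative, and the byte values are an assumed sample, namely the UTF-8 encoding of U+4F60), the core function utf8::is_utf8() reports the state of the flag, and length() shows how the same data is counted in each form:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three UTF-8 bytes that encode the single character U+4F60.
my $bytes = "\xE4\xBD\xA0";

# decode() builds a character-stream string from those bytes.
my $chars = decode('UTF-8', $bytes);

# The byte-stream string has the flag off and is counted in bytes;
# the character-stream string has the flag on and is counted in characters.
printf "bytes: flag=%s length=%d\n",
    (utf8::is_utf8($bytes) ? 'on' : 'off'), length $bytes;
printf "chars: flag=%s length=%d\n",
    (utf8::is_utf8($chars) ? 'on' : 'off'), length $chars;
```

Here the first line should report the flag as off with a length of 3, and the second the flag as on with a length of 1.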
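To turn the UTF8 flag of a string on or off, use the functions Encode::_utf8_on() and Encode::_utf8_off() of the Encode module. Note that the Encode documentation marks both as internal: _utf8_on() does not check that the bytes are well-formed UTF-8, so Encode::decode() is the safer choice when the input is not trusted.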
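The effect of the two functions can be sketched as follows. The sample bytes (the UTF-8 encoding of U+4F60 U+597D) and the variable names are assumptions chosen for illustration; the underlying bytes never change, only how Perl interprets them:

```perl
use strict;
use warnings;
use Encode ();

my $s = "\xE4\xBD\xA0\xE5\xA5\xBD";   # six UTF-8 bytes, two characters

my $before = length $s;               # flag off: counted as 6 bytes

Encode::_utf8_on($s);                 # same bytes, now read as UTF-8 characters
my $on = length $s;                   # flag on: counted as 2 characters

Encode::_utf8_off($s);                # back to a byte-stream string
my $off = length $s;                  # flag off again: 6 bytes

print "$before $on $off\n";
```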
If the contents of a variable are read from a file and the file is UTF-8 encoded, the resulting string, although it contains UTF-8 encoded characters, is a byte-stream string by default. To convert it to a character-stream string, you again need to call Encode::_utf8_on() to turn on the flag so that the string is handled as a UTF-8 character stream.
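As a hedged alternative to flipping the flag by hand, the sketch below reads the same kind of file in two other ways: decoding each line with Encode::decode(), and opening the file with an ":encoding(UTF-8)" I/O layer. The temporary file is a stand-in for a UTF-8 input file, and its contents are an assumed sample, not the article's Input.txt:

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Write a small UTF-8 sample file (a hypothetical stand-in for the input).
my ($fh, $path) = tempfile();
binmode $fh;
print {$fh} "\xE4\xBD\xA0\xE5\xA5\xBD\n";   # UTF-8 bytes for U+4F60 U+597D
close $fh;

# Option 1: read raw bytes, then decode the line explicitly.
open my $raw, '<', $path or die "Cannot open $path: $!";
my $line = <$raw>;
close $raw;
chomp $line;
my $decoded = decode('UTF-8', $line);

# Option 2: let the I/O layer decode while reading.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $layered = <$in>;
close $in;
chomp $layered;

# The raw line is counted in bytes, the other two in characters.
print length($line), " ", length($decoded), " ", length($layered), "\n";
unlink $path;
```

Both options validate the input as UTF-8 while converting it, which setting the flag directly does not do.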
File Input.txt encoded in UTF8:
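use Encode;
use utf8;    # make string constants in this source file that contain Chinese characters character-stream strings

open(IN, '<', 'Input.txt') or die "Cannot open Input.txt: $!";
while (my $line = <IN>) {
    chomp($line);
    Encode::_utf8_on($line);    # mark the UTF-8 bytes read from the file as a character-stream string
    if ($line =~ /^[你我他]/) {
        print encode("gb2312", $line) . "\n";
    }
}
close(IN);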
Output Result:
Note: The source code itself is also saved in UTF-8, so the Chinese characters in the regular expression are UTF-8 encoded. For these characters to be handled as character-stream constants, the "use utf8" pragma must be added at the top of the file. It causes string constants in the source code that contain Chinese characters to be uniformly converted into character-stream strings.
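A small sketch of what the pragma changes, assuming this script is itself saved as UTF-8 (the variable names and the test string are illustrative only). The same literal is written once under "use utf8" and once inside a "no utf8" block:

```perl
use strict;
use warnings;
use utf8;                 # literals below are parsed as UTF-8 characters

my $chars = "你我他";     # three characters

my $bytes;
{
    no utf8;              # inside this block the same literal stays as raw bytes
    $bytes = "你我他";    # nine bytes
}

# With the pragma on, the character class holds three characters,
# so a character-stream string starting with one of them matches.
my $class_ok = "他们" =~ /^[你我他]/ ? 1 : 0;

print length($chars), " ", length($bytes), " ", $class_ok, "\n";
```

Without the pragma, the character class would be built from the individual bytes of the three characters, which is why the pattern and the data must both be in character-stream form.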