UTF-8 encoding in Perl

Source: Internet
Author: User

To make the discussion concrete, consider the following task: remove all non-Chinese characters from an HTML page.

By the way, here is a trick. Once the text has been correctly decoded and is interpreted by Perl as characters, \w matches letters, digits, _, and Chinese characters. This is very convenient: we only need the following two regular expressions to remove all non-Chinese characters, including full-width punctuation marks:
$str =~ s/[^\w]//g;
$str =~ s/[0-9a-zA-Z_]//g;
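As a minimal sketch of the trick above, the following applies the two substitutions to a short string after decoding it. The sample bytes and the helper name keep_chinese are assumptions made for illustration only; they do not come from the original test program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample input: raw UTF-8 bytes of an HTML fragment containing
# "Hello, ", U+4E16 U+754C, a full-width exclamation mark (U+FF01) and " 123_abc".
my $bytes = "<p>Hello, \xE4\xB8\x96\xE7\x95\x8C\xEF\xBC\x81 123_abc</p>";

# keep_chinese is an illustrative helper, not part of the original program.
sub keep_chinese {
    my ($octets) = @_;
    my $str = decode('UTF-8', $octets);   # bytes -> Perl character string
    $str =~ s/[^\w]//g;                   # drop everything that is not a word character
    $str =~ s/[0-9a-zA-Z_]//g;            # drop ASCII letters, digits and underscore
    return $str;
}

my $result = keep_chinese($bytes);

# Encode again before printing so that only bytes reach STDOUT.
print encode('UTF-8', $result), "\n";
```

Without the decode step the string is just a sequence of bytes, and \w is applied to each byte separately instead of to whole characters, so the two substitutions no longer isolate the Chinese text.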
The problem is how to make Perl understand our text correctly. First, here is our test program:
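#!/usr/bin/perl

use strict;
use Encode;
use open IN => ":raw", OUT => ":raw";

my $arg = $ARGV[0];

# aaa: let the :utf8 I/O layer decode while reading
sub aaa {
    open FH, "testutf8" or die "aaa: $!\n";
    local $/ = undef;
    binmode FH, ":utf8";
    my $str = <FH>;
    return $str;
}

# bbb: read raw bytes, then convert with the pack/unpack trick
sub bbb {
    open FH, "testutf8" or die "bbb: $!\n";
    local $/ = undef;
    binmode FH, ":raw";
    my $str = <FH>;
    $str = pack "U0C*", unpack "C*", $str;
    return $str;
}

# ccc: read raw bytes, then call Encode::decode_utf8
sub ccc {
    open FH, "testutf8" or die "ccc: $!\n";
    local $/ = undef;
    binmode FH, ":raw";
    my $str = <FH>;
    $str = Encode::decode_utf8($str);
    return $str;
}

# ddd: read raw bytes, then call decode with an explicit encoding name
sub ddd {
    open FH, "testutf8" or die "ddd: $!\n";
    local $/ = undef;
    binmode FH, ":raw";
    my $str = <FH>;
    $str = decode('utf-8', $str);
    return $str;
}

# eee: convert GBK bytes to UTF-8 bytes with from_to, then decode_utf8
sub eee {
    open FH, "testgb" or die "eee: $!\n";
    local $/ = undef;
    binmode FH, ":raw";
    my $str = <FH>;
    Encode::from_to($str, 'gbk', 'utf-8');
    $str = Encode::decode_utf8($str);
    return $str;
}

# fff: let an external iconv process do the GBK to UTF-8 conversion
sub fff {
    my $str = `iconv -f gbk -t utf-8 testgb`;
    $str = Encode::decode_utf8($str);
    return $str;
}

# Look up the sub named on the command line (\&{"name"} is allowed under strict)
my $f = \&{"main::$arg"};

# Run the chosen method 200 times so that the differences become measurable
for (my $i = 0; $i < 200; $i++) {
    my $str = &$f();
    #print "$i ", (length $str) . "\n";
}

my $str;
$str = &$f();
$str =~ s/[^\w]//g;
$str =~ s/[0-9a-zA-Z_]//g;
print $str;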

There are six methods, aaa through fff. Methods aaa to ddd read and decode a text file named testutf8, while eee and fff read the text file testgb.

When running the program, pass the method name as the first argument, like this: ./test.pl aaa. You can put time in front of the command to measure the time spent. To keep Perl from interfering, the program begins with use open IN => ":raw", OUT => ":raw", so that by default input and output are not interpreted.
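Selecting a method by the name given on the command line can also be sketched with a dispatch table instead of a symbol lookup. This is only an illustrative alternative: the hash %methods, the helper pick_method, and the stub bodies are assumptions, not the original subs.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real methods; each just reports its name.
my %methods = (
    aaa => sub { return "aaa called" },
    bbb => sub { return "bbb called" },
);

# pick_method is an illustrative helper: map a name to a code reference,
# and die with a clear message when the name is unknown.
sub pick_method {
    my ($name) = @_;
    $name = '' unless defined $name;
    my $code = $methods{$name}
        or die "unknown method '$name'\n";
    return $code;
}

# Fall back to aaa when no argument is given (an assumption for this sketch).
my $f = pick_method(defined $ARGV[0] ? $ARGV[0] : 'aaa');
print $f->(), "\n";
```

The advantage of a table is that a mistyped method name fails immediately with a readable error instead of a call to an undefined subroutine.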

All six methods were tested and produce the correct result, but their running times differ under my Perl 5.8.0, as shown below:

aaa 0m0.376s
bbb 0m5.263s
ccc 0m0.432s
ddd 0m2.668s
eee 0m0.784s
fff 0m1.358s

From the results, we can see that method bbb is the slowest. It uses the pack/unpack trick promoted in some articles, which is not only obscure in syntax but also extremely inefficient.
Methods ddd and ccc both use the Encode module and do the same thing, yet ddd is far slower. Evidently the Encode module still has shortcomings: the decode function is much slower than decode_utf8.
Method aaa is nothing special, yet it ranks first, which shows that Perl's low-level UTF-8 support is acceptable.
As another example of Perl's low-level performance, I changed ":utf8" to ":encoding(UTF-8)" and tested again. It took 0m1.650s, more than four times as long as with ":utf8".
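The statement that these approaches do the same thing can be checked with a small sketch like the one below. It is not the original test program: the sample bytes and the helper names via_layer, via_decode_utf8 and via_decode are assumptions, and an in-memory filehandle stands in for the testutf8 file. The pack/unpack variant is left out on purpose.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# Hypothetical sample: the UTF-8 bytes of U+4E2D U+6587 followed by "abc".
my $bytes = "\xE4\xB8\xAD\xE6\x96\x87abc";

# Like aaa: decode through an I/O layer while reading.
# The scalar reference opens an in-memory handle instead of a real file.
sub via_layer {
    my ($octets, $layer) = @_;
    open my $fh, "<$layer", \$octets or die "open: $!";
    local $/ = undef;            # slurp mode, as in the test program
    my $str = <$fh>;
    close $fh;
    return $str;
}

# Like ccc: decode_utf8 on raw bytes.
sub via_decode_utf8 { my ($octets) = @_; return decode_utf8($octets); }

# Like ddd: decode with an explicit encoding name.
sub via_decode { my ($octets) = @_; return decode('utf-8', $octets); }

my %results = (
    ':utf8 layer'            => via_layer($bytes, ':utf8'),
    ':encoding(UTF-8) layer' => via_layer($bytes, ':encoding(UTF-8)'),
    'decode_utf8'            => via_decode_utf8($bytes),
    "decode('utf-8')"        => via_decode($bytes),
);

# Every variant should report the same number of characters.
for my $name (sort keys %results) {
    printf "%-24s length=%d\n", $name, length $results{$name};
}
```

Since all variants yield the same character string for valid input, the differences in the table are purely a matter of speed. One known difference in behavior is that the ":utf8" layer does not validate its input, whereas ":encoding(UTF-8)" does, which is one plausible reason for part of the extra cost.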

As a further experiment, consider the GBK-encoded example. Method eee takes a roundabout route: it first converts the text to UTF-8 with from_to, then calls decode_utf8 directly, which avoids calling decode. Method fff is similar, except that it calls an external iconv process for the conversion.
From this we can see that creating a process on Linux is very cheap: fff is in fact faster than bbb. In other words, careless programming in Perl can end up slower than repeatedly calling an external program.

To check whether decode is really that slow, change the eee method to call decode directly, as follows:
$str = Encode::decode('gbk', $str);
The time is now 0m0.727s, which is even faster than the roundabout method.
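The two GBK routes discussed here can be compared side by side in a short sketch. The sample bytes are an assumption chosen for illustration (the GBK encoding of U+4E2D U+6587); the real test reads the testgb file instead.

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8 from_to);

# Hypothetical sample: GBK bytes for the two characters U+4E2D U+6587.
my $gbk = "\xD6\xD0\xCE\xC4";

# Roundabout route (as in eee): GBK bytes -> UTF-8 bytes -> character string.
my $tmp = $gbk;                    # from_to converts in place, so work on a copy
from_to($tmp, 'gbk', 'utf-8');
my $roundabout = decode_utf8($tmp);

# Direct route: GBK bytes -> character string in one step.
my $copy   = $gbk;
my $direct = decode('gbk', $copy);

print $roundabout eq $direct ? "same result\n" : "different result\n";
printf "U+%04X\n", ord $_ for split //, $direct;
```

Both routes end with the same character string, so the only difference between them is how much work is done along the way.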

How can we explain this strange phenomenon? I am too lazy to study the implementation of the Encode module, so let's just speculate:

GBK decoding is much simpler than UTF-8 decoding. Therefore decode is fast when the argument is GBK and slow when the argument is UTF-8.
Then why is calling decode_utf8 directly so fast (faster even than decode with the GBK argument)? Presumably Perl has optimized its built-in UTF-8 handling, so its efficiency is not low.
