A transcoding bug in simple_html_dom, the PHP HTML parsing library

Source: Internet
Author: User
This article describes a transcoding bug in simple_html_dom, a PHP class library for parsing HTML. Over the past few days I have been using simple_html_dom to process some articles. Different websites use different encodings; in China they are basically GBK, GB2312, or UTF-8, and most are GB2312 or UTF-8.

In my current version of simple_html_dom, the method that handles transcoding is convert_text.

The code is as follows:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}
    $converted_text = $text;
    $sourceCharset = "";
    $targetCharset = "";
    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }
    // Lets make sure that we don't have that silly BOM issue with any of the UTF-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }
    return $converted_text;
}

Let's look at this line:

The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);

The transcoding done here is incorrect. For example, this is what a GB2312 text looks like after conversion:

The code is as follows:
The 24-year-old Han Zhuangzhuang not only scored a zero penalty score in the April 26 Longines International Federation of Marathon World Cup Chinese league qualifying tournament held at the Maraton Park on April 9, 2014... the first time Zhao Zhiwen, the first Olympic contestant, received a zero penalty score, it took 77 seconds to 07...

This shows that the transcoding function does not handle the text properly. Since I only want to use simple_html_dom to build the DOM, I am not going to spend time fixing the bug itself. Instead, I simply change

The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);

to

The code is as follows:
$converted_text = $text;

That's all. The idea is to cancel transcoding inside the library. OK, enough of a detour; back to work.
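With transcoding cancelled inside the library, the text that comes out of the DOM stays in whatever encoding the page used, so any conversion has to happen in the calling code. The following is only a minimal sketch of one way to do that, not part of simple_html_dom: the helper name to_utf8 and its parameters are hypothetical, it relies on the mbstring extension rather than iconv, and it assumes that pages labelled GB2312 may contain GBK characters (the WHATWG Encoding Standard treats the gb2312 label as GBK). Whether that mismatch is what breaks convert_text in the case above has not been verified here.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): convert raw page bytes
// to UTF-8 before they are handed to the parser.
function to_utf8(string $html, string $declaredCharset): string
{
    // Drop a leading UTF-8 BOM, as convert_text also does.
    if (substr($html, 0, 3) === "\xef\xbb\xbf") {
        $html = substr($html, 3);
    }

    // If the bytes are already valid UTF-8, trust them over the declared charset.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    $charset = strtoupper(trim($declaredCharset));
    if ($charset === '') {
        // Nothing declared and not UTF-8: leave the bytes alone rather than guess.
        return $html;
    }

    // Assumption: decode GB2312-labelled pages with the GBK superset (CP936 in mbstring).
    if ($charset === 'GB2312' || $charset === 'GBK') {
        $charset = 'CP936';
    }

    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

The result could then be passed to the parser, for example str_get_html(to_utf8($raw, 'gb2312')), where $raw stands for the downloaded page bytes.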
