A transcoding bug in simple_html_dom, the PHP HTML parsing library

For the past few days I have been using simple_html_dom to scrape some articles. Websites in China are mostly encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 being the most common.

My version of simple_html_dom has a convert_text method that looks like this:

// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}

    $converted_text = $text;

    $sourceCharset = "";
    $targetCharset = "";

    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charset: " . $targetCharset);}

    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }

    // Let's make sure that we don't have that silly BOM issue with any of the UTF-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }

    return $converted_text;
}

Look at this line:

$converted_text = iconv($sourceCharset, $targetCharset, $text);

This call can produce incorrectly transcoded text. For example, GB2312 text comes out like this:

April 26 at the chain à at Park Equestrian Stadium, the 2014 Longines International Ma Lian Venue Obstacle World Cup China League qualifying tournament, 24-year-old Han strong not only to get 0 penalty scores ... 7th åœ 椾 including Olympic rider Zhao Wen first Harvest 0 penalty points, spents 77 seconds 07 ...
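One common way this kind of garbling arises is that the source charset handed to iconv does not match the actual bytes. The following is only a minimal sketch of that effect: the wrong charset used here (ISO-8859-1) is an assumption for illustration, since the article does not say which charset simple_html_dom detected, and the sample string is made up.

```php
<?php
// "中文" encoded as GB2312, written as raw bytes so the example does not
// depend on the encoding of this source file.
$gb2312 = "\xD6\xD0\xCE\xC4";

// Hypothetical wrong guess: the bytes are treated as ISO-8859-1.
// Each byte is mapped to its own Latin-1 character, giving "ÖÐÎÄ".
$garbled = iconv('ISO-8859-1', 'UTF-8', $gb2312);

// Correct source charset: the two Chinese characters survive.
$correct = iconv('GB2312', 'UTF-8', $gb2312);

echo bin2hex($garbled), "\n"; // c396c390c38ec384
echo bin2hex($correct), "\n"; // e4b8ade69687
```

iconv reports no error in the first case, because every byte is a valid ISO-8859-1 character. That is why a wrongly detected charset silently yields unreadable text rather than a failure.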

This shows that the transcoding function does not handle the conversion correctly. Because I only use simple_html_dom to build the DOM, I did not want to spend time fixing the bug properly. Instead, I simply took this line:

$converted_text = iconv($sourceCharset, $targetCharset, $text);

and changed it to:

$converted_text = $text;

That is all it takes. The idea is to disable the library's transcoding altogether, so there is no need to wrestle with it any further.
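With the library's transcoding disabled, the text comes back in whatever encoding the page used. If UTF-8 output is needed, one option is to normalize the HTML yourself before building the DOM. The sketch below is not part of simple_html_dom: the helper name to_utf8 and the GB18030 fallback are assumptions, chosen because GB18030 is a superset of GBK and GB2312. It requires the mbstring extension.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): return $html as UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallbackCharset.
function to_utf8(string $html, string $fallbackCharset = 'GB18030'): string
{
    // Bytes that already form valid UTF-8 are returned unchanged.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    // Otherwise decode with the fallback charset and re-encode as UTF-8.
    return mb_convert_encoding($html, 'UTF-8', $fallbackCharset);
}

// "中文" as GB2312 bytes and as UTF-8 bytes.
$fromGb   = to_utf8("\xD6\xD0\xCE\xC4");
$fromUtf8 = to_utf8("\xE4\xB8\xAD\xE6\x96\x87");

echo bin2hex($fromGb), "\n";   // e4b8ade69687
echo bin2hex($fromUtf8), "\n"; // e4b8ade69687
```

The validity check is a heuristic. Short GBK strings can occasionally also be valid UTF-8, so for real pages it is safer to prefer the charset declared in the HTTP header or meta tag when one is available.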
