For the past few days I have been using simple_html_dom to scrape some articles. Chinese websites are basically encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 making up the majority.
My version of simple_html_dom has a convert_text method that looks like this:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}
    $converted_text = $text;
    $sourceCharset = "";
    $targetCharset = "";
    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }
    // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }
    return $converted_text;
}
Look at this line:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
It causes incorrect transcoding. For example, GB2312 text ends up converted to something like this:
April 26 at the chain à at Park Equestrian Stadium, the 2014 Longines International Ma Lian Venue Obstacle World Cup China League qualifying tournament, 24-year-old Han strong not only to get 0 penalty scores ... 7th åœ 椾 including Olympic rider Zhao Wen first Harvest 0 penalty points, spents 77 seconds 07 ...
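Garbled fragments like the "åœ" in the sample are typical of UTF-8 bytes that were decoded a second time as a single-byte charset. The snippet below only illustrates that general effect, assuming the mbstring extension is available. It is not a confirmed diagnosis of what simple_html_dom did to this particular page.

```php
<?php
// Illustration only: decode UTF-8 bytes as if they were Windows-1252.
// Assumes the mbstring extension is loaded (not something the article states).

$original = "\u{5728}"; // the Chinese character U+5728, UTF-8 bytes E5 9C A8

// Each of the three UTF-8 bytes is read as one Windows-1252 character
// and then re-encoded as UTF-8, so one character turns into three.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

echo $garbled, "\n";
echo strlen($original), " bytes before, ", strlen($garbled), " bytes after\n";
```

Once text has been through a wrong conversion like this, the stored bytes are valid UTF-8 again, which is why the damage is easy to miss until the page is displayed.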
This shows that the transcoding function does not handle the text properly. Since I only use simple_html_dom to build the DOM, I am not going to spend time fixing this bug properly. Instead, I simply changed
$converted_text = iconv($sourceCharset, $targetCharset, $text);
to
$converted_text = $text;
and that was it. The idea is to disable its transcoding altogether, so the work no longer gets stuck on this.
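With the conversion turned off, the parsed text stays in whatever encoding the page used. If UTF-8 output is still needed, one option is to convert the raw HTML yourself before building the DOM. The following is a minimal sketch, not part of simple_html_dom: the function name to_utf8 and its parameters are hypothetical, it assumes the mbstring extension, and it treats pages declared as GB2312 as GBK (code page 936), since GBK is a superset of GB2312.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
// Assumes the mbstring extension is loaded.
function to_utf8(string $html, string $declaredCharset): string
{
    // If the bytes are already valid UTF-8, leave them alone.
    // This is a heuristic: a few GBK byte sequences are also valid UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    $charset = strtoupper(trim($declaredCharset));

    // Assumption: decode GB2312 pages as code page 936 (mbstring's name
    // for GBK), because many such pages contain characters outside GB2312.
    if ($charset === 'GB2312' || $charset === 'GBK') {
        $charset = 'CP936';
    }

    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// Usage: build a GBK-encoded sample, then convert it back to UTF-8.
$gbk = mb_convert_encoding("中文", 'CP936', 'UTF-8');
echo to_utf8($gbk, 'gb2312'), "\n";
```

The converted string could then be handed to str_get_html(). Note that mb_convert_encoding() rejects charset names it does not know, so a real scraper would need to handle that case.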