Lately I have been using simple_html_dom to fetch some articles. Websites in China are generally encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 being the most common.
My version of simple_html_dom has a convert_text method that looks like this:
    // PaperG - Function to convert the text from one character set to another if the two sets are not the same.
    function convert_text($text)
    {
        global $debug_object;
        if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}

        $converted_text = $text;

        $sourceCharset = "";
        $targetCharset = "";

        if ($this->dom)
        {
            $sourceCharset = strtoupper($this->dom->_charset);
            $targetCharset = strtoupper($this->dom->_target_charset);
        }
        if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}

        if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
        {
            // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
            if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
            {
                $converted_text = $text;
            }
            else
            {
                $converted_text = iconv($sourceCharset, $targetCharset, $text);
            }
        }

        // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
        if ($targetCharset == 'UTF-8')
        {
            if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
            {
                $converted_text = substr($converted_text, 3);
            }
            if (substr($converted_text, -3) == "\xef\xbb\xbf")
            {
                $converted_text = substr($converted_text, 0, -3);
            }
        }

        return $converted_text;
    }
Look at this line:
    $converted_text = iconv($sourceCharset, $targetCharset, $text);
This call transcodes the text incorrectly. For example, GB2312 text comes out like this:
April 26 in <span style="color: #C03"> Link 濋 槼 </span> Park Equestrian Arena held the 2014 Lang International Horse League competition in China, 24-Year-old Han Zhuang not only got 0 penalty points of the results ... The 7th appearance of <span style="color: #C03"> 鍖 椾 with </span> Olympic rider Zhao Wen first Harvest 0 penalty points, spent 77 seconds 07 ...
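Garbled characters of this kind typically appear when bytes are decoded under a charset other than the one they were written in. The post does not identify which charset the library detected, so the following is only a minimal sketch of the general effect, not a reproduction of this exact bug. It assumes the iconv extension is available; the sample bytes and the deliberately wrong ISO-8859-1 label are chosen purely for illustration.

```php
<?php
// The two characters "中文" encoded as GB2312/GBK (two bytes per character).
$gbBytes = "\xD6\xD0\xCE\xC4";

// Declaring the real source charset gives valid UTF-8 for the same characters.
$right = iconv('GBK', 'UTF-8', $gbBytes);

// Declaring a wrong source charset (ISO-8859-1 here, for illustration only)
// turns every single byte into its own unrelated character.
$wrong = iconv('ISO-8859-1', 'UTF-8', $gbBytes);

// Print the resulting bytes in hex so the difference is visible on any terminal.
echo bin2hex($right), "\n";
echo bin2hex($wrong), "\n";
```

The second result is longer than the first because each of the four input bytes was treated as a separate Latin-1 character, which is the same kind of damage seen in the sample above.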
The garbled output shows that the library's built-in transcoding does not handle this case well. Since I only use simple_html_dom to build the DOM, I am not going to spend time fixing the bug properly. Instead, I simply change
    $converted_text = iconv($sourceCharset, $targetCharset, $text);
to
    $converted_text = $text;
and that is all. The idea is to disable the library's transcoding, so the text passes through in its original encoding and there is nothing more to fuss over.
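With transcoding disabled, the text stays in whatever encoding the page used. If UTF-8 is needed later, one option is to convert the text outside the library. The sketch below assumes the mbstring extension is loaded. The function name `to_utf8` and the GB18030 fallback are my own assumptions, not part of simple_html_dom. GB18030 is used because it is a superset of GB2312 and GBK.

```php
<?php
/**
 * Hypothetical helper (not part of simple_html_dom): return $text as UTF-8.
 * Text that already validates as UTF-8 is returned unchanged; anything else
 * is assumed to be in $fallbackCharset and converted from it.
 */
function to_utf8(string $text, string $fallbackCharset = 'GB18030'): string
{
    // Valid UTF-8 is left alone, so already-correct pages are never re-encoded.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise treat the bytes as the fallback charset (an assumption that
    // suits sites serving GB2312 or GBK) and convert them to UTF-8.
    return mb_convert_encoding($text, 'UTF-8', $fallbackCharset);
}

// Usage: GB2312 bytes for "中文" are converted, UTF-8 input passes through.
$fromGb   = to_utf8("\xD6\xD0\xCE\xC4");
$fromUtf8 = to_utf8("\xE4\xB8\xAD\xE6\x96\x87");
echo bin2hex($fromGb), "\n";
echo bin2hex($fromUtf8), "\n";
```

The validity check is a heuristic: a very short GBK string can happen to be valid UTF-8 as well, so this works best on whole pages or paragraphs rather than isolated fragments.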