This article describes a transcoding bug in simple_html_dom, the PHP class library for parsing HTML. I have been using simple_html_dom to scrape articles over the past few days. Chinese websites are mostly encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 the most common.
In the version of simple_html_dom I am using, the character set conversion is done by the convert_text method.
The code is as follows:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }

    $converted_text = $text;

    $sourceCharset = "";
    $targetCharset = "";

    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) { $debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset); }

    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }

    // Lets make sure that we don't have that silly BOM issue with any of the UTF-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }

    return $converted_text;
}
Let's look at this line:
The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
This conversion produces incorrect output. For example, GB2312 text comes out like this after conversion:
The output is as follows:
The 24-year-old Han Zhuangzhuang not only scored a zero penalty score in the April 26 Longines International Federation of Marathon World Cup Chinese league qualifying tournament held at the Maraton Park on April 9, 2014... the first time Zhao Zhiwen, the first Olympic contestant, received a zero penalty score, it took 77 seconds to 07...
This shows that the transcoding function does not handle the text correctly. Since I only want simple_html_dom to build the DOM, I do not plan to spend time fixing this bug properly. Instead, I simply take this line:
The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
and change it to:
The code is as follows:
$converted_text = $text;
That's all. The idea is simply to disable transcoding, so the text is returned in its original encoding and the work can continue.
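If the converted text is still needed, a gentler alternative to disabling transcoding is sketched below. It rests on an assumption this article does not confirm: that the failure comes from pages labelled GB2312 that actually contain characters outside GB2312, which the GB18030 superset can decode. The function name safe_convert_text is hypothetical and is not part of simple_html_dom. It falls back to the untouched input when iconv reports failure, so the worst case matches the workaround above.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
// Converts $text between character sets, returning the input unchanged on failure.
function safe_convert_text(string $text, string $sourceCharset, string $targetCharset): string
{
    $source = strtoupper($sourceCharset);
    $target = strtoupper($targetCharset);

    // Nothing to do when a charset is unknown or both are the same.
    if ($source === '' || $target === '' || $source === $target) {
        return $text;
    }

    // Assumption: pages declared as GB2312 or GBK may use characters from the
    // larger GB18030 set, so decode with the superset instead.
    if ($source === 'GB2312' || $source === 'GBK') {
        $source = 'GB18030';
    }

    // iconv() returns false (and raises a notice) on illegal input;
    // suppress the notice and keep the original text in that case.
    $converted = @iconv($source, $target, $text);

    return $converted === false ? $text : $converted;
}
```

Inside convert_text, the iconv line could then call this helper with $sourceCharset, $targetCharset, and $text instead of calling iconv directly. Whether this removes the garbled output depends on the actual cause on the affected pages.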