A transcoding bug in simple_html_dom, the PHP HTML parsing library
Over the past few days I have been using simple_html_dom to scrape some articles. Websites in China are generally encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 being the most common.
In my current version of simple_html_dom, the method that handles transcoding is convert_text.
The code is as follows:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}
    $converted_text = $text;
    $sourceCharset = "";
    $targetCharset = "";
    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }
    // Lets make sure that we don't have that silly BOM issue with any of the UTF-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }
    return $converted_text;
}
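The function above skips conversion when the text already looks like UTF-8, using the library's own is_utf8 method, which is not shown here. As a rough illustration of what such a check can look like, here is a minimal sketch. The name looks_like_utf8 is hypothetical and this is not the library's implementation; it relies on the fact that preg_match with the u modifier fails on subjects that are not valid UTF-8.

```php
<?php
// Hypothetical helper for illustration only; simple_html_dom ships its own is_utf8().
function looks_like_utf8(string $text): bool
{
    // With the "u" modifier PCRE validates the subject as UTF-8 first.
    // An empty pattern matches any valid string, so the result is 1 for
    // valid UTF-8 and false when the bytes are not valid UTF-8.
    return preg_match('//u', $text) === 1;
}

// The character U+4E2D encoded as UTF-8 (three bytes).
var_dump(looks_like_utf8("\xE4\xB8\xAD"));

// The same character encoded as GB2312 (two bytes), which is not valid UTF-8.
var_dump(looks_like_utf8("\xD6\xD0"));
```

A check like this can only say whether the bytes form valid UTF-8. It cannot tell which legacy encoding a page really uses, so a wrongly declared source charset still reaches iconv.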
Let's look at this line:
The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
The result of this transcoding is incorrect. For example, here is a piece of gb2312 text after conversion:
The 24-year-old Han Zhuangzhuang not only scored a zero penalty score in the April 26 Longines International Federation of Marathon World Cup Chinese league qualifying tournament held at the Maraton Park on April 9, 2014... the first time Zhao Zhiwen, the first Olympic contestant, received a zero penalty score, it took 77 seconds to 07...
This shows that the transcoding step does not handle the text correctly. Since I only use simple_html_dom to build the DOM, I do not plan to spend time tracking down this bug. Instead, I simply change
The code is as follows:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
to
The code is as follows:
$converted_text = $text;
That's all. The idea is simply to skip transcoding, so the rest of the work can continue without worrying about it.
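Skipping transcoding means the text is passed through in whatever encoding the page used. A less drastic variant, sketched here under assumptions, would still attempt the conversion but keep the original bytes when iconv reports a failure. The function safe_convert_text is hypothetical and not part of simple_html_dom; the only library behavior it relies on is that PHP's iconv returns false when the conversion fails.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: try to transcode,
// but keep the original bytes whenever the conversion fails.
function safe_convert_text(string $text, string $sourceCharset, string $targetCharset): string
{
    // Nothing to do when a charset is unknown or both are the same.
    if ($sourceCharset === '' || $targetCharset === ''
        || strcasecmp($sourceCharset, $targetCharset) === 0) {
        return $text;
    }

    // iconv() returns false on failure; @ silences the accompanying notice.
    $converted = @iconv($sourceCharset, $targetCharset, $text);

    // Fall back to the untouched input instead of returning false.
    return $converted === false ? $text : $converted;
}

// The letter e with an acute accent in ISO-8859-1 is the single byte 0xE9.
echo bin2hex(safe_convert_text("\xE9", 'ISO-8859-1', 'UTF-8')), "\n";

// Input that is not valid in the declared source charset is returned as-is
// if iconv rejects it.
echo bin2hex(safe_convert_text("abc\xFF", 'UTF-8', 'ISO-8859-1')), "\n";
```

This does not fix a wrongly declared source charset. It only avoids replacing the text with a failed conversion result.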