Over the past few days I have been scraping articles with simple_html_dom. Chinese websites are generally encoded in GBK, GB2312, or UTF-8, and most use GB2312 or UTF-8.
In the version of simple_html_dom I am using, the method that handles transcoding is convert_text. The code is as follows:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}

    $converted_text = $text;

    $sourceCharset = "";
    $targetCharset = "";

    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}

    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }

    // Lets make sure that we don't have that silly BOM issue with any of the UTF-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }

    return $converted_text;
}
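The last part of convert_text removes a UTF-8 byte order mark (the bytes EF BB BF) from both ends of the text. As a minimal sketch of that step on its own, the helper below does the same thing outside the class. The name strip_utf8_bom is hypothetical and is not part of simple_html_dom.

```php
<?php
// Hypothetical standalone helper mirroring the BOM cleanup in convert_text().
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // UTF-8 encoding of U+FEFF

    // Remove a BOM at the very start of the text.
    if (substr($text, 0, 3) === $bom) {
        $text = substr($text, 3);
    }
    // Remove a BOM at the very end, as convert_text() also does.
    if (substr($text, -3) === $bom) {
        $text = substr($text, 0, -3);
    }
    return $text;
}

echo strip_utf8_bom("\xEF\xBB\xBFhello"), "\n"; // prints "hello"
```

Note that this step only runs when the target charset is UTF-8, so it has no effect on text that is left in GBK or GB2312.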
Let's look at this line:
    $converted_text = iconv($sourceCharset, $targetCharset, $text);
The transcoding done here is incorrect. For example, this is what some GB2312 text looks like after conversion:
In the 2014 Longines International Marathon World Cup China League qualifying tournament held at <span style="color:#C03">Longines marathon</span> park at the marathon April 26, at the age of 24, Han Zhuangzhuang not only received a zero penalty score... <span style="color:#C03">Min included</span> Zhao Zhiwen, the first Olympic rider, received a zero penalty score in 7th seconds...
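A common cause of this kind of failure is that a page declares gb2312 but contains characters that exist only in GBK, the larger superset, so a strict GB2312 conversion hits bytes it does not recognize. This is an assumption here, not something verified for this particular page. The sketch below is one way to check: it tries each candidate charset in turn and reports the first one that iconv accepts. The function name first_working_charset is hypothetical, and the result depends on which charsets the local iconv implementation supports.

```php
<?php
// Hypothetical diagnostic, not part of simple_html_dom.
// Returns the first charset in $candidates that converts $bytes to UTF-8
// without an iconv error, or null if none of them works.
function first_working_charset(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $charset) {
        // @ silences the notice iconv raises on an illegal byte sequence;
        // the failure is still visible through the false return value.
        $out = @iconv($charset, 'UTF-8', $bytes);
        if ($out !== false) {
            return $charset;
        }
    }
    return null;
}

// 0x81 0x40 is a valid GBK character, but 0x81 is not a valid GB2312 lead byte.
$sample = "\x81\x40";
var_dump(first_working_charset($sample, ['GB2312', 'GBK']));
```

If GBK succeeds where GB2312 fails, passing GBK as the source charset is a less drastic fix than disabling the conversion.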
The output above shows that the transcoding function does not handle this text properly. Since I only want simple_html_dom to build the DOM, I don't plan to spend time tracking down the bug. Instead, I simply changed
    $converted_text = iconv($sourceCharset, $targetCharset, $text);
to
    $converted_text = $text;
That's all there is to it. The idea is simply to skip transcoding, so you can stop worrying about it and get on with your work.
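With transcoding skipped, any text read from the DOM stays in the page's original encoding. If UTF-8 is needed later, the extracted strings can be converted separately. The following is a minimal sketch that assumes the mbstring extension is available. The helper name extracted_to_utf8 is hypothetical, and the page charset has to be supplied by the caller.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
function extracted_to_utf8(string $text, string $pageCharset): string
{
    // Already valid UTF-8 (pure ASCII included): nothing to do.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $pageCharset);
}

// "\xD6\xD0\xB9\xFA" is "中国" in GB2312/GBK; CP936 is mbstring's name for GBK.
echo extracted_to_utf8("\xD6\xD0\xB9\xFA", 'CP936'), "\n";
```

The validity check is a heuristic: a short GBK string can occasionally also be valid UTF-8, so it is safer to convert whole paragraphs than isolated fragments.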