This article shows how to solve the problem of Chinese text being truncated by iconv in PHP. It is shared here for your reference. The analysis is as follows:
Today I wrote a scraper. The principle is simple: use cURL to fetch the HTML of the target page, then extract the required data and save it to the database.
The target page is GB2312-encoded, while UTF-8 is used locally, so the content has to be converted after it is fetched.
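The fetch step described above might look roughly like the following. This is only a minimal sketch: the helper name fetch_html, the timeout value, and the example URL are assumptions made for illustration and are not taken from the original program.

```php
<?php
// Hypothetical helper: fetch the raw bytes of a page with cURL.
// Returns null when the request fails.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // seconds; an arbitrary choice
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (the URL is a placeholder):
// $content = fetch_html('http://example.com/page.html');
```

The returned string is still in the page's own encoding (GB2312 here), which is why the conversion step is needed.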
Encoding conversion with iconv
iconv: convert a string to the requested character encoding
string iconv ( string $in_charset , string $out_charset , string $str )
Converts the string str from in_charset to out_charset.
The conversion is very simple; calling iconv directly is enough:
<?php
$content = iconv('GB2312', 'UTF-8', $content); // $content is the fetched content
?>
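As a small worked example of what this call does, the sketch below converts a hand-written GB2312 byte string. The sample bytes are an assumption chosen for illustration: D6 D0 CE C4 is the GB2312 encoding of the two characters 中文.

```php
<?php
// "\xD6\xD0\xCE\xC4" is the GB2312 byte sequence for the two characters 中文.
$gb2312 = "\xD6\xD0\xCE\xC4";
$utf8   = iconv('GB2312', 'UTF-8', $gb2312);

// Each of these characters takes 2 bytes in GB2312 and 3 bytes in UTF-8.
var_dump(strlen($gb2312));              // int(4)
var_dump(strlen($utf8));                // int(6)
var_dump($utf8 === "\u{4E2D}\u{6587}"); // bool(true)
```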
I tested a few pages and they were scraped correctly. Later on, however, a few pages were not scraped completely.
I first considered whether the fetch itself was at fault, and ruled that out after checking. Further investigation showed that the content after the iconv conversion was a large chunk shorter than the fetched content.
The Apache log showed this notice: Notice: iconv(): Detected an illegal character in input string
I looked up the manual and found the following explanation:
If you append the string //TRANSLIT to out_charset, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated by one or several similar-looking characters.
If you append the string //IGNORE, characters that cannot be represented in the target character set are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated.
So when iconv encounters content it cannot recognize, it truncates the string at the first unrecognized character and raises an E_NOTICE, and everything after that point is lost.
Appending //IGNORE to the output character set discards only the unrecognized characters; the content that follows is kept rather than truncated.
After modifying the program, everything works.
<?php
$content = iconv('GB2312', 'UTF-8//IGNORE', $content); // $content is the fetched content
?>
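To see the difference without digging through the server log, the notices raised by iconv can be collected in the script itself. The following is a minimal sketch under stated assumptions: the helper name convert_with_notices and the sample bytes are made up for illustration, and the exact return value for bad input (a truncated string or false) and the effect of //IGNORE depend on the PHP version and the underlying iconv library, so no specific output is claimed for those cases.

```php
<?php
// Hypothetical helper: run iconv from GB2312 and collect any notices or
// warnings it raises, instead of leaving them only in the server log.
function convert_with_notices(string $bytes, string $to = 'UTF-8'): array
{
    $notices = [];
    set_error_handler(function (int $errno, string $errstr) use (&$notices): bool {
        $notices[] = $errstr;
        return true; // handled here; skip PHP's default handler
    });
    try {
        $result = iconv('GB2312', $to, $bytes);
    } finally {
        restore_error_handler();
    }
    return ['result' => $result, 'notices' => $notices];
}

// GB2312 bytes for 中文 with two illegal bytes (0xFF 0xFF) in the middle.
$dirty = "\xD6\xD0\xFF\xFF\xCE\xC4";

$plain  = convert_with_notices($dirty);                  // no suffix
$ignore = convert_with_notices($dirty, 'UTF-8//IGNORE'); // with //IGNORE

var_dump($plain['notices']); // should include an "illegal character" notice
var_dump($ignore['result']); // varies with PHP version and iconv library
```

If a non-empty notices list is returned, the input contained bytes that are not valid in the declared source encoding, which is the situation described above.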
Tip: when converting to or from UTF-8 with iconv, write the charset name as UTF-8 rather than UTF8, because UTF8 causes problems on some servers.
I hope this article helps you with your PHP programming.