Fixing truncated Chinese text when using iconv in PHP

Source: Internet
Author: User

This article walks through an example of Chinese content being truncated when PHP converts it with iconv(), and shows how to fix it.

Today I wrote a scraping program. The principle is simple: use cURL to fetch the HTML of the target page, parse it, extract the required data, and save it to the database.

The target page is GB2312-encoded, while UTF-8 is used locally, so the content has to be converted after it is fetched.

Encoding conversion with iconv()

iconv: convert a string to the requested character encoding
string iconv(string $in_charset, string $out_charset, string $str)

Converts the string str from in_charset to out_charset.

The conversion is simple; calling iconv() directly is enough:

<?php
$content = iconv('GB2312', 'UTF-8', $content); // $content is the fetched content
?>
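
As a quick sanity check of the call above, the following sketch converts a known GB2312 byte sequence and prints the resulting bytes in hex. The sample bytes (the two characters 中文) and the variable names are chosen for illustration and are not taken from the original program.

```php
<?php
// "中文" encoded as GB2312: 0xD6D0 0xCEC4 (sample data chosen for illustration).
$gb2312 = "\xD6\xD0\xCE\xC4";

// Convert from GB2312 to UTF-8; iconv() returns false if the conversion fails.
$utf8 = iconv('GB2312', 'UTF-8', $gb2312);

// Each of the two characters takes three bytes in UTF-8.
echo bin2hex($utf8), "\n"; // e4b8ade69687
```

Comparing hex dumps like this avoids being misled by the encoding of the terminal or browser that displays the result.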

I tested a few pages and they were collected normally. Later, however, some pages were only partially collected.
I first suspected an error in my own code and went through it step by step. It turned out that the content after iconv() transcoding was missing a large chunk compared with the content that had been fetched.
The Apache log showed this message: Notice: iconv(): Detected an illegal character in input string.

The manual says the following:

If you append the string //TRANSLIT to out_charset, transliteration is enabled. This means that when a character cannot be represented in the target character set, it can be approximated by one or several similar-looking characters.

If you append the string //IGNORE, characters that cannot be represented in the target character set are silently discarded. Otherwise, str is cut at the first illegal character and an E_NOTICE is generated.

So when iconv() encounters content it cannot convert, it truncates the string at the first such character and raises an E_NOTICE. Everything after that point is discarded.

Appending //IGNORE to the output character set discards only the unconvertible characters; the content after them is kept instead of being truncated.
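
Because the failure only shows up as a notice in the log, it is easy to store incomplete content without realizing it. Below is a minimal sketch of a wrapper that turns the notice into an exception. The name convertStrict() and its default source charset are assumptions made for illustration. Newer PHP versions are documented to return false in this situation rather than a truncated string, so the sketch checks for that as well.

```php
<?php
/**
 * Convert fetched bytes to UTF-8, or throw if the input cannot be fully converted.
 * convertStrict() is a hypothetical helper, not part of the original program.
 */
function convertStrict(string $bytes, string $from = 'GB2312'): string
{
    // Turn any notice or warning raised by iconv() into an exception,
    // so a failed or partial conversion cannot pass unnoticed.
    set_error_handler(function (int $severity, string $message): bool {
        throw new RuntimeException($message);
    });
    try {
        $converted = iconv($from, 'UTF-8', $bytes);
    } finally {
        // Always put the previous error handler back.
        restore_error_handler();
    }
    // Some PHP versions signal failure with false instead of a truncated string.
    if ($converted === false) {
        throw new RuntimeException("Conversion from $from to UTF-8 failed");
    }
    return $converted;
}
```

In a collection program, the caller could catch the exception and log the URL of the offending page instead of saving the partial content.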

After modifying the program, everything works.

<?php
$content = iconv('GB2312', 'UTF-8//IGNORE', $content); // $content is the collected content
?>
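
How //IGNORE behaves depends on the iconv implementation installed on the server, and on some systems the suffix is reported not to work. The sketch below is one defensive way to wrap the call, under that assumption: it tries //IGNORE first and falls back to a plain conversion. The name toUtf8Lossy() and the null return value are illustrative choices, not part of the original program.

```php
<?php
/**
 * Convert to UTF-8, dropping unconvertible bytes where the platform allows it.
 * toUtf8Lossy() is a hypothetical helper; it returns null when no attempt succeeds.
 */
function toUtf8Lossy(string $bytes, string $from = 'GB2312'): ?string
{
    // Try //IGNORE first, then a plain conversion in case the local
    // iconv implementation rejects or mishandles the suffix.
    foreach (['UTF-8//IGNORE', 'UTF-8'] as $target) {
        // @ suppresses the notice or warning; failure is detected via false.
        $converted = @iconv($from, $target, $bytes);
        if ($converted !== false) {
            return $converted;
        }
    }
    return null;
}
```

A caller would check for null before saving, for example by skipping the page and logging it for later inspection.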

Tip: when the target encoding is UTF-8, write it as UTF-8 rather than UTF8 in the iconv() call, because UTF8 is not recognized on some servers.
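
If charset names come from configuration or from page metadata, one way to follow this tip is to normalize them before calling iconv(). The helper below is a small sketch: normalizeCharset() and its alias table are hypothetical and cover only the names used in this article.

```php
<?php
/**
 * Map loosely written charset names to the spellings used in this article.
 * normalizeCharset() and its alias table are hypothetical examples.
 */
function normalizeCharset(string $name): string
{
    $aliases = [
        'utf8'   => 'UTF-8',
        'gb2312' => 'GB2312',
        'gbk'    => 'GBK',
    ];
    // Compare case-insensitively and ignore '-' and '_',
    // so 'utf8', 'UTF_8' and 'utf-8' all map to the same key.
    $key = strtolower(str_replace(['-', '_'], '', trim($name)));
    // Unknown names are passed through unchanged (apart from trimming).
    return $aliases[$key] ?? trim($name);
}
```

For example, iconv('GB2312', normalizeCharset('utf8'), $content) would then always pass UTF-8 as the target name.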

I hope this article will help you with your PHP program design.
