When dealing with Chinese strings, how to handle the � character

Source: Internet
Author: User
When processing crawled pages, I often find the � character in the text. I have tried a variety of encoding conversions without success.

For example:

每个人对工� �的使用往往各有偏好

The original text is:

每个人对工具的使用往往各有偏好

How should this be handled, and what causes it?

Details about the character can be seen here:

http://apps.timwhitlock.info/unicode/inspect?s=%EF%BF%BD

Reply content:

When converting from some encoding to Unicode, any byte sequence that has no corresponding character is replaced with the code point "\ufffd" (U+FFFD, REPLACEMENT CHARACTER), which is this character.
This happens when your crawler does not detect the original page's encoding (ASCII, GB2312, etc.) or compression format (gzip, etc.) and blindly treats everything as a UTF-8 string. The character indicates that the conversion failed and the data has already been lost; the character itself carries no meaning.
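As a rough illustration of the point above, here is a minimal Python sketch. The helper name `decode_body` and its `charset` and `gzipped` parameters are made up for this example; in a real crawler those values would presumably come from the response headers or the page's meta tag. It only shows that decoding GB2312 bytes as UTF-8 yields U+FFFD, while decoding with the right charset recovers the text.

```python
import gzip


def decode_body(raw: bytes, charset: str = "utf-8", gzipped: bool = False) -> str:
    """Hypothetical helper: decompress if needed, then decode with the given charset."""
    if gzipped:
        raw = gzip.decompress(raw)
    # errors="replace" substitutes U+FFFD for bytes that are invalid in `charset`.
    return raw.decode(charset, errors="replace")


original = "每个人对工具的使用往往各有偏好"
body = original.encode("gb2312")  # pretend the page was served as GB2312

wrong = body.decode("utf-8", errors="replace")  # blind UTF-8 decode
right = decode_body(body, charset="gb2312")     # decode with the real charset

print("\ufffd" in wrong)   # the replacement character shows up
print(right == original)   # the text is recovered intact
```

Once the text has been decoded with the wrong charset and saved, the original bytes are gone, so the fix has to happen at the point where the raw bytes are first decoded.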

If you are using PHP, this may be caused by substr(), which cuts by bytes. The workaround is to install the mbstring extension and use the mb_* family of functions.

Sometimes this symbol appears when part of a Chinese character is cut off, for example when a two-byte character loses one of its bytes. I do not know how to deal with it ...
You have probably split the character '具' into two parts.
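To make the byte-cutting explanation concrete, here is a small Python sketch that assumes the text is UTF-8 (where Chinese characters take three bytes). The function name `truncate_utf8` is hypothetical; it simply backs up to the previous character boundary before cutting.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 bytes to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first dropped byte
    # is one of them, the cut is inside a character, so back up to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "工具的".encode("utf-8")  # 9 bytes, 3 per character

naive = text[:4].decode("utf-8", errors="replace")  # cut lands inside 具
safe = truncate_utf8(text, 4).decode("utf-8")       # cut moved back to a boundary

print("\ufffd" in naive)
print(safe)
```

The naive slice keeps one stray byte of 具, which can only be decoded as the replacement character; the boundary-aware cut drops the incomplete character instead.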

public static function utf8Substr($str, $from, $len)
{
    // Skip $from UTF-8 characters, keep the next $len, drop the rest.
    return preg_replace(
        '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,' . $from . '}' .
        '((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,' . $len . '}).*#s',
        '$1',
        $str
    );
}

My guess is that the text was truncated where it was split into lines. Check the point where you cut the text: did you break a character into two halves?
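If the cut comes from reading or splitting the raw bytes in pieces and decoding each piece separately, one possible approach in Python is an incremental decoder, which holds back an incomplete character until the next piece arrives. This is only a sketch: the `decode_chunks` helper and the chunk boundary are invented for the example, and the data is assumed to be UTF-8.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks as one stream so characters split across chunks survive."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and reports any truly incomplete tail.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


original = "每个人对工具的使用往往各有偏好"
data = original.encode("utf-8")
chunks = [data[:16], data[16:]]  # the boundary falls inside 具 (bytes 15-17)

# Decoding each chunk on its own damages the character at the boundary.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)
joined = decode_chunks(chunks)

print("\ufffd" in naive)
print(joined == original)
```

Joining all the bytes first and decoding once would work just as well when the whole page fits in memory.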

Maybe the page itself is garbled.
Try opening it in a browser to see whether it displays normally.
