When processing crawled pages, I often run into this character:
�
I have tried various transcoding approaches, with no luck.
For example:
每个人对工� �的使用往往各有偏好
The original text is:
每个人对工具的使用往往各有偏好
How should this situation be handled, and what causes it?
For reference, the character can be inspected here:
http://apps.timwhitlock.info/unicode/inspect?s=%EF%BF%BD
Reply content:
When converting from some encoding to Unicode, any byte sequence that has no corresponding character is replaced by the Unicode code point U+FFFD ("\ufffd"), which is displayed as the � character.
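A minimal sketch of this behavior, written in Python purely for illustration (the thread does not say what the crawler is written in). It decodes GBK bytes as if they were UTF-8 with `errors="replace"`, and shows that U+FFFD encodes to the `EF BF BD` bytes seen in the inspector URL:

```python
# Bytes of "工具" as a GBK/GB2312 page would send them.
raw = "工具".encode("gbk")

# Decoding them as UTF-8 is wrong; errors="replace" substitutes U+FFFD
# for every byte sequence that is not valid UTF-8.
decoded = raw.decode("utf-8", errors="replace")
print(decoded)

# U+FFFD itself is three bytes in UTF-8: EF BF BD.
print("\ufffd".encode("utf-8"))  # b'\xef\xbf\xbd'

# Decoding with the correct encoding recovers the text.
print(raw.decode("gbk"))
```

Once the replacement has happened, the original bytes are gone from the string, which is why transcoding the result afterwards cannot bring the text back.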
This happens because your crawler does not detect the page's original encoding (ASCII, GB2312, etc.) or its compression format (gzip, etc.) and blindly treats everything as a UTF-8 string. The character indicates that the conversion failed and data has been lost; the character itself carries no meaning.
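A rough sketch of what detecting the compression and encoding before decoding could look like, using only the Python standard library. The function name `decode_page`, its `content_type` parameter, and the fallback order are assumptions made for this example, not something taken from the thread:

```python
import gzip
import re

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_page(body: bytes, content_type: str = "") -> str:
    """Hypothetical helper: undo gzip, then decode with the declared charset."""
    if body.startswith(GZIP_MAGIC):
        body = gzip.decompress(body)

    candidates = []
    # 1. charset declared in the HTTP Content-Type header, if any
    header_match = re.search(r"charset=([\w-]+)", content_type, re.I)
    if header_match:
        candidates.append(header_match.group(1))
    # 2. charset declared in a <meta> tag near the top of the page
    meta_match = re.search(rb"charset=[\"']?([\w-]+)", body[:2048], re.I)
    if meta_match:
        candidates.append(meta_match.group(1).decode("ascii"))
    # 3. assumed fallbacks for Chinese pages
    candidates += ["utf-8", "gb18030"]

    for encoding in candidates:
        try:
            return body.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return body.decode("utf-8", errors="replace")


page = '<meta charset="gb2312">每个人对工具的使用往往各有偏好'.encode("gb2312")
text = decode_page(gzip.compress(page))
print(text)
```

A real crawler would normally take `content_type` from the response headers and might use a statistical detector when nothing is declared; this sketch only covers the declared cases.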
If you are using PHP, this may be caused by substr, which cuts strings by bytes. The fix is to install the mbstring extension and use the mb_* family of functions.
Sometimes this symbol appears when part of a Chinese character is cut off, for example when a two-byte Chinese character loses one of its bytes. I don't know how to deal with that, though ...
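One possible way to handle that case, sketched in Python: when text has to be cut to a byte limit, drop the incomplete trailing character instead of leaving half of it behind. The helper name `truncate_utf8` is made up for this example, and it assumes the input is valid text to begin with:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # errors="ignore" silently drops the partial character left at the end.
    return cut.decode("utf-8", errors="ignore")


# "工" and "具" are three bytes each in UTF-8, so a 4-byte limit
# falls in the middle of "具".
print(truncate_utf8("工具", 4))
print(truncate_utf8("工具", 6))
```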
You probably split the character '具' into two parts.
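// Character-aware substring for UTF-8 strings:
// skips $from characters, then keeps up to $len characters.
public static function utf8Substr($str, $from, $len)
{
    return preg_replace(
        '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,' . $from . '}' .
        '((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,' . $len . '}).*#s',
        '$1',
        $str
    );
}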
My guess is that the text was truncated where a line was broken. When you chop the text up, check where each break falls: did you split a character into two halves?
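If the text arrives in chunks, an incremental decoder is one way to avoid this. The sketch below (Python, with a hypothetical `decode_chunks` helper) splits the sample sentence in the middle of "具" and compares decoding each chunk on its own with decoding through `codecs.getincrementaldecoder`, which holds back an incomplete character until the rest of its bytes arrive:

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks, joining characters split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush anything left over
    return "".join(parts)


original = "每个人对工具的使用往往各有偏好"
data = original.encode("utf-8")

# Cut one byte into the three-byte character "具".
cut = len("每个人对工".encode("utf-8")) + 1
chunks = [data[:cut], data[cut:]]

# Decoding each chunk separately damages the split character.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)
print(naive)

# The incremental decoder reassembles it.
joined = decode_chunks(chunks)
print(joined)
```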
Maybe the text is simply garbled by a wrong encoding.
Try opening the page in a browser to see whether it displays correctly there.