Special punctuation, often a mix of Chinese and English marks, comes up a lot in day-to-day work. English punctuation is easy to filter, while Chinese punctuation is more troublesome. The following shows how to filter special symbols out of a message, for reference.
Here is an example spam message:
"Want to do/concurrently _ Grade/student _/,plus,I Q: 1 5. 8 0. !!?? 8 6. 0. 2. 3 Yes,surprise,Hi,oh"
In this message, marks such as "!", "?", "。" and "," are Chinese (full-width) punctuation, while "/" and "." are English (ASCII) punctuation.
Here's how to filter:
<span style= "FONT-SIZE:18PX;" >#-*-coding:utf-8-*-
import re

# Sample spam message mixing English and Chinese punctuation
temp = "Want to do/concurrently _ Grade/student _/,plus,I Q: 1 5. 8 0. !!?? 8 6. 0. 2. 3 Yes,surprise,Hi,oh"
# Python 2: decode the UTF-8 byte string into a unicode string
temp = temp.decode("utf8")
# First character class: whitespace and English punctuation; second: Chinese punctuation
string = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), temp)
print string</span>
The filtered result is as follows:
<span style= "FONT-SIZE:18PX;" > Want to be a part-time student plus I Q158086023 have a surprise oh </span>
Once the text has been processed into the format above, it is easy to run word segmentation and analysis on it.
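The same idea can be sketched for Python 3, where `str` is already Unicode and the `decode` calls are no longer needed. This is only an illustrative sketch: the names `ASCII_PUNCT`, `CJK_PUNCT` and `strip_punctuation` are hypothetical, and the character sets are an assumed reading of the pattern used in the article, so they may need extending for real data.

```python
import re

# ASCII punctuation to strip (assumed set, taken from the article's pattern).
ASCII_PUNCT = "./!_,$%^*(+\"'"
# Chinese and other symbols to strip (assumed set, taken from the article's pattern).
CJK_PUNCT = "+——!,。?、~@#¥%……&*()"

# One character class: whitespace plus every listed symbol, escaped for safety.
PATTERN = re.compile("[\\s" + re.escape(ASCII_PUNCT + CJK_PUNCT) + "]+")


def strip_punctuation(text):
    """Remove whitespace and the listed punctuation from text."""
    return PATTERN.sub("", text)


print(strip_punctuation("Q 1 5. 8 0。!!8 6"))
print(strip_punctuation("你好,世界!"))
```

Characters that are not in either set, such as a colon, are left untouched, so any extra symbols that appear in the data would have to be added to the lists.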
Filtering Chinese and English punctuation and special symbols in Python