Before computing Chinese word-segmentation statistics, crawled text usually contains HTML tags, punctuation, English letters, and similar noise that must be filtered out. This process is called data cleaning.
# coding=utf-8
import re

def strs_filter(path):
    with open(path, "r", encoding="utf8") as f, open("result.txt", "a+", encoding="utf8") as c:
        # start matching from '<'; every character up to the next '>' is consumed
        re_html = re.compile(r'<[^>]+>')
        # remove English and Chinese punctuation
        re_punc = re.compile(r'[\s.!/_,$%^*(+"\']+|[+——!,。?、~@#¥%……&*“”:()]+')
        # remove digits and English letters; re.ASCII keeps \w from also matching Chinese characters
        re_digits_letter = re.compile(r'\w+', re.ASCII)
        for line in f:
            line = re_html.sub('', line)
            line = re_punc.sub('', line)
            line = re_digits_letter.sub('', line)
            c.write(line)

strs_filter("strip.txt")
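One pitfall when running this kind of cleaning under Python 3: `\w` matches Unicode word characters, including Chinese, so a bare `\w+` substitution would wipe out the very text we want to keep. Passing the `re.ASCII` flag restricts `\w` to `[a-zA-Z0-9_]`. A quick check (the sample string is made up for illustration):

```python
import re

s = '中文abc123'

# Default Unicode \w also matches the Chinese characters, deleting everything.
unicode_result = re.sub(r'\w+', '', s)
print(repr(unicode_result))  # → ''

# With re.ASCII, \w only matches ASCII letters/digits, so the Chinese survives.
ascii_result = re.sub(r'\w+', '', s, flags=re.ASCII)
print(repr(ascii_result))  # → '中文'
```

This is why the digit/letter pattern should carry the `re.ASCII` flag when the code is ported from Python 2 (where a plain `\w` on a decoded string was ASCII-only by default) to Python 3.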
The code above strips out content irrelevant to Chinese word-segmentation statistics. The effect is as follows:
HTML tags, English and Chinese punctuation, digits, and English letters are all removed from the text.
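As a quick illustration, the same three patterns can be applied to an in-memory string instead of a file (the sample input below is hypothetical, chosen only to exercise each pattern once):

```python
import re

# The three cleaning patterns from strs_filter, in the same order.
re_html = re.compile(r'<[^>]+>')
re_punc = re.compile(r'[\s.!/_,$%^*(+"\']+|[+——!,。?、~@#¥%……&*“”:()]+')
re_digits_letter = re.compile(r'\w+', re.ASCII)

text = '<p>Python3 数据清洗, 测试 123!</p>'
for pattern in (re_html, re_punc, re_digits_letter):
    text = pattern.sub('', text)

print(text)  # → 数据清洗测试
```

Only the Chinese characters survive: the tags go first, then the spaces, comma, and exclamation mark, and finally the ASCII word runs "Python3" and "123".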