These text operations come up often, so I am summarizing them here. More will be added over time.
Methods:
strip_html(cls, text): removes HTML tags from the text
separate_words(cls, text, min_lenth=3): splits the text into lowercase words, keeping only words longer than min_lenth characters
get_words_frequency(cls, words_list): counts how often each word occurs
Source:
import re


class DocProcess(object):

    @classmethod
    def strip_html(cls, text):
        """Delete HTML tags in text. text is a string."""
        new_text = ""
        is_html = False
        for character in text:
            if character == "<":
                # Entering a tag: skip characters until the closing ">".
                is_html = True
            elif character == ">":
                # Leaving a tag: add a space so neighbouring words stay apart.
                is_html = False
                new_text += " "
            elif is_html is False:
                new_text += character
        return new_text

    @classmethod
    def separate_words(cls, text, min_lenth=3):
        """Separate text into a list of lowercase words."""
        # Split on runs of non-word characters (\W+), not on the words themselves.
        splitter = re.compile("\\W+")
        # Keep only words strictly longer than min_lenth characters.
        return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

    @classmethod
    def get_words_frequency(cls, words_list):
        """Get the frequency of words in words_list. Return a dict."""
        num_words = {}
        for word in words_list:
            num_words[word] = num_words.get(word, 0) + 1
        return num_words
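
To see how the three steps chain together, here is a minimal sketch. It is an assumption-laden variant, not the class listed above. The standalone functions reuse the same names and the same min_lenth parameter, but they are rewritten with re.sub and collections.Counter from the standard library so the example runs on its own. The names TAG_RE and SPLIT_RE and the sample HTML string are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical helper patterns, introduced only for this sketch.
TAG_RE = re.compile(r"<[^>]*>")   # anything between "<" and ">"
SPLIT_RE = re.compile(r"\W+")     # runs of non-word characters


def strip_html(text):
    """Replace each HTML tag with a space so adjacent words stay separate."""
    return TAG_RE.sub(" ", text)


def separate_words(text, min_lenth=3):
    """Return lowercase words strictly longer than min_lenth characters."""
    return [s.lower() for s in SPLIT_RE.split(text) if len(s) > min_lenth]


def get_words_frequency(words_list):
    """Map each word to the number of times it occurs."""
    return dict(Counter(words_list))


# Made-up sample input.
html = "<p>Text mining makes <b>text</b> processing simple</p>"
words = separate_words(strip_html(html))
print(words)
print(get_words_frequency(words))
```

With the default min_lenth=3, a three-letter word such as "the" is dropped, because the comparison is strictly greater than. A simple regular expression like TAG_RE is enough for quick text cleanup, but it is not a real HTML parser. Input containing scripts, comments, or a ">" inside an attribute value would need something sturdier, such as the standard library's html.parser.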