(1) Program analysis
    # filename: word_freq.py
    # Note: Code style

    import argparse
    import re

    def process_file(dst):  # read file to buffer
        try:  # open file
            file = open(dst, 'r', encoding='iso-8859-1')
        except IOError as s:
            print('IOError', s)
            return None
        try:  # read file to buffer
            bvffer = file.read()
        except:
            print("Read File error!")
            return None
        file.close()
        return bvffer
    # This function reads the TXT file as ISO-8859-1; reading it as UTF-8
    # fails with an error at "floating" at the beginning of a line.

    def process_buffer(bvffer):
        if bvffer:
            word_freq = {}
            # Add the code that processes bvffer below: count the frequency
            # of each word and store it in the dictionary word_freq.
            content = bvffer.lower()
            regex = re.compile(r'[^a-zA-Z\s]')
            content_result = regex.sub('', content)
            # NOTE: str methods return new strings; the next two results are
            # discarded, so these two lines have no effect as written.
            content_result.lower()
            content_result.replace('\n', ' ')
            content_result = content_result.split(' ')
            for word in content_result:
                word_freq[word] = word_freq.get(word, 0) + 1
            return word_freq
    # This function processes the buffer: it lowercases the text, uses a
    # regular expression to remove everything except letters and whitespace,
    # and uses the word_freq dictionary to count the frequency of each word.

    def output_result(word_freq):
        if word_freq:
            sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
            for item in sorted_word_freq[:10]:  # output top 10 words
                print(item)
    # Sorts the entries by count and prints the first ten.

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('dst')
        args = parser.parse_args()
        dst = args.dst
        bvffer = process_file(dst)
        word_freq = process_buffer(bvffer)
        output_result(word_freq)
    # Program entry point: accepts the parameter dst and runs the steps in order.
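The processing steps described above (lowercase, strip non-letters, split, count) can also be sketched more compactly. This is only an illustrative variant, not the program that was profiled: the helper names `count_words` and `top_words` are hypothetical, and it assumes that splitting on any whitespace (including newlines) is the intended behavior.

```python
import re
from collections import Counter

def count_words(text):
    """Count words in text (hypothetical helper, not part of word_freq.py)."""
    # Lowercase first, then drop everything that is not a letter or whitespace.
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    # split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces do not produce empty tokens.
    return Counter(cleaned.split())

def top_words(text, n=10):
    """Return the n most frequent (word, count) pairs."""
    return count_words(text).most_common(n)

if __name__ == "__main__":
    sample = "The cat saw the dog.\nThe dog ran!"
    print(top_words(sample, 3))
```

`Counter.most_common` replaces the manual `sorted(..., reverse=True)[:10]` step; whether it is faster for a given input would need to be measured.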
(2) Code style
(3) Program run command and run result
(4) Performance analysis results and improvement
- Find the part of the code with the longest execution time and the highest call count (1 point)
The profile shows that the dictionary's get() method is called 411,610 times and takes 0.054s in total, so its overhead is clearly significant.
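A minimal sketch of how such call counts can be collected with the standard `cProfile` and `pstats` modules is shown here. The helper `profile_call` and the sample input are assumptions made for illustration; the figures quoted above come from the author's own run, not from this snippet.

```python
import cProfile
import io
import pstats

def count_words(words):
    """Same counting loop as in process_buffer, isolated for profiling."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

def profile_call(func, *args):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper: the report lists the five entries with the
    highest call counts.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("ncalls").print_stats(5)
    return result, stream.getvalue()

if __name__ == "__main__":
    words = ("the quick brown fox jumps over the lazy dog " * 1000).split()
    freq, report = profile_call(count_words, words)
    print(report)
```

For a whole script, the same information is available from the command line with `python -m cProfile -s ncalls word_freq.py <file>`.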
- Try to improve the program code
    for word in content_result:
        word_freq[word] = word_freq.get(word, 0) + 1

The original code is shown above. Dropping the get() method changes it to the following code:
    for word in content_result:
        # word_freq[word] = word_freq.get(word, 0) + 1
        if word_freq.__contains__(word):
            word_freq[word] = word_freq[word] + 1
        else:
            word_freq[word] = 1

Running it under cProfile gives the following:
The number of calls drops only by a small percentage (the else branch diverts some of them), but the time spent falls to 0.044s, which reduces the overall execution time from 0.35s to 0.318s, so this change is an effective optimization.
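A rough way to re-check this kind of comparison outside the profiler is `timeit`. The sketch below is an assumption-laden illustration: the function names and the sample input are hypothetical, and the second variant uses the `in` operator, which performs the same membership test as `__contains__` without an explicit method call. Timings depend on the machine and the input, so no expected numbers are given.

```python
import timeit

def count_with_get(words):
    """Original approach: one dict.get() call per word."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

def count_with_contains(words):
    """Modified approach: membership test, then update or insert."""
    freq = {}
    for word in words:
        if word in freq:
            freq[word] = freq[word] + 1
        else:
            freq[word] = 1
    return freq

if __name__ == "__main__":
    words = ("the quick brown fox jumps over the lazy dog " * 10000).split()
    for func in (count_with_get, count_with_contains):
        # Time 20 full passes over the same word list.
        seconds = timeit.timeit(lambda: func(words), number=20)
        print(f"{func.__name__}: {seconds:.3f}s for 20 runs")
```

Both functions produce the same dictionary, so any difference in the printed times reflects only the lookup strategy.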
Word frequency statistics for input files