Word frequency statistics for input files

Source: Internet
Author: User

(1) Program analysis
    # filename: word_freq.py
    # Note: code style

    import argparse
    import re


    def process_file(dst):
        """Read the file at dst into a buffer; return None on failure."""
        try:
            # The TXT file is read as ISO-8859-1. According to the author,
            # reading it as UTF-8 reports an error at the beginning of a line
            # (rendered as "floating" in the source).
            with open(dst, 'r', encoding='iso-8859-1') as file:
                return file.read()
        except OSError as s:
            print('IOError', s)
            return None


    def process_buffer(bvffer):
        """Count the frequency of each word in bvffer and return a dictionary."""
        if bvffer:
            word_freq = {}
            # Convert to lowercase, use a regular expression to remove every
            # character that is not a letter or whitespace, then split on
            # whitespace and count each word in the word_freq dictionary.
            content = bvffer.lower()
            regex = re.compile(r'[^a-zA-Z\s]')
            content_result = regex.sub('', content).split()
            for word in content_result:
                word_freq[word] = word_freq.get(word, 0) + 1
            return word_freq


    def output_result(word_freq):
        """Sort by frequency and print the ten most frequent words."""
        if word_freq:
            sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
            for item in sorted_word_freq[:10]:  # output top 10 words
                print(item)


    if __name__ == "__main__":
        # Program entry: accept the parameter dst and run the steps in order.
        parser = argparse.ArgumentParser()
        parser.add_argument('dst')
        args = parser.parse_args()
        dst = args.dst
        bvffer = process_file(dst)
        word_freq = process_buffer(bvffer)
        output_result(word_freq)
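The cleaning and counting step can be tried on a short string before running it on a whole file. The following is a minimal sketch rather than the author's code: `count_words` and `sample` are hypothetical names introduced here, and the helper uses `split()` with no argument so that newlines and repeated spaces do not produce empty words.

```python
import re


def count_words(text):
    """Hypothetical helper: the same cleaning idea as process_buffer."""
    # Lowercase, then delete everything except letters and whitespace.
    cleaned = re.sub(r'[^a-z\s]', '', text.lower())
    counts = {}
    # split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces never yield empty strings.
    for word in cleaned.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "The cat saw the dog.\nThe dog ran!"  # made-up input
counts = count_words(sample)
top = sorted(counts.items(), key=lambda v: v[1], reverse=True)[:2]
print(top)  # [('the', 3), ('dog', 2)]
```

Punctuation is removed before splitting, so "dog." and "dog" are counted as the same word.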

(2) Code style
    • When a statement wraps inside parentheses, indent the continuation lines by 4 spaces; this applies when the line break comes immediately after the opening parenthesis.
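The indentation rule above can be illustrated with a short sketch. The function `describe` and its arguments are hypothetical and only serve to show the layout: nothing follows the opening parenthesis on the first line, and the wrapped arguments are indented by 4 spaces.

```python
def describe(word, count, total, unit="times"):
    """Hypothetical helper used only to demonstrate the wrapping style."""
    return f"{word}: {count} {unit} ({count / total:.0%})"


# Hanging indent: the line breaks right after the opening parenthesis,
# and each continuation line is indented by 4 spaces.
summary = describe(
    "the", 3, 12,
    unit="occurrences",
)
print(summary)  # the: 3 occurrences (25%)
```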

(3) Program run command and run result

(4) Performance analysis results and improvement

    • Find the part of the code that takes the most execution time and is called the most often. (1 point)

The profiling output shows that the dictionary's get() method is called 411,610 times and accounts for 0.054 s, which is a significant overhead.
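Call counts like the one above can be collected with the standard-library `cProfile` and `pstats` modules. The sketch below is an assumption about how such a measurement might be set up, not the author's exact procedure: `count_with_get` and the `words` list are made up for illustration, and the numbers it prints will differ from those reported here.

```python
import cProfile
import io
import pstats


def count_with_get(words):
    """Count words the same way the original loop does, using dict.get()."""
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq


words = ["a", "b", "a", "c"] * 1000  # made-up input, not the author's file

# Profile only the counting step.
profiler = cProfile.Profile()
profiler.enable()
result = count_with_get(words)
profiler.disable()

# Write the report to a string, sorted by time spent inside each function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, the `ncalls` column gives the number of calls and `tottime` the time spent inside each function; calls to `get` appear as a method of `dict` objects.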

    • Try to improve the program code

    for word in content_result:
        word_freq[word] = word_freq.get(word, 0) + 1

The original code is shown above. The improved version drops the get() method and uses the following code instead:

    for word in content_result:
        # word_freq[word] = word_freq.get(word, 0) + 1
        # `word in word_freq` is the idiomatic spelling of the same test.
        if word_freq.__contains__(word):
            word_freq[word] = word_freq[word] + 1
        else:
            word_freq[word] = 1

Profile the modified program with cProfile again.

The number of calls drops only slightly (the else branch takes over some of them), but the time spent in this step falls to 0.044 s, and the overall execution time falls from 0.35 s to 0.318 s, so this change is an effective optimization.
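A comparison like this one can also be repeated with the standard-library `timeit` module. The sketch below is only an illustration under stated assumptions: the function names and the `words` list are hypothetical, and `collections.Counter` is added as a further alternative that the original text does not use. Timings depend on the machine and the Python version, so no expected numbers are given.

```python
import timeit
from collections import Counter


def count_get(words):
    """Original approach: dict.get() with a default."""
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq


def count_membership(words):
    """Modified approach: membership test with an else branch."""
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq


def count_counter(words):
    """Alternative not in the original: collections.Counter."""
    return dict(Counter(words))


# Made-up input, not the author's file.
words = ("the quick brown fox jumps over the lazy dog " * 2000).split()

for func in (count_get, count_membership, count_counter):
    seconds = timeit.timeit(lambda: func(words), number=20)
    print(f"{func.__name__}: {seconds:.4f} s")
```

All three functions return the same counts, so the choice between them is a matter of speed and readability.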
