Let f(w) be the frequency of a word w in a free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e., f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
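The arithmetic behind the 50th versus 150th example can be checked directly. This is only a minimal sketch: the helper name zipf_frequency and the value k = 3000 are assumptions chosen for illustration, not values taken from any corpus.

```python
def zipf_frequency(rank, k):
    """Ideal frequency under Zipf's law, f = k / r (hypothetical helper)."""
    return k / rank

k = 3000  # assumed constant, chosen only so the numbers come out round

f50 = zipf_frequency(50, k)    # frequency of the 50th most common word type
f150 = zipf_frequency(150, k)  # frequency of the 150th most common word type
ratio = f50 / f150             # the constant k cancels, leaving 150 / 50

print(f50, f150, ratio)
```

Whatever value k takes, the ratio depends only on the two ranks, which is why the statement holds without knowing k.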
A. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?
B. Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's law in the light of this?
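    import nltk
    import pylab
    from nltk.corpus import gutenberg as gb

    def validate_zipf(text, ranklimit):
        # Count alphabetic tokens only, ignoring punctuation and numbers.
        fdist = nltk.FreqDist([w for w in text if w.isalpha()])
        freq = []
        for key in fdist.keys():
            freq.append(fdist[key])
        # Frequencies of the ranklimit most common word types, highest first.
        y = sorted(freq, reverse=True)[:ranklimit]
        # Ranks start at 1 and match the number of frequencies actually kept.
        x = range(1, len(y) + 1)
        pylab.plot(x, y)
        pylab.show()

    def test():
        text = gb.words(fileids=['shakespeare-hamlet.txt'])
        validate_zipf(text, 150)

    test()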
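The hint about a logarithmic scale can be made concrete. If f × r = k, then log f = log k − log r, so on log-log axes the points should fall on a straight line with slope −1. The sketch below estimates that slope with an ordinary least-squares fit. The helper name log_log_slope and the synthetic frequency list are assumptions for illustration, and the fit is a rough check rather than a rigorous test of the law.

```python
import math

def log_log_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    freqs must be positive and sorted with the most frequent first;
    ranks are taken to start at 1. (Hypothetical helper.)
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic, perfectly Zipfian frequencies (assumed k = 1200), 150 ranks.
ideal = [1200 / r for r in range(1, 151)]
slope = log_log_slope(ideal)
print(slope)  # very close to -1 for ideal data
```

Passing the sorted frequencies of a real text to the same function gives a number to compare against −1, and pylab.loglog can be used in place of pylab.plot to draw the same data on logarithmic axes.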
The result of running the code is:
Zipf's Law
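Part B can be approached along the same lines. The following is a minimal sketch rather than a definitive solution: the helper names random_text and rank_frequencies, the text length of 200,000 characters, and the fixed seed are all assumptions. It also swaps the string concatenation operator suggested in the exercise for "".join, which builds the same string more efficiently.

```python
import random
from collections import Counter

def random_text(n_chars, alphabet="abcdefg ", seed=0):
    """Build a random string; the space in the alphabet creates word breaks."""
    rng = random.Random(seed)  # fixed seed (assumed) so runs are repeatable
    return "".join(rng.choice(alphabet) for _ in range(n_chars))

def rank_frequencies(tokens):
    """Frequencies of the distinct tokens, most frequent first."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)

text = random_text(200_000)
tokens = text.split()            # split on runs of spaces
freqs = rank_frequencies(tokens)
ranks = range(1, len(freqs) + 1)

print(len(tokens), len(freqs), freqs[:10])
```

The ranks and freqs sequences can then be plotted in the same way as the Hamlet frequencies. Random text of this kind tends to give a broadly similar downward curve on log-log axes, though with visible steps, since all random "words" of the same length are equally likely. That suggests Zipf-like curves are not by themselves evidence of anything specific to natural language.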