The "30" rule: the 30 highest-frequency words account for roughly 30% of all word occurrences in a full text. If the 150 highest-frequency words are excluded (terms with very high document frequency, DF, are treated as stop words), the total number of records in the inverted index shrinks by about 25-30%.

Zipf's law: in a natural-language corpus, the product of a term's frequency rank and its frequency is roughly constant: freq_1 * 1 ≈ freq_2 * 2 ≈ freq_3 * 3 ≈ ... ≈ freq_n * n. In other words, the second most frequent word occurs about half as often as the first, the third about one third as often, and so on.

Heaps' law: the number of distinct terms in a natural-language corpus grows as a power of the corpus size (roughly V ≈ K * n^β with β < 1). Two consequences follow: (1) as the number of documents grows without bound, the number of distinct terms does not converge to a constant; (2) as the number of documents increases, the growth rate of new distinct terms slows down and gradually flattens out.
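As a rough illustration of how the two corpus laws above can be checked, here is a minimal Python sketch. The file name `corpus.txt` is a placeholder, not part of the original text; any sufficiently large plain-text corpus will do, since tiny samples only loosely follow the laws. It tabulates rank × frequency for the top terms (Zipf) and tracks vocabulary growth as tokens stream in (Heaps).

```python
from collections import Counter
import re

def zipf_table(tokens, top=10):
    """Print rank, frequency, and rank*frequency for the most common terms.
    Under Zipf's law, rank*frequency should be roughly constant."""
    counts = Counter(tokens)
    for rank, (term, freq) in enumerate(counts.most_common(top), start=1):
        print(f"rank={rank:2d}  term={term:15s}  freq={freq:6d}  rank*freq={rank * freq}")

def heaps_curve(tokens, step=1000):
    """Print vocabulary size as the token stream grows.
    Under Heaps' law, distinct terms grow roughly as K * n**beta with beta < 1."""
    seen = set()
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n % step == 0:
            print(f"tokens={n:8d}  distinct terms={len(seen)}")

if __name__ == "__main__":
    # Hypothetical input file: any large plain-text corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())
    zipf_table(tokens)
    heaps_curve(tokens)
```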
Benford's law: in naturally occurring decimal data, the probability that the leading digit of a value is d is roughly log10(1 + 1/d).
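A quick way to see this distribution is to compare the theoretical probabilities with the leading digits of a sequence that is known to follow Benford's law closely, such as the powers of 2. A small Python sketch (the sample size of 5000 is an arbitrary choice):

```python
import math
from collections import Counter

# Theoretical Benford probabilities: P(d) = log10(1 + 1/d) for d = 1..9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Empirical check: leading digits of the first 5000 powers of 2.
leading = Counter(int(str(2 ** k)[0]) for k in range(1, 5001))
total = sum(leading.values())

print("digit  theoretical  powers-of-2")
for d in range(1, 10):
    print(f"{d:5d}  {benford[d]:11.4f}  {leading[d] / total:11.4f}")
```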
These are several laws from linguistic statistics that can serve as references when designing retrieval systems.