\data\
ngram 1=10
ngram 2=20
ngram 3=30
\1-grams:
-2.522091 AH -0.4599362
-3.616682-0.2710813
-5.888154 ABA
-5.483542 Abu -0.02341532
-5.513821 Adidas -0.08972257
-5.357502 elder brother
-5.619849 Gelatin
-5.003489 Allah -0.0459251
-5.11305 Arabian -0.1348525
-5.11305 Arabic numerals -0.153861
\2-grams:
-2.841684-O-nan
-1.279527 Abuja
-0.7184195 Adidas </s>
-1.628645 Arahara
-1.628414 ALA
-1.272437 Alashan
-1.37447 Arabian Nobles
-1.122427 Arabs
-1.373596 Arabian numbers
-0.9671616 Arabic
\3-grams:
-0.7579774 Ah </s>
-0.3643477 Ah, yes.
-1.625012 Ah, yes.
-1.826232 Ah, yes, okay.
-0.1952119 Love </s>
-0.1937787 arrangement Ah </s>
-0.2185729 Safe </s>
-0.1328749 Installation </s>
-0.3589647 Bar </s>
-1.99777, bye.
* All values above are base-10 logarithms: the number before each phrase is its log probability, and the number after the phrase (when present) is its backoff weight.
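As a rough illustration of the layout described above, the following sketch reads ARPA text into a Python dictionary keyed by the space-joined n-gram. The names `Entry` and `parse_arpa` are made up for this example, and it assumes well-formed, whitespace-separated lines in which the backoff weight is optional. It is not SRILM's own reader.

```python
from collections import namedtuple

# Hypothetical record type: log10 probability and log10 backoff weight.
Entry = namedtuple("Entry", ["prob", "backoff"])

def parse_arpa(text):
    """Parse ARPA-format text into {"w1 w2 ...": Entry}."""
    table = {}
    order = 0  # order of the section being read; 0 means "not in a section"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            if line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
            else:
                order = 0
            continue
        if order == 0:
            continue  # skip the "ngram N=count" header lines
        fields = line.split()
        prob = float(fields[0])
        words = fields[1:1 + order]
        # The backoff weight is optional; 0.0 (log10 of 1) means "no penalty".
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        table[" ".join(words)] = Entry(prob, backoff)
    return table

# Tiny invented model, only to exercise the parser.
sample = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-1.0\t<unk>
-0.5\t</s>
-0.7\thello\t-0.3

\\2-grams:
-0.2\thello </s>

\\end\\
"""

table = parse_arpa(sample)
print(table["hello"])
print(table["hello </s>"])
```

Storing 0.0 for a missing backoff weight keeps the later lookup code free of special cases, since adding 0.0 in the log domain is the same as multiplying by 1.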
The probability of a word sequence is computed from the ARPA file as follows (using a 3-gram as the example):
# Make sure OOV words have been replaced with <unk> first.
# P(word3 | word1, word2):
#   if has(word1 word2 word3) {
#       return P(word3 | word1, word2);
#   } else if has(word1 word2) {
#       return Backoff(word1 word2) * P(word3 | word2);
#   } else {
#       return P(word3 | word2);
#   }
# P(word2 | word1):
#   if has(word1 word2) {
#       return P(word2 | word1);
#   } else {
#       return Backoff(word1) * P(word2);
#   }
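Because the ARPA file stores base-10 logarithms, the products in the pseudocode above turn into sums when the stored values are used directly. Written out for the 3-gram case, with $b(\cdot)$ denoting the stored backoff weight and $\ell(\cdot)$ the stored log probability (both symbols are notation chosen here for illustration):

```latex
\log_{10} P(w_3 \mid w_1, w_2) =
\begin{cases}
\ell(w_1\, w_2\, w_3) & \text{if } w_1\, w_2\, w_3 \text{ is listed,} \\
b(w_1\, w_2) + \log_{10} P(w_3 \mid w_2) & \text{else if } w_1\, w_2 \text{ is listed,} \\
\log_{10} P(w_3 \mid w_2) & \text{otherwise,}
\end{cases}
\qquad
\log_{10} P(w_2 \mid w_1) =
\begin{cases}
\ell(w_1\, w_2) & \text{if } w_1\, w_2 \text{ is listed,} \\
b(w_1) + \ell(w_2) & \text{otherwise.}
\end{cases}
```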
Python implementation:
def wordsprobs(words, ngrams):
    # words:  a space-separated 1-, 2- or 3-gram, with OOV words already mapped to <unk>
    # ngrams: maps each n-gram string to an entry with log10 fields .prob and .backoff
    #         (.backoff is assumed to be 0.0 when the file lists no backoff weight)
    # Values are logarithms, so the products in the pseudocode become sums here.
    wordarr = words.split(" ")
    if len(wordarr) == 3:
        context = wordarr[0] + " " + wordarr[1]
        if words in ngrams:
            return ngrams[words].prob
        elif context in ngrams:
            return ngrams[context].backoff + wordsprobs(wordarr[1] + " " + wordarr[2], ngrams)
        else:
            return wordsprobs(wordarr[1] + " " + wordarr[2], ngrams)
    elif len(wordarr) == 2:
        if words in ngrams:
            return ngrams[words].prob
        else:
            # Raises KeyError unless OOV words were mapped to <unk> beforehand.
            return ngrams[wordarr[0]].backoff + wordsprobs(wordarr[1], ngrams)
    else:
        # Raises KeyError unless OOV words were mapped to <unk> beforehand.
        return ngrams[wordarr[0]].prob
* The value returned above is logP(3gramWords), a base-10 logarithm. The final probability of 3gramWords is probs = 10 ^ logP(3gramWords).
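To connect the lookup and the final conversion, the sketch below scores a whole sentence by summing per-word log probabilities and then applying 10 ** logp. Everything in it is illustrative: the toy `table` and its numbers, the helper names `logprob` and `sentence_logprob`, and the choice to pad the sentence with `<s>` and `</s>` are assumptions of this example, not part of the code above.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["prob", "backoff"])  # both values are log10

# Toy model invented for this example; the numbers are not from a real corpus.
table = {
    "<unk>": Entry(-2.0, 0.0),
    "<s>": Entry(-99.0, -0.5),
    "</s>": Entry(-1.0, 0.0),
    "hello": Entry(-1.0, -0.25),
    "<s> hello": Entry(-0.5, -0.25),
    "hello </s>": Entry(-0.75, 0.0),
    "<s> hello </s>": Entry(-0.25, 0.0),
}

def logprob(words, table):
    """log10 P(last word | preceding words), backing off to shorter contexts."""
    key = " ".join(words)
    if key in table:
        return table[key].prob
    if len(words) == 1:
        raise KeyError("map OOV words to <unk> first: " + key)
    context = " ".join(words[:-1])
    # A context that is not listed contributes a backoff of 0.0 (log10 of 1).
    backoff = table[context].backoff if context in table else 0.0
    return backoff + logprob(words[1:], table)

def sentence_logprob(sentence, table, order=3):
    """Sum the log10 probability of every token after <s>, including </s>."""
    tokens = [w if w in table else "<unk>" for w in sentence.split()]
    tokens = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(1, len(tokens)):  # <s> itself is never predicted
        history = tokens[max(0, i - order + 1):i + 1]
        total += logprob(history, table)
    return total

logp = sentence_logprob("hello", table)
prob = 10 ** logp  # convert the log10 score back to a probability
print(logp, prob)
```

Summing in the log domain and converting once at the end avoids the underflow that multiplying many small probabilities would cause for longer sentences.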
The ARPA N-gram language model format based on SRILM