Dice's coefficient (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:[1]

$$ s = \frac{2\,|X \cap Y|}{|X| + |Y|} $$
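As a rough illustration of the set form, the following Python sketch takes the coefficient to be twice the size of the intersection divided by the sum of the two set sizes. The function name `dice_coefficient`, the sample keyword sets, and the handling of two empty sets are assumptions made for this example, not part of the cited definition.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Return 2|X ∩ Y| / (|X| + |Y|) for two sets."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / total


# Hypothetical keyword sets, for illustration only.
doc_keywords = {"retrieval", "index", "query"}
query_keywords = {"query", "index", "ranking", "score"}

print(dice_coefficient(doc_keywords, query_keywords))
```

Here the two sets share two keywords and have three and four elements respectively, so the result is 4/7.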
When taken as a string similarity measure, the coefficient may be calculated for two strings, X and Y, using bigrams as follows:[2]

$$ s = \frac{2\,n_t}{n_X + n_Y} $$

where $n_t$ is the number of character bigrams found in both strings, $n_X$ is the number of bigrams in string X and $n_Y$ is the number of bigrams in string Y. For example, to calculate the similarity between:
- night
- nacht
We would find the set of bigrams in each word:
- {ni, ig, gh, ht}
- {na, ac, ch, ht}
Each set has four elements, and the intersection of these two sets has only one element: ht.
Plugging these values into the formula, we calculate s = (2 · 1) / (4 + 4) = 0.25.
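The worked example can be reproduced with a short Python sketch. It treats each word's bigrams as a set, as the example does, so repeated bigrams are counted once. The names `bigrams` and `dice_similarity`, and the behaviour for strings too short to contain a bigram, are assumptions introduced for illustration.

```python
def bigrams(s: str) -> set:
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice's coefficient over the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    total = len(x) + len(y)
    if total == 0:
        # Assumption: with no bigrams on either side, fall back to
        # exact string equality.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / total


print(sorted(bigrams("night")))           # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))           # ['ac', 'ch', 'ht', 'na']
print(dice_similarity("night", "nacht"))  # 0.25
```

An implementation that counts repeated bigrams (a multiset) would give different values for strings such as "aaaa", so the choice between sets and multisets should be made explicitly.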
See also
- Jaccard index
- Levenshtein distance
- Sørensen similarity index
Notes
- C. J. van Rijsbergen (1979)
- Kondrak, G. et al. (2003)
References
- C. J. van Rijsbergen (1979) Information Retrieval (London: Butterworths)
- Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46-48