Analysis and Application of the Cosine Theorem and SimHash Text Similarity Algorithms in .NET
The principle of cosine similarity: first split both texts into words and list all the words, then count the frequency of each word, and finally turn the word counts into vectors; computing the similarity of the two texts then reduces to computing the similarity of the two vectors. To illustrate briefly:

Text 1: I / love / Beijing / Tiananmen Square
Text 2: we / all / Beijing / Tiananmen Square

Over the combined vocabulary (I, love, we, all, Beijing, Tiananmen Square), segmentation gives text 1 the word-frequency vector (pseudo vector) [1, 1, 0, 0, 1, 1] and text 2 the vector [0, 0, 1, 1, 1, 1]. We can think of the two vectors as two line segments in space, both starting from the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle: if the angle is 0 degrees, the directions are identical and the segments coincide; if it is 90 degrees, the directions are completely different; if it is 180 degrees, the directions are opposite. We can therefore judge how similar two vectors are by this angle: the smaller the angle, the more similar they are. For the two pseudo vectors above, the inner product is 2 and each vector has length 2 (the square root of 4), so the cosine similarity is 2 / (2 * 2) = 0.5.

The C# core algorithm:

using System;
using System.Collections;

public class TFIDFMeasure
{
    private string[] _docs;
    private string[][] _ngramDoc;
    private int _numDocs = 0;
    private int _numTerms = 0;
    private ArrayList _terms;
    private int[][] _termFreq;     // _termFreq[term][doc] = raw count of term in doc
    private float[][] _termWeight; // _termWeight[term][doc] = tf-idf weight
    private int[] _maxTermFreq;    // highest raw count in each doc (for tf normalization)
    private int[] _docFreq;        // number of docs containing each term

    public class TermVector
    {
        // Cosine similarity = inner product / (product of vector lengths).
        public static float ComputeCosineSimilarity(float[] vector1, float[] vector2)
        {
            if (vector1.Length != vector2.Length)
                throw new Exception("DIFFER LENGTH ARE NOT ALLOWED");
            float denom = VectorLength(vector1) * VectorLength(vector2);
            if (denom == 0F)
                return 0F;
            return InnerProduct(vector1, vector2) / denom;
        }

        public static float InnerProduct(float[] vector1, float[] vector2)
        {
            if (vector1.Length != vector2.Length)
                throw new Exception("DIFFER LENGTH ARE NOT ALLOWED");
            float result = 0F;
            for (int i = 0; i < vector1.Length; i++)
                result += vector1[i] * vector2[i];
            return result;
        }

        public static float VectorLength(float[] vector)
        {
            float sum = 0.0F;
            for (int i = 0; i < vector.Length; i++)
                sum += vector[i] * vector[i];
            return (float)Math.Sqrt(sum);
        }
    }

    private IDictionary _wordsIndex = new Hashtable(); // term -> index into _terms

    public TFIDFMeasure(string[] documents)
    {
        _docs = documents;
        _numDocs = documents.Length;
        MyInit();
    }

    // Collect the distinct terms found across all documents.
    private ArrayList GenerateTerms(string[] docs)
    {
        ArrayList uniques = new ArrayList();
        _ngramDoc = new string[_numDocs][];
        for (int i = 0; i < docs.Length; i++)
        {
            Tokeniser tokenizer = new Tokeniser();
            // Lower-case here so terms match GetWordFrequency, which also lower-cases.
            string[] words = tokenizer.Partition(docs[i].ToLower());
            for (int j = 0; j < words.Length; j++)
                if (!uniques.Contains(words[j]))
                    uniques.Add(words[j]);
        }
        return uniques;
    }

    private static object AddElement(IDictionary collection, object key, object newValue)
    {
        object element = collection[key];
        collection[key] = newValue;
        return element;
    }

    private int GetTermIndex(string term)
    {
        object index = _wordsIndex[term];
        if (index == null) return -1;
        return (int)index;
    }

    private void MyInit()
    {
        _terms = GenerateTerms(_docs);
        _numTerms = _terms.Count;
        _maxTermFreq = new int[_numDocs];
        _docFreq = new int[_numTerms];
        _termFreq = new int[_numTerms][];
        _termWeight = new float[_numTerms][];
        for (int i = 0; i < _terms.Count; i++)
        {
            _termWeight[i] = new float[_numDocs];
            _termFreq[i] = new int[_numDocs];
            AddElement(_wordsIndex, _terms[i], i);
        }
        GenerateTermFrequency();
        GenerateTermWeight();
    }

    private float Log(float num)
    {
        return (float)Math.Log(num); // natural logarithm
    }

    // Fill _termFreq, _docFreq and _maxTermFreq from the raw word counts.
    private void GenerateTermFrequency()
    {
        for (int i = 0; i < _numDocs; i++)
        {
            string curDoc = _docs[i];
            IDictionary freq = GetWordFrequency(curDoc);
            IDictionaryEnumerator enums = freq.GetEnumerator();
            _maxTermFreq[i] = int.MinValue;
            while (enums.MoveNext())
            {
                string word = (string)enums.Key;
                int wordFreq = (int)enums.Value;
                int termIndex = GetTermIndex(word);
                _termFreq[termIndex][i] = wordFreq;
                _docFreq[termIndex]++;
                if (wordFreq > _maxTermFreq[i])
                    _maxTermFreq[i] = wordFreq;
            }
        }
    }

    private void GenerateTermWeight()
    {
        for (int i = 0; i < _numTerms; i++)
            for (int j = 0; j < _numDocs; j++)
                _termWeight[i][j] = ComputeTermWeight(i, j);
    }

    // tf: raw count normalized by the document's maximum count.
    private float GetTermFrequency(int term, int doc)
    {
        int freq = _termFreq[term][doc];
        int maxfreq = _maxTermFreq[doc];
        return (float)freq / (float)maxfreq;
    }

    // idf: log of (total documents / documents containing the term).
    private float GetInverseDocumentFrequency(int term)
    {
        int df = _docFreq[term];
        return Log((float)_numDocs / (float)df);
    }

    private float ComputeTermWeight(int term, int doc)
    {
        float tf = GetTermFrequency(term, doc);
        float idf = GetInverseDocumentFrequency(term);
        return tf * idf;
    }

    // The tf-idf vector of one document over all terms.
    private float[] GetTermVector(int doc)
    {
        float[] w = new float[_numTerms];
        for (int i = 0; i < _numTerms; i++)
            w[i] = _termWeight[i][doc];
        return w;
    }

    public float GetSimilarity(int doc_i, int doc_j)
    {
        float[] vector1 = GetTermVector(doc_i);
        float[] vector2 = GetTermVector(doc_j);
        return TermVector.ComputeCosineSimilarity(vector1, vector2);
    }

    // Count each distinct word in one document.
    private IDictionary GetWordFrequency(string input)
    {
        string convertedInput = input.ToLower();
        Tokeniser tokenizer = new Tokeniser();
        string[] words = tokenizer.Partition(convertedInput);
        Array.Sort(words);
        string[] distinctWords = GetDistinctWords(words);
        IDictionary result = new Hashtable();
        for (int i = 0; i < distinctWords.Length; i++)
            result[distinctWords[i]] = CountWords(distinctWords[i], words);
        return result;
    }

    private string[] GetDistinctWords(string[] input)
    {
        if (input == null)
            return new string[0];
        ArrayList list = new ArrayList();
        for (int i = 0; i < input.Length; i++)
            if (!list.Contains(input[i]))
                list.Add(input[i]);
        return Tokeniser.ArrayListToArray(list);
    }

    // words is sorted, so back up to the first occurrence and count forward.
    private int CountWords(string word, string[] words)
    {
        int itemIdx = Array.BinarySearch(words, word);
        if (itemIdx > 0)
            while (itemIdx > 0 && words[itemIdx].Equals(word))
                itemIdx--;
        int count = 0;
        while (itemIdx < words.Length && itemIdx >= 0)
        {
            if (words[itemIdx].Equals(word)) count++;
            itemIdx++;
            if (itemIdx < words.Length && !words[itemIdx].Equals(word))
                break;
        }
        return count;
    }
}
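The Tokeniser class used by TFIDFMeasure is external and not shown in the article. A minimal stand-in that simply splits on whitespace, plus a small usage example built from the sample sentences that appear later in this article, might look like the following; both the stand-in and the Main method are illustrative sketches, not the article's code:

using System;
using System.Collections;

public class Tokeniser
{
    // Minimal stand-in: break a document into words on whitespace.
    public string[] Partition(string input)
    {
        return input.Split(new[] { ' ', '\t', '\r', '\n' },
                           StringSplitOptions.RemoveEmptyEntries);
    }

    public static string[] ArrayListToArray(ArrayList list)
    {
        return (string[])list.ToArray(typeof(string));
    }
}

class Program
{
    static void Main()
    {
        string[] docs =
        {
            "the cat sat on the mat",
            "the cat sat on a mat",
            "we all scream for ice cream"
        };
        TFIDFMeasure tfidf = new TFIDFMeasure(docs);
        Console.WriteLine(tfidf.GetSimilarity(0, 1)); // similar sentences: nonzero
        Console.WriteLine(tfidf.GetSimilarity(0, 2)); // no shared words: 0
    }
}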
The disadvantage of this approach is that an article has a great many feature terms, so the vectors have very high dimensionality; the computation becomes too costly and the method does not scale to large data volumes. The main idea of the SimHash algorithm is dimensionality reduction: map the high-dimensional feature vector to an f-bit fingerprint, then judge whether two articles are duplicates or highly similar by comparing the Hamming distance of their f-bit fingerprints. Because the fingerprint of every article can be computed once and stored, a comparison is just a Hamming-distance calculation on two small bit strings, which is very fast and well suited to big-data settings. Google uses this algorithm to deduplicate web pages.
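Comparing fingerprints is the cheap part: the Hamming distance of two 32-bit fingerprints is simply the number of bit positions where they differ, computable with an XOR followed by a bit count. A minimal C# sketch (the helper name is mine, not from the article):

// Hamming distance between two 32-bit fingerprints.
// a ^ b has a 1 exactly at the positions where the fingerprints differ.
static int HammingDistance(int a, int b)
{
    int x = a ^ b;
    int dist = 0;
    while (x != 0)
    {
        x &= x - 1; // clear the lowest set bit
        dist++;
    }
    return dist;
}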
Suppose we have the following three pieces of text:

1. the cat sat on the mat
2. the cat sat on a mat
3. we all scream for ice cream

How can we implement this hash algorithm? Taking the three texts above as an example, the whole process can be divided into the following six steps (a C# sketch of all six follows the list):

1. Choose the number of bits of the simhash, weighing storage cost against the size of the dataset; for example, 32 bits.
2. Initialize the simhash accumulator to 0.
3. Extract features from the original text using some word-segmentation scheme. For example, segmenting "the cat sat on the mat" into overlapping letter 2-grams (spaces included) yields the distinct features {"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}.
4. Compute the hashcode of each feature with a conventional 32-bit hash function, e.g. "th".hash = -502157718, "he".hash = -369049682, ...
5. For each bit of each feature's hashcode: if the bit is 1, add 1 to the corresponding position of the simhash accumulator; otherwise subtract 1.
6. For the final 32-bit simhash: wherever a position's accumulated value is greater than 0, set that bit to 1; otherwise set it to 0.
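Putting the six steps together, a minimal C# sketch might look as follows. It uses string.GetHashCode() as a stand-in for the "conventional 32-bit hash function" of step 4, so its output will not match the example hashcodes above; on recent .NET versions GetHashCode() is also randomized per process, so a stable hash function should be substituted if fingerprints are to be stored:

using System;
using System.Collections.Generic;

public static class SimHash
{
    // Step 3: extract features as overlapping character 2-grams.
    private static IEnumerable<string> Bigrams(string text)
    {
        for (int i = 0; i < text.Length - 1; i++)
            yield return text.Substring(i, 2);
    }

    // Steps 1-6: compute a 32-bit simhash fingerprint.
    public static int Compute(string text)
    {
        int[] counts = new int[32];               // step 2: all counters start at 0
        foreach (string feature in Bigrams(text))
        {
            int hash = feature.GetHashCode();     // step 4: ordinary 32-bit hash
            for (int bit = 0; bit < 32; bit++)
            {
                if ((hash & (1 << bit)) != 0)     // step 5: +1 where the bit is 1,
                    counts[bit]++;                //         -1 where it is 0
                else
                    counts[bit]--;
            }
        }
        int fingerprint = 0;
        for (int bit = 0; bit < 32; bit++)        // step 6: positive counter => bit 1
            if (counts[bit] > 0)
                fingerprint |= 1 << bit;
        return fingerprint;
    }
}

Combined with the HammingDistance helper above, SimHash.Compute("the cat sat on the mat") and SimHash.Compute("the cat sat on a mat") should yield fingerprints a small Hamming distance apart, while the fingerprint of "we all scream for ice cream" should land much farther from both.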