Calculating Document Similarity


From: http://blog.chinaunix.net/uid-26548237-id-3541783.html

1. The Vector Space Model
The vector space model is an algebraic model that represents text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing, and relevancy ranking.
Both documents and queries are represented as vectors.

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Many different ways of computing these values, known as (term) weights, have been developed; one of the best-known schemes is tf-idf weighting. What counts as a term depends on the application: typically a term is a single word, a keyword, or a longer phrase. If single words are chosen as terms, the dimensionality of the vector is the number of distinct words in the vocabulary. Vector operations can then be used to compare documents against queries.

Under the assumptions of document-similarity theory, the relevance of each document is determined by comparing its vector with the original query vector (the two vectors are of the same type). In practice, it is easier to compute the cosine of the angle between the two vectors than to compute the angle directly.

cos θ = (D2 · Q) / (||D2|| × ||Q||)

Here D2 · Q is the dot product of the document vector D2 and the query vector Q, and the denominator is the product of the two vectors' norms. The norm of a vector V is calculated as:

||V|| = sqrt(V1² + V2² + ... + Vn²)

Because all vectors in this model are component-wise non-negative, a cosine of zero means the query vector and the document vector are orthogonal, i.e. there is no match (none of the query terms appear in the document), and the similarity between the two documents is 0%.

A common choice of term weight is the tf-idf weight.


Advantages:
Compared with the standard Boolean model, the vector space model has the following advantages:
1. A simple model based on linear algebra;
2. Term weights are not binary;
3. A continuous degree of similarity between queries and documents can be computed;
4. Documents can be ranked according to their possible relevance;
5. Partial matching is allowed.

Limitations:
1. Long documents are represented poorly because their similarity values are not ideal;
2. Search terms must exactly match the terms in the document;
3. Documents with similar context but different vocabulary cannot be associated (poor semantic sensitivity);
4. The order in which terms appear in the document is lost in the vector representation;
5. Terms are assumed to be statistically independent;
6. The weighting is intuitive but not very formal.

2. Using the Vector Space Model
The following shows how to use the vector space model to calculate document similarity, implemented via the cosine measure described above.
The weights in this implementation are simply raw word frequencies, and the comparison is between English texts.

#include <iostream>
#include <map>
#include <string>
#include <cctype>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/stat.h>
using namespace std;

#define ERROR -1
#define OK 0
#define DEBUG

// Separators used to strip irrelevant characters from the text
// (space, tab and newline are included so the text splits into words)
const char delim[] = ".,:;'`/\"+-_(){}[]<>*&^%$#@!?~|\\=1234567890 \t\r\n";

char *strtolower(char *word)
{
    for (char *s = word; *s; s++)
    {
        *s = tolower(*s);
    }
    return word;
}

int readfile(const char *text_name, map<string, int> &word_count)
{
    struct stat sb;

    FILE *fp = fopen(text_name, "r");
    if (fp == NULL)
    {
        return ERROR;
    }
    if (stat(text_name, &sb))
    {
        fclose(fp);
        return ERROR;
    }
    char *file = (char *)malloc(sb.st_size + 1); // +1 for the terminating '\0'
    if (file == NULL)
    {
        fclose(fp);
        return ERROR;
    }
    size_t nread = fread(file, sizeof(char), sb.st_size, fp);
    file[nread] = '\0'; // strtok() requires a NUL-terminated string
    fclose(fp);

    char *word = strtok(file, delim);
    while (word != NULL)
    {
        // Skip words of length <= 1
        if (strlen(word) <= 1)
        {
            word = strtok(NULL, delim);
            continue;
        }
        char *str = strtolower(strdup(word));
        string tmp = str;
        free(str);
        word_count[tmp]++;
        word = strtok(NULL, delim);
    }
    free(file);
    return OK;
}

int main(int argc, char **argv)
{
    const char *text_name_one = "./big.txt";
    //const char *text_name_one = "./1.txt";
    const char *text_name_two = "./big.txt";
    //const char *text_name_two = "./2.txt";
    map<string, int> word_count_one;
    map<string, int> word_count_two;
    double multi_one = 0.0;   // dot product of the two vectors
    double multi_two = 0.0;   // squared norm of vector 1
    double multi_third = 0.0; // squared norm of vector 2

    if (readfile(text_name_one, word_count_one) == ERROR)
    {
        cout << "readfile() error." << endl;
        return ERROR;
    }
#ifdef DEBUG
    map<string, int>::iterator map_first = word_count_one.begin();
    for (; map_first != word_count_one.end(); map_first++)
    {
        cout << map_first->first << " " << map_first->second << endl;
    }
#endif
    if (readfile(text_name_two, word_count_two) == ERROR)
    {
        cout << "readfile() error." << endl;
        return ERROR;
    }
#ifdef DEBUG
    map<string, int>::iterator map_second = word_count_two.begin();
    for (; map_second != word_count_two.end(); map_second++)
    {
        cout << map_second->first << " " << map_second->second << endl;
    }
#endif
    map<string, int>::iterator map_one = word_count_one.begin();
    map<string, int>::iterator map_tmp;
    for (; map_one != word_count_one.end(); map_one++)
    {
        map_tmp = word_count_two.find(map_one->first);
        if (map_tmp == word_count_two.end())
        {
            // Word occurs only in text 1
            multi_two += map_one->second * map_one->second;
            continue;
        }
        multi_one += map_one->second * map_tmp->second;
        multi_two += map_one->second * map_one->second;
        multi_third += map_tmp->second * map_tmp->second;
        word_count_two.erase(map_one->first); // remove matched words
    }
    // Account for words that occur only in text 2
    for (map_tmp = word_count_two.begin(); map_tmp != word_count_two.end(); map_tmp++)
    {
        multi_third += map_tmp->second * map_tmp->second;
    }
    multi_two = sqrt(multi_two);
    multi_third = sqrt(multi_third);
    double result = multi_one / (multi_two * multi_third);
    cout << "Similarity: " << result * 100 << "%" << endl;
    return 0;
}


The following tests were performed.
First, two identical English texts were compared; the text is available at http://norvig.com/big.txt

After counting the words in the text, the program reports a similarity of 100% between the two identical texts.

Second, text 1 contains "this is one" and text 2 contains "this is two".

The program's output matches a manual calculation: the similarity between the two texts is 66.6667%.
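The manual check works out as follows, using the vocabulary {this, is, one, two} and raw counts as weights:

```latex
\begin{aligned}
D_1 &= (1, 1, 1, 0) && \text{``this is one''} \\
D_2 &= (1, 1, 0, 1) && \text{``this is two''} \\
\cos\theta &= \frac{D_1 \cdot D_2}{\|D_1\|\,\|D_2\|}
            = \frac{1 + 1 + 0 + 0}{\sqrt{3}\,\sqrt{3}}
            = \frac{2}{3} \approx 66.6667\%
\end{aligned}
```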




The above is only a simple, entry-level calculation of the similarity between two English texts; it does not involve semantics, so it is relatively basic.
I am very interested in this area and will continue to study related topics.


Theoretical background: http://zh.wikipedia.org/wiki/%E5%90%91%E9%87%8F%E7%A9%BA%E9%96%93%E6%A8%A1%E5%9E%8B
