To complete the assignments for my Web Search course, I struggled for two days to implement hierarchical agglomerative clustering (HAC) and the Affinity Propagation (message-passing) clustering algorithm. For both algorithms, the first step is to compute the document vectors. Each index word in the text collection forms one dimension of the vector space, so M index words give M-dimensional feature vectors. I used std::map heavily to build these feature vectors, because I need to look up the frequency of a given index word in a document. Here are some of my experiences to share with you:
1. operator[]. The [] operator is very useful: it not only returns a reference to the value stored under a key, it can also insert. First, a basic usage:
using namespace std;
...
map<string,int> elem;
....
//insert operation
...
//get inserted value
string keyword;
int freq = elem[keyword];
In this way, the value corresponding to a key in the map can be obtained. But what happens if the keyword does not exist in the map? This is where the insert behavior of [] comes in: if the key is not in the map, operator[] inserts a new pair and calls the default constructor of the mapped type. Let's verify with code!
struct numidf
{
    int num;
    bool showup;
    numidf()
    {
        num = 0;
        showup = false;
        cout << "set to 0 and false" << endl;
    }
};
...
map<string, numidf> m_idf;
// insert elements
...
// query elements
string newkeyword; // a word that m_idf does not contain
if (!m_idf[newkeyword].showup)
{
    cout << "construct a new one" << endl;
}
The output of the above code is:
set to 0 and false
construct a new one
That is to say, when a new key is passed to [], the map automatically adds a new pair. The key of the new pair is the given newkeyword, and the mapped value is a default-constructed instance. This feature is very convenient. I used to call find first and, if the key turned out to be new, insert it manually, which was more complicated.
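One thing worth keeping in mind is that even a plain read through [] inserts the key. The sketch below contrasts the two approaches; the helper names lookupCount and countWord are made up for illustration and are not part of my assignment code.

```cpp
#include <map>
#include <string>

// Hypothetical helper: looks a word up WITHOUT inserting it.
// Returns the stored count, or 0 if the word is absent.
int lookupCount(const std::map<std::string, int>& counts, const std::string& word)
{
    std::map<std::string, int>::const_iterator it = counts.find(word);
    return it == counts.end() ? 0 : it->second;
}

// Hypothetical helper: counts a word with operator[].
// A missing key is first inserted with a value-initialized int (0),
// then the increment runs on the stored value.
void countWord(std::map<std::string, int>& counts, const std::string& word)
{
    ++counts[word];
}
```

With these helpers, calling countWord twice for the same word leaves one entry with the value 2, and lookupCount on an unknown word leaves the map size unchanged, whereas reading counts[word] for an unknown word would grow the map by one entry.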
2. Using map iterators
To be honest, I rarely use iterators, so I made several basic mistakes. This article is also a reminder to myself. The mistake went like this: I wanted to implement something similar to the following code.
vector<int> a;
for (size_t i = 0; i + 1 < a.size(); i++)
{
    for (size_t j = i + 1; j < a.size(); j++)
    {
        // some operation about i and j
    }
}
I wanted to do the same thing with iterators, which led to the following tragic scene:
map<string, int>::iterator iterI;
....
// This is wrong!
int i = 5;
iterI = iterI + i;
// This is wrong!
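I assumed that iterator + n would jump n elements ahead. It does not compile, and it produces a pile of errors! The reason is that map iterators are bidirectional rather than random-access, so they only support ++ and --, not +. So I used the following method: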
1 map<string,int> m_Tree;
2 map<string,int>::iterator iterI = m_Tree.begin();
3 map<string,int>::iterator iterJ;
4 int i = 0;
5 for( ; i + 1 < (int)m_Tree.size(); ++iterI,i++) // i + 1 < size avoids unsigned wrap-around when the map is empty
6 {
7 //iterJ = m_Tree.begin();
8 //advance(iterJ, i+1);
9 iterJ = iterI;
10 iterJ++;
11 for(; iterJ != m_Tree.end(); iterJ++)
12 {
13 float s = S((iterI->dvmap),(iterJ->dvmap));
14 if(s > mostSim)
15 {//this is the pair
16 mostSim = s;
17 sp.s1 = iterI;
18 sp.s2 = iterJ;
19 }
20 }
21 }
I want iterJ to point to the element after iterI, so I used lines 9 and 10. Lines 7 and 8 would also work, but they are less efficient than lines 9 and 10, because advance has to walk from the beginning of the map each time. If you know a better way to do this, please let me know!
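One possible tidier form, assuming a C++11 compiler is available, is std::next from <iterator>, which returns a copy of the iterator moved forward by one step and leaves the original untouched. The sketch below only counts the pairs; countPairs is a name I made up, and the real similarity computation would replace the counter.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Visits every unordered pair (i, j) with i before j and returns how many
// pairs were seen. Replace the counter with the real per-pair work.
std::size_t countPairs(const std::map<std::string, int>& m)
{
    typedef std::map<std::string, int>::const_iterator Iter;
    std::size_t pairs = 0;
    for (Iter i = m.begin(); i != m.end(); ++i)
    {
        // std::next (C++11) returns a copy advanced by one; i is unchanged.
        for (Iter j = std::next(i); j != m.end(); ++j)
        {
            ++pairs;   // some operation on i and j goes here
        }
    }
    return pairs;
}
```

This version needs no separate integer counter: on the last element std::next(i) is already end(), so the inner loop simply does not run, and an empty map never enters the outer loop. On a pre-C++11 compiler, copying the iterator and incrementing the copy does the same job.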
3. For performance, do not let the STL copy memory. Pass pointers!
Document similarity is computed on high-dimensional document vectors, and I store each of these large vectors in a std::vector. I noticed two points regarding performance.
1) Use reserve to allocate enough memory up front, in preparation for push_back.
2) Be careful where the vector ends up. If a very large vector<float> is declared locally in a function body and is then copied into a private member of the class, a large amount of memory has to be copied.
Based on these two points, I used the following method:
1 vector<float> dv;
2 pair<map<string,vector<float> >::iterator,bool> pr;
3 pr = m_TF_IDF.insert(pair<string,vector<float> >(filename, dv));
4 vector<float>& rkdv = pr.first->second;
5 rkdv.reserve(m_IDF.size());
m_TF_IDF is a private member of the class. I first insert an empty vector and then take a reference to the vector stored in the map, as shown in line 4. New data can then be pushed back through that reference directly into the stored vector, so the large vector never has to be copied.
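The same idea can be sketched as a small standalone function. This is only an illustration under my own assumptions: buildInPlace and VectorTable are made-up names, std::make_pair stands in for the explicit pair constructor used above, and the pushed values are placeholders rather than real TF-IDF weights.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::map<std::string, std::vector<float> > VectorTable;

// Hypothetical helper: inserts an empty vector under `name`, reserves room
// for `dims` entries, then fills it in place through a reference.
// If `name` is already present, insert() leaves the existing vector in place
// and the new values are appended to it.
std::vector<float>& buildInPlace(VectorTable& table, const std::string& name,
                                 std::size_t dims)
{
    // Copying an empty vector into the map is cheap.
    std::pair<VectorTable::iterator, bool> pr =
        table.insert(std::make_pair(name, std::vector<float>()));
    std::vector<float>& stored = pr.first->second;
    stored.reserve(dims);            // one allocation up front
    for (std::size_t i = 0; i < dims; ++i)
    {
        stored.push_back(0.0f);      // placeholder weight, written in place
    }
    return stored;
}
```

The returned reference points at the vector inside the map, so the caller can keep working on the stored data without ever holding a second full copy.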
Passing pointers is a fast and efficient way to avoid copying memory. In my program, the huge vector<float> mentioned above is needed in several places that I cannot predict in advance. So that any code that needs a vector<float> can use it, I pass a pointer to the vector<float>. I defined the following struct:
struct dvPair
{
string names;
map<string,vector<float>*> dvmap;
};
I store a pointer to the vector<float> in the map instead of the vector<float> itself!
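As a rough sketch of how such a struct might be filled, the function below stores the address of a vector owned by another map. The attach helper and the owner map are assumptions for illustration only. The sketch relies on the fact that pointers to std::map elements stay valid while other elements are inserted, and become invalid only when that element is erased or the owning map is destroyed.

```cpp
#include <map>
#include <string>
#include <vector>

struct dvPair
{
    std::string names;
    std::map<std::string, std::vector<float>*> dvmap;
};

// Hypothetical helper: registers a pointer to a vector owned by `owner`.
// Only the address is stored; the vector's elements are not copied.
// Note that owner[name] creates an empty vector if `name` is missing.
void attach(dvPair& cluster,
            std::map<std::string, std::vector<float> >& owner,
            const std::string& name)
{
    cluster.dvmap[name] = &owner[name];
}
```

Because only addresses are stored, the owning map has to outlive every dvPair that points into it.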
That's all. No more.