TermsHashPerField class
 
 
 
 
 
 
 
I. Class overview:

TermsHashPerField is responsible for indexing terms. Each field has its own TermsHashPerField instance. When a field is indexed, the add() function of that instance is called once per term, and the index content (the term string, pointer information, and position information) is stored in in-memory buffers.
 
 
 
 
 
 
 
II. Class members:
 
 
 
 
 
 
 
2.1 final int streamCount;

If term frequency and positions need to be recorded, this value is 2 (two ints per term are used to record the write offsets of the two streams); otherwise it is 1.

The code is as follows:

streamCount = consumer.getStreamCount(); // consumer: FreqProxTermsWriterPerField
 
 
 
 
 
 
 
2.2 TermAttribute termAtt;

The terms produced by tokenization are stored here, so during indexing each term is read from termAtt.
 
 
 
 
 
 
 
2.3 RawPostingList[] postingsHash and RawPostingList p

The code is as follows:

private RawPostingList[] postingsHash = new RawPostingList[postingsHashSize];

private RawPostingList p;

Each p records the offset addresses, the document number, the frequency within the current document (docFreq), and the position information of one term, and is stored in postingsHash. The array index is derived from the term's hash value.
 
 
 
 
 
 
 
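RawPostingList itself only holds the offset addresses:

abstract class RawPostingList {
  int textStart;  // start of the term text in charPool
  int intStart;   // start of the term's pointers in intPool
  int byteStart;  // start of the term's data in bytePool
}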
 
 
 
 
 
 
 
The objects actually stored, however, are FreqProxTermsWriter.PostingList instances.

The code is as follows:

static final class PostingList extends RawPostingList {
  int docFreq;       // # times this term occurs in the current doc
  int lastDocID;     // last docID where this term occurred
  int lastDocCode;   // code for prior doc
  int lastPosition;  // last position where this term occurred
}
 
 
 
 
 
 
 
Therefore, an entry of postingsHash actually looks like this (debugger view):

[0] = FreqProxTermsWriter$PostingList (id=23)
  byteStart = 0
  docFreq = 1
  intStart = 0
  lastDocCode = 0
  lastDocID = 0
  lastPosition = 0
  textStart = 0
 
 
 
 
 
 
 
postingsHash stores index information for all terms of this field. A term is placed in postingsHash using a hash code computed from its characters as the initial value.

The code is as follows:

int hashPos = code & postingsHashMask;
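
As a minimal sketch of how that initial value can be computed, the standalone class below mirrors the loop in add(): it walks the term's characters from last to first, folds them into an int with multiplier 31, and masks the result to a table index. The class name HashCodeSketch, the method names, and the table size 8 are assumptions for illustration only, and the Unicode handling of the real code is left out.

```java
final class HashCodeSketch {
    // Folds the characters into an int, last character first, as in add().
    static int termCode(char[] text, int len) {
        int code = 0;
        int downto = len;
        while (downto > 0) {
            char ch = text[--downto];
            code = (code * 31) + ch;
        }
        return code;
    }

    // Maps a code to a slot; tableSize must be a power of two,
    // so that (tableSize - 1) can be used as a bit mask.
    static int slot(int code, int tableSize) {
        return code & (tableSize - 1);
    }

    public static void main(String[] args) {
        char[] term = "ab".toCharArray();
        int code = termCode(term, term.length);
        System.out.println(code + " -> slot " + slot(code, 8));
    }
}
```

Masking with size - 1 only works because the table size is kept at a power of two; it replaces a modulo operation with a single AND.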
 
 
 
 
 
 
 
If a collision occurs, the code is increased by a fixed step until an empty slot (or the slot that already holds this term) is found, and p is taken from there. The step is

final int inc = ((code >> 8) + code) | 1;

The code is as follows:

do {
  code += inc;
  hashPos = code & postingsHashMask;
  p = postingsHash[hashPos];
} while (p != null && !postingEquals(tokenText, tokenTextLen));
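
The probing scheme can be tried in isolation. The sketch below is a simplified stand-in, not Lucene's code: it stores plain strings in a power-of-two array, and ProbeSketch, table, and insertOrFind are made-up names. It does use the same step, inc = ((code >> 8) + code) | 1. Because the step is forced to be odd and the table size is a power of two, the probe sequence reaches every slot before it repeats, so a free slot is always found as long as the table is not full.

```java
final class ProbeSketch {
    final String[] table;
    final int mask;

    ProbeSketch(int sizePowerOfTwo) {
        table = new String[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Returns the slot where the term is stored, inserting it if absent.
    // The caller must keep at least one slot free.
    int insertOrFind(String term, int code) {
        int pos = code & mask;
        if (table[pos] != null && !table[pos].equals(term)) {
            // Collision: step through the table with an odd increment.
            final int inc = ((code >> 8) + code) | 1;
            do {
                code += inc;
                pos = code & mask;
            } while (table[pos] != null && !table[pos].equals(term));
        }
        if (table[pos] == null) {
            table[pos] = term;
        }
        return pos;
    }

    public static void main(String[] args) {
        ProbeSketch t = new ProbeSketch(8);
        System.out.println(t.insertOrFind("x", 3));   // slot 3 is free
        System.out.println(t.insertOrFind("y", 11));  // 11 & 7 is also 3: probes further
        System.out.println(t.insertOrFind("y", 11));  // same term, same slot as before
    }
}
```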
 
 
 
 
 
 
 
 
 
 
In addition, once half of postingsHash is in use (the threshold can be adjusted), the table is doubled in size.

The code is as follows:

if (numPostings == postingsHashHalfSize)
  rehashPostings(2 * postingsHashSize);
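
A rough sketch of what such a doubling step involves, under stated assumptions: RehashSketch, hash, and place are hypothetical names, plain strings stand in for posting objects, and all added terms are assumed to be distinct. Every existing entry has to be placed again, because its slot depends on the mask of the new, larger table.

```java
final class RehashSketch {
    String[] table = new String[4]; // size is always a power of two
    int count = 0;

    // Same folding as in add(): last character first, multiplier 31.
    static int hash(String term) {
        int code = 0;
        for (int i = term.length() - 1; i >= 0; i--) {
            code = (code * 31) + term.charAt(i);
        }
        return code;
    }

    // Places a term that is known to be absent into the given table.
    static void place(String[] t, String term) {
        final int mask = t.length - 1;
        int code = hash(term);
        int pos = code & mask;
        if (t[pos] != null) {
            final int inc = ((code >> 8) + code) | 1;
            do {
                code += inc;
                pos = code & mask;
            } while (t[pos] != null);
        }
        t[pos] = term;
    }

    // Adds a new distinct term and doubles the table once it is half full.
    void add(String term) {
        place(table, term);
        count++;
        if (count == table.length / 2) {
            String[] bigger = new String[2 * table.length];
            for (String s : table) {
                if (s != null) {
                    place(bigger, s);
                }
            }
            table = bigger;
        }
    }

    public static void main(String[] args) {
        RehashSketch h = new RehashSketch();
        for (String s : new String[] {"a", "b", "c", "d"}) {
            h.add(s);
            System.out.println(s + " -> table size " + h.table.length);
        }
    }
}
```

Growing at the half-full mark keeps probe sequences short, at the cost of leaving at least half of the slots empty.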
 
 
 
 
 
 
 
2.2 Buffers

The following three pools buffer the index content. They are obtained from DocumentsWriter (through perThread). The structure of the index in these buffers is not the structure that is written to disk.

intPool = perThread.intPool;

charPool = perThread.charPool;

bytePool = perThread.bytePool;
 
 
 
 
 
 
 
2.2.1 final CharBlockPool charPool;

Stores the term text (the string content).

Terms are stored contiguously, with adjacent terms separated by one char; that is, textLen + 1 chars are written per term.

The string pointer records only the start position of each term. Neither the length nor the end position is recorded; the length is only known from where the adjacent term starts.
 
 
 
 
 
 
 
For example
 
 
 
 
 
 
China South University of Science and Technology 2 0 0 9 1 0 1 9 0 1 4 6
 
 
 
|- |-|
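
To make the layout concrete, here is a small sketch under stated assumptions: a single fixed char array stands in for CharBlockPool (the real pool chains several blocks), and CharPoolSketch, append, and readTerm are hypothetical names. Each term is copied in, followed by the 0xFFFF terminator that add() writes, and only the start offset is returned. The term is recovered by scanning from that offset to the terminator.

```java
final class CharPoolSketch {
    static final char END = 0xFFFF;      // separator written after every term
    final char[] buffer = new char[64];  // single block; overflow is not handled
    int charUpto = 0;                    // next free position

    // Appends the term plus terminator and returns the term's start offset.
    int append(String term) {
        int start = charUpto;
        term.getChars(0, term.length(), buffer, start);
        buffer[start + term.length()] = END;
        charUpto += term.length() + 1;
        return start;
    }

    // Reads a term back using only its start offset.
    String readTerm(int start) {
        int end = start;
        while (buffer[end] != END) {
            end++;
        }
        return new String(buffer, start, end - start);
    }

    public static void main(String[] args) {
        CharPoolSketch pool = new CharPoolSketch();
        int a = pool.append("a");
        int b = pool.append("200910190147");
        int c = pool.append("x");
        System.out.println(a + " " + b + " " + c);
        System.out.println(pool.readTerm(b));
    }
}
```

A 12-character term advances the write position by 13, which is the same step as from textStart = 45 to textStart = 58 in the experiment data later in this article.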
 
 
 
 
 
 
 
2.2.2 final IntBlockPool intPool;

Stores the offsets (into bytePool) of the position information.

for (int i = 0; i < streamCount; i++) {
  // intUptos[intUptoStart + 1] records the offset of the position information
  // intUptos[intUptoStart + 0]: purpose not determined here
  final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
  intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}

Each term occupies streamCount (2) entries; that is, two pointers are recorded per term.
 
 
 
 
 
 
 
2.2.3 final ByteBlockPool bytePool;

Stores the position information of the terms.

The (encoded) position information of a new term starts every 10 bytes:

|----|----|----|
0    5    10   15

The interval of 10 follows from two values:

1. ByteBlockPool.FIRST_LEVEL_SIZE = 5

2. As mentioned above, streamCount = 2 when the frequency and positions of terms need to be saved

The code is as follows:

for (int i = 0; i < streamCount; i++) {
  final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
  intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}
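
The arithmetic behind the interval can be checked with a toy allocator. In this sketch, SliceSketch and its newSlice are simplified stand-ins that only move a write pointer forward; the real ByteBlockPool does more bookkeeping, which is not modelled here. FIRST_LEVEL_SIZE = 5 is the value quoted above.

```java
final class SliceSketch {
    static final int FIRST_LEVEL_SIZE = 5;
    int byteUpto = 0; // next free byte

    // Reserves 'size' bytes and returns the start of the reserved range.
    int newSlice(int size) {
        int upto = byteUpto;
        byteUpto += size;
        return upto;
    }

    // Allocates one first-level slice per stream for a new term
    // and returns the start offsets of those slices.
    int[] startTerm(int streamCount) {
        int[] uptos = new int[streamCount];
        for (int i = 0; i < streamCount; i++) {
            uptos[i] = newSlice(FIRST_LEVEL_SIZE);
        }
        return uptos;
    }

    public static void main(String[] args) {
        SliceSketch pool = new SliceSketch();
        for (int term = 0; term < 3; term++) {
            int[] u = pool.startTerm(2);
            System.out.println("term " + term + ": " + u[0] + ", " + u[1]);
        }
    }
}
```

With streamCount = 2 the start offsets advance by 10 per term, the same pattern as the pairs 80/85, 90/95, and 100/105 in the experiment data.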
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
III. Class member functions

3.1 Constructor

public TermsHashPerField(DocInverterPerField docInverterPerField,
                         final TermsHashPerThread perThread,
                         final TermsHashPerThread nextPerThread,
                         final FieldInfo fieldInfo)

Objects shared across the indexing process are passed in through the constructor.

final FieldInfo fieldInfo is the field information.

The buffer memory pools are taken from perThread:

intPool = perThread.intPool;

charPool = perThread.charPool;

bytePool = perThread.bytePool;
 
 
 
 
 
 
 
 
 
 
3.2 Indexing process

The indexing of a term is carried out by the add() function. Its annotated body follows.

void add() throws IOException {

  // Get the term text and its length after tokenization
  final char[] tokenText = termAtt.termBuffer();
  final int tokenTextLen = termAtt.termLength();

  // Compute the postingsHash position from a hash code over the term's characters
  int downto = tokenTextLen;
  int code = 0;

  while (downto > 0) {
    // Start from the last character
    char ch = tokenText[--downto];
    // (Unicode handling omitted here)
    code = (code * 31) + ch; // fold the character into the code
  }

  // Final hash position of the term
  int hashPos = code & postingsHashMask;

  // Locate the RawPostingList in the hash table
  p = postingsHash[hashPos];

  // If the slot is taken by a different term, probe for another slot
  if (p != null && !postingEquals(tokenText, tokenTextLen)) {
    // Conflict: keep searching different locations in
    // the hash table.
    final int inc = ((code >> 8) + code) | 1;
    do {
      code += inc;
      hashPos = code & postingsHashMask;
      p = postingsHash[hashPos];
    } while (p != null && !postingEquals(tokenText, tokenTextLen));
  }
 
 
 
 
 
 
 
  // A free slot was found (postingsHash has never indexed this term)
  if (p == null) { // p is null the first time a term is seen

    final int textLen1 = 1 + tokenTextLen;

    // The current char buffer cannot hold the term: switch to a new buffer
    if (textLen1 + charPool.charUpto > DocumentsWriter.CHAR_BLOCK_SIZE) { // 16384
      if (textLen1 > DocumentsWriter.CHAR_BLOCK_SIZE) {
        // The term does not fit into a whole block: skip it
        if (docState.maxTermPrefix == null)
          docState.maxTermPrefix = new String(tokenText, 0, 30);

        consumer.skippingLongTerm();
        return;
      }
      charPool.nextBuffer();
    }

    // Pull next free RawPostingList from free list
    p = perThread.freePostings[--perThread.freePostingsCount];

    final char[] text = charPool.buffer;
    final int textUpto = charPool.charUpto;

    // Record the offset of the term text
    p.textStart = textUpto + charPool.charOffset;
    charPool.charUpto += textLen1; // original term length + 1

    // Copy the term characters into the buffer and terminate them
    System.arraycopy(tokenText, 0, text, textUpto, tokenTextLen);
    text[textUpto + tokenTextLen] = 0xFFFF;

    assert postingsHash[hashPos] == null;
    postingsHash[hashPos] = p;
    numPostings++;

    // Expand postingsHash once it is half full
    if (numPostings == postingsHashHalfSize)
      rehashPostings(2 * postingsHashSize);
 
 
 
 
 
 
 
    // Init stream slices
    if (numPostingInt + intPool.intUpto > DocumentsWriter.INT_BLOCK_SIZE)
      intPool.nextBuffer(); // not enough room: request a new buffer

    if (DocumentsWriter.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE)
      bytePool.nextBuffer(); // not enough room: request a new buffer

    intUptos = intPool.buffer; // holds the stream offsets
    intUptoStart = intPool.intUpto;
    intPool.intUpto += streamCount; // each term uses streamCount offset entries

    p.intStart = intUptoStart + intPool.intOffset;

    for (int i = 0; i < streamCount; i++) {
      final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
      intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
    }

    p.byteStart = intUptos[intUptoStart];

    consumer.newTerm(p);

  } else {
    // The term has already been indexed (it occurs again)
    intUptos = intPool.buffers[p.intStart >> DocumentsWriter.INT_BLOCK_SHIFT];
    intUptoStart = p.intStart & DocumentsWriter.INT_BLOCK_MASK;
    consumer.addTerm(p); // append to the posting list in compressed form
  }
}
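
The else branch treats p.intStart as a global offset across all int blocks: the high bits select the block and the low bits select the position inside it. A minimal sketch of that split follows. The shift value 13 (blocks of 8192 ints) is an assumption for illustration rather than a value stated in this article, and BlockAddressSketch and its method names are made up.

```java
final class BlockAddressSketch {
    static final int INT_BLOCK_SHIFT = 13;                 // assumed value
    static final int INT_BLOCK_SIZE = 1 << INT_BLOCK_SHIFT;
    static final int INT_BLOCK_MASK = INT_BLOCK_SIZE - 1;

    // Which block a global offset falls into.
    static int blockIndex(int intStart) {
        return intStart >> INT_BLOCK_SHIFT;
    }

    // Position of a global offset inside its block.
    static int offsetInBlock(int intStart) {
        return intStart & INT_BLOCK_MASK;
    }

    // Inverse mapping: block number and local position back to a global offset.
    static int globalOffset(int blockIndex, int offsetInBlock) {
        return blockIndex * INT_BLOCK_SIZE + offsetInBlock;
    }

    public static void main(String[] args) {
        int intStart = globalOffset(1, 18);
        System.out.println(intStart + " -> block " + blockIndex(intStart)
                + ", offset " + offsetInBlock(intStart));
    }
}
```

Storing one global int per term keeps the posting object small, and the shift and mask make the lookup of the right block cheap.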
 
 
 
 
 
 
 
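IV. Experiment

The following trace data, recorded while parsing and indexing two documents, supports the analysis above.

The content of document 1 is "Sina news", and the content of document 2 is "Lianhe Zaobao". The terms in the trace appear to be the individual characters of the original Chinese text, one term per character.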
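
The trace below includes calls to writeVInt, which stores integers in a variable-length byte format. The following sketch shows that format as it is commonly described for Lucene: seven data bits per byte, lowest group first, with the high bit set on every byte except the last. VIntSketch and its methods are hypothetical names, and the sketch returns a plain byte array instead of writing into bytePool slices.

```java
final class VIntSketch {
    // Encodes a non-negative int into 1 to 5 bytes.
    static byte[] encode(int value) {
        byte[] tmp = new byte[5];
        int n = 0;
        while ((value & ~0x7F) != 0) {
            tmp[n++] = (byte) ((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        tmp[n++] = (byte) value;                       // last byte, high bit clear
        byte[] out = new byte[n];
        System.arraycopy(tmp, 0, out, 0, n);
        return out;
    }

    // Decodes a byte sequence produced by encode().
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(6).length);    // small values need one byte
        System.out.println(encode(300).length);  // larger values need more
        System.out.println(decode(encode(300)));
    }
}
```

The byte values that appear in the trace (0, 2, 4, 6) are all below 128, so each of them fits in a single byte.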
 
 
 
 
 
 
 
Adding D:\file\2.txt

New

intUptoStart = 16
intPool.intUpto = 18
p.intStart = 16
intUptos[intUptoStart + i] = 80
intUptos[intUptoStart + i] = 85
p.byteStart = 80
bytes[85] = 0

The value of RawPostingList p is as follows:

byteStart = 80
docFreq = 1
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39

TermsHashPerField.writeVInt(1, 0)

Lang

intUptoStart = 18
intPool.intUpto = 20
p.intStart = 18
intUptos[intUptoStart + i] = 90
intUptos[intUptoStart + i] = 95
p.byteStart = 90
bytes[95] = 2

The value of RawPostingList p is as follows:

byteStart = 90
docFreq = 1
intStart = 18
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 41

TermsHashPerField.writeVInt(1, 1)
 
 
 
 
 
 
 
New (note: this term appears for the second time)

intUptoStart = 16
intPool.intUpto = 20
p.intStart = 16
bytes[86] = 4 // position information is appended right after the previous entry

The value of RawPostingList p is as follows:

byteStart = 80
docFreq = 2 // the frequency within this document is now 2
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39

TermsHashPerField.writeVInt(1, 2)

Wen

intUptoStart = 20
intPool.intUpto = 22
p.intStart = 20
intUptos[intUptoStart + i] = 100
intUptos[intUptoStart + i] = 105
p.byteStart = 100
bytes[105] = 6

The value of RawPostingList p is as follows:

byteStart = 100
docFreq = 1
intStart = 20
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 43

TermsHashPerField.writeVInt(1, 3)
 
 
 
 
 
 
 
200910190147

intUptoStart = 22
intPool.intUpto = 24
p.intStart = 22
intUptos[intUptoStart + i] = 110
intUptos[intUptoStart + i] = 115
p.byteStart = 110
bytes[115] = 0

byteStart = 110
docFreq = 1
intStart = 22
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 45

D:\file\2.txt

intUptoStart = 24
intPool.intUpto = 26
p.intStart = 24
intUptos[intUptoStart + i] = 120
intUptos[intUptoStart + i] = 125
p.byteStart = 120
bytes[125] = 0

byteStart = 120
docFreq = 1
intStart = 24
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 58
 
 
 
 
 
 
 
Adding D:\file\3.txt

Connection

intUptoStart = 26
intPool.intUpto = 28
p.intStart = 26
intUptos[intUptoStart + i] = 130
intUptos[intUptoStart + i] = 135
p.byteStart = 130
bytes[135] = 0

The value of RawPostingList p is as follows:

byteStart = 130
docFreq = 1
intStart = 26
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 72

Integration

intUptoStart = 28
intPool.intUpto = 30
p.intStart = 28
intUptos[intUptoStart + i] = 140
intUptos[intUptoStart + i] = 145
p.byteStart = 140
bytes[145] = 2

The value of RawPostingList p is as follows:

byteStart = 140
docFreq = 1
intStart = 28
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 74
 
 
 
 
 
 
 
 
 
 
 
Early

intUptoStart = 30
intPool.intUpto = 32
p.intStart = 30
intUptos[intUptoStart + i] = 150
intUptos[intUptoStart + i] = 155
p.byteStart = 150
bytes[155] = 4

The value of RawPostingList p is as follows:

byteStart = 150
docFreq = 1
intStart = 30
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 76

Report

intUptoStart = 32
intPool.intUpto = 34
p.intStart = 32
intUptos[intUptoStart + i] = 160
intUptos[intUptoStart + i] = 165
p.byteStart = 160
bytes[165] = 6

The value of RawPostingList p is as follows:

byteStart = 160
docFreq = 1
intStart = 32
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 78
 
 
 
 
 
 
 
 
 
 
 
200910190148

intUptoStart = 34
intPool.intUpto = 36
p.intStart = 34
intUptos[intUptoStart + i] = 170
intUptos[intUptoStart + i] = 175
p.byteStart = 170
bytes[175] = 0

D:\file\3.txt

intUptoStart = 36
intPool.intUpto = 38
p.intStart = 36
intUptos[intUptoStart + i] = 180
intUptos[intUptoStart + i] = 185
p.byteStart = 180
 
 
 
This article is from the CSDN blog. When reproducing it, please cite the source: http://blog.csdn.net/todaylxp/archive/2009/10/25/4726017.aspx