TermsHashPerField class
I. Class Overview:
Responsible for the term-indexing process. Each field has its own TermsHashPerField; when a field is indexed, that TermsHashPerField's add() method carries out one term's indexing, storing the index content (the term string, pointer information, and position information) in in-memory buffers.
II. Class member description:
2.1 final int streamCount;
If term frequency and positions need to be recorded, this value is 2 (two ints are used to record offset pointers); otherwise it is 1.
The code is as follows:
streamCount = consumer.getStreamCount(); // consumer is the FreqProxTermsWriterPerField
2.2 TermAttribute termAtt;
The tokenized terms are stored here, so terms are retrieved from termAtt during indexing.
2.3 RawPostingList[] postingsHash and RawPostingList p
The code is as follows:
private RawPostingList[] postingsHash = new RawPostingList[postingsHashSize];
private RawPostingList p;
Each p records the term's offset addresses, document number, document frequency, and position information,
and is stored in postingsHash, with the hash value as the subscript.
RawPostingList itself holds only the offset addresses:
abstract class RawPostingList {
int textStart;
int intStart;
int byteStart;
}
However, what is actually stored is a FreqProxTermsWriter.PostingList.
The code is as follows:
static final class PostingList extends RawPostingList {
int docFreq; // # times this term occurs in the current doc
int lastDocID; // last docID where this term occurred
int lastDocCode; // code for prior doc
int lastPosition; // last position where this term occurred
}
Therefore, postingsHash actually stores entries such as:
[0] = FreqProxTermsWriter$PostingList (id=23)
byteStart = 0
docFreq = 1
intStart = 0
lastDocCode = 0
lastDocID = 0
lastPosition = 0
textStart = 0
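To illustrate how these PostingList fields evolve across term occurrences, here is a hedged Java sketch. The field names mirror the class above, but the update rules are a simplification of what the consumer (FreqProxTermsWriter) does, and the doc-code shift `docID << 1` is an inference from the experiment values later in this article (lastDocCode = 2 for docID 1, lastDocCode = 4 for docID 2):

```java
// Simplified sketch of PostingList updates; not Lucene's actual consumer code.
public class PostingSketch {
    int docFreq;      // occurrences of the term in the current doc
    int lastDocID;    // last doc where the term occurred
    int lastDocCode;  // delta-coded doc id, left-shifted by one bit (assumed)

    // Called the first time the term is ever seen (consumer.newTerm).
    void newTerm(int docID) {
        docFreq = 1;
        lastDocID = docID;
        lastDocCode = docID << 1;
    }

    // Called on every later occurrence (consumer.addTerm).
    void addTerm(int docID) {
        if (docID == lastDocID) {
            docFreq++;                               // same doc: bump the in-doc freq
        } else {
            lastDocCode = (docID - lastDocID) << 1;  // new doc: store the delta
            lastDocID = docID;
            docFreq = 1;
        }
    }
}
```

The experiment section's second occurrence of the first term (docFreq becomes 2 while lastDocCode stays unchanged) matches this behavior.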
postingsHash holds the index information of all terms in this field. A term is hashed into postingsHash using an encoding of its characters as the initial value.
The code is as follows:
int hashPos = code & postingsHashMask;
If an address collision occurs, the encoding is increased by a fixed step until an empty slot is found, and p is placed there. The step is:
final int inc = ((code >> 8) + code) | 1;
The code is as follows:
do {
code += inc;
hashPos = code & postingsHashMask;
p = postingsHash[hashPos];
} while (p != null && !postingEquals(tokenText, tokenTextLen));
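A minimal, self-contained sketch of this hash-and-probe scheme (ProbeSketch, SIZE, and the String-based table are illustrative assumptions, not Lucene's code; the real class compares against the char pool rather than storing Strings, and also rehashes):

```java
// Sketch of the rolling hash plus open-addressing probe described above.
public class ProbeSketch {
    static final int SIZE = 16;            // table size must be a power of two
    static final int MASK = SIZE - 1;      // plays the role of postingsHashMask
    static final String[] table = new String[SIZE];

    // Rolling hash over the characters, like code = code*31 + ch.
    static int hash(String term) {
        int code = 0;
        for (int i = term.length() - 1; i >= 0; i--) {
            code = code * 31 + term.charAt(i);
        }
        return code;
    }

    // Insert using the collision step inc = ((code >> 8) + code) | 1.
    static int insert(String term) {
        int code = hash(term);
        int hashPos = code & MASK;
        if (table[hashPos] != null && !table[hashPos].equals(term)) {
            final int inc = ((code >> 8) + code) | 1;  // "| 1" keeps the step odd
            do {
                code += inc;
                hashPos = code & MASK;
            } while (table[hashPos] != null && !table[hashPos].equals(term));
        }
        table[hashPos] = term;
        return hashPos;
    }
}
```

Because the step is forced odd and the table size is a power of two, the probe sequence visits every slot before repeating, so an empty slot is always found in a half-full table.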
In addition, if postingsHash becomes more than half full (the threshold is adjustable), it is expanded.
The code is as follows:
if (numPostings == postingsHashHalfSize)
rehashPostings(2 * postingsHashSize);
2.4 Buffer memory pools
The following three pools cache the index content; they are obtained from DocumentsWriter. The index structure in these buffers is not the structure that is written to disk.
intPool = perThread.intPool;
charPool = perThread.charPool;
bytePool = perThread.bytePool;
2.4.1 final CharBlockPool charPool;
Stores the term content (the character strings).
Terms are stored contiguously; adjacent terms are separated by one char, so textLen + 1 chars are consumed per term.
Only the start position of each term is recorded (textStart); no length or end position is kept, so a term's length is known only from the start position of the adjacent string.
For example, a university-name term followed by date-digit terms, laid out end to end (the bars in the original diagram marked term boundaries):
China South University of Science and Technology 2 0 0 9 1 0 1 9 0 1 4 6
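A hedged sketch of this layout (CharPoolSketch and its fixed-size buffer are illustrative assumptions; the real CharBlockPool manages multiple buffers, with 0xFFFF playing the separator role as shown in the add() code later):

```java
// Terms appended end to end into one char[] buffer, each followed by the
// sentinel 0xFFFF, so a textStart offset alone suffices to read a term back.
public class CharPoolSketch {
    static final char END = 0xFFFF;     // separator char between adjacent terms
    final char[] buffer = new char[1024];
    int charUpto = 0;

    // Append a term; returns its textStart offset.
    int addTerm(String term) {
        int textStart = charUpto;
        term.getChars(0, term.length(), buffer, charUpto);
        charUpto += term.length();
        buffer[charUpto++] = END;       // textLen + 1 chars consumed in total
        return textStart;
    }

    // Read a term back given only its start offset, scanning to the sentinel.
    String readTerm(int textStart) {
        int end = textStart;
        while (buffer[end] != END) end++;
        return new String(buffer, textStart, end - textStart);
    }
}
```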
2.4.2 final IntBlockPool intPool;
Stores the offsets into the byte slices for each stream.
for (int i = 0; i < streamCount; i++)
{
// intUptos[intUptoStart + 1] records the position-information offset;
// intUptos[intUptoStart + 0] is the other stream's offset (unclear in the original notes; in FreqProxTermsWriter, stream 0 carries the frequency data)
final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}
Each term occupies streamCount (here 2) ints, i.e. each term records two pointers.
2.4.3 final ByteBlockPool bytePool;
Stores the term position information (after encoding).
Each new term initially occupies a slot every 10 bytes:
|----|----|
0    5    10   15
The interval of 10 is determined by the following:
1. ByteBlockPool.FIRST_LEVEL_SIZE = 5
2. As mentioned above, streamCount = 2 when term frequency and positions need to be saved.
The code is as follows:
for (int i = 0; i < streamCount; i++)
{
final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}
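The 10-byte initial footprint per term follows directly from those two facts; a trivial sketch of the arithmetic (SliceMath is an illustrative name):

```java
// Why each new term initially consumes 10 bytes in bytePool:
// one FIRST_LEVEL_SIZE slice per stream, and streamCount == 2
// when both frequency and positions are recorded.
public class SliceMath {
    static final int FIRST_LEVEL_SIZE = 5; // mirrors ByteBlockPool.FIRST_LEVEL_SIZE

    static int initialBytesPerTerm(int streamCount) {
        return streamCount * FIRST_LEVEL_SIZE;
    }
}
```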
III. Class member functions
3.1 Constructor
public TermsHashPerField(DocInverterPerField docInverterPerField,
final TermsHashPerThread perThread,
final TermsHashPerThread nextPerThread,
final FieldInfo fieldInfo)
Some index-global variables are passed in through the constructor:
fieldInfo carries the field information,
and the buffer memory pools are taken over:
intPool = perThread.intPool;
charPool = perThread.charPool;
bytePool = perThread.bytePool;
3.2 Indexing process
The indexing process is carried out by the add() method.
The code is as follows:
void add() throws IOException {
// Obtain the tokenized term's characters and length
final char[] tokenText = termAtt.termBuffer();
final int tokenTextLen = termAtt.termLength();
// Compute the postingsHash position, using the term's characters for the encoding
int downto = tokenTextLen;
int code = 0;
while (downto > 0) {
// starting from the high end
char ch = tokenText[--downto];
code = (code * 31) + ch; // rolling hash over the characters
} // while (downto > 0)
// Final encoding of the term
int hashPos = code & postingsHashMask;
// Locate the RawPostingList in the hash table
p = postingsHash[hashPos];
// If the hash slot is occupied by a different term, probe for the next slot
if (p != null && !postingEquals(tokenText, tokenTextLen)) {
// Conflict: keep searching different locations in
// the hash table.
final int inc = ((code >> 8) + code) | 1;
do {
code += inc;
hashPos = code & postingsHashMask;
p = postingsHash[hashPos];
} while (p != null && !postingEquals(tokenText, tokenTextLen));
}
// A usable slot was found (postingsHash has never indexed this term)
if (p == null) { // the first time, p is null because there is no hash conflict
final int textLen1 = 1 + tokenTextLen;
// The buffer storing term characters is full; extend it
if (textLen1 + charPool.charUpto > DocumentsWriter.CHAR_BLOCK_SIZE) // 16384
{
if (textLen1 > DocumentsWriter.CHAR_BLOCK_SIZE) {
if (docState.maxTermPrefix == null)
docState.maxTermPrefix = new String(tokenText, 0, 30);
consumer.skippingLongTerm();
return;
}
charPool.nextBuffer();
}
// Pull the next free RawPostingList from the free list
p = perThread.freePostings[--perThread.freePostingsCount];
final char[] text = charPool.buffer;
final int textUpto = charPool.charUpto;
// Record the offset address of the term's characters
p.textStart = textUpto + charPool.charOffset;
charPool.charUpto += textLen1; // original character length + 1
// Copy the term's characters into the buffer
System.arraycopy(tokenText, 0, text, textUpto, tokenTextLen);
text[textUpto + tokenTextLen] = 0xffff;
assert postingsHash[hashPos] == null;
postingsHash[hashPos] = p;
numPostings++;
// postingsHash expands if it becomes more than half full
if (numPostings == postingsHashHalfSize)
rehashPostings(2 * postingsHashSize);
// Init stream slices
if (numPostingInt + intPool.intUpto > DocumentsWriter.INT_BLOCK_SIZE)
intPool.nextBuffer(); // insufficient memory; allocate a new buffer
if (DocumentsWriter.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE)
bytePool.nextBuffer(); // insufficient memory; allocate a new buffer
intUptos = intPool.buffer; // records the offsets
intUptoStart = intPool.intUpto;
intPool.intUpto += streamCount; // each term uses streamCount offset values
p.intStart = intUptoStart + intPool.intOffset;
for (int i = 0; i < streamCount; i++)
{
final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}
p.byteStart = intUptos[intUptoStart];
consumer.newTerm(p);
} else {
// This term has already been indexed (it occurs again in the document)
intUptos = intPool.buffers[p.intStart >> DocumentsWriter.INT_BLOCK_SHIFT];
intUptoStart = p.intStart & DocumentsWriter.INT_BLOCK_MASK;
consumer.addTerm(p); // appends to the posting list with compressed encoding
}
}
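The consumer then writes frequency and position data into the byte slices as VInts, which is what the TermsHashPerField.writeVInt calls in the experiment below record. As a standalone sketch, Lucene's VInt format stores 7 data bits per byte, low-order group first, with the high bit set on every byte except the last (VIntSketch is an illustrative name, not a Lucene class):

```java
import java.io.ByteArrayOutputStream;

// Encode one int in Lucene's variable-length VInt format.
public class VIntSketch {
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out.write(value);                      // final byte, high bit clear
        return out.toByteArray();
    }
}
```

Small values such as the position deltas in the experiment thus fit in a single byte.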
IV. Experiment
The following data from indexing two documents supports the analysis above.
The content of document 1 is "Sina news" (新浪新闻) and the content of document 2 is "Lianhe Zaobao" (联合早报); each Chinese character is tokenized as one term.
Adding D:\file\2.txt
Term "新" ("New")
intUptoStart = 16
intPool.intUpto = 18
p.intStart = 16
intUptos[intUptoStart + 0] = 80
intUptos[intUptoStart + 1] = 85
p.byteStart = 80
bytes[85] = 0
The value of RawPostingList p is as follows:
byteStart = 80
docFreq = 1
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39
TermsHashPerField.writeVInt(1, 0)
Term "浪" ("Lang")
intUptoStart = 18
intPool.intUpto = 20
p.intStart = 18
intUptos[intUptoStart + 0] = 90
intUptos[intUptoStart + 1] = 95
p.byteStart = 90
bytes[95] = 2
The value of RawPostingList p is as follows:
byteStart = 90
docFreq = 1
intStart = 18
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 41
TermsHashPerField.writeVInt(1, 1)
Term "新" (note: this term now appears for the second time)
intUptoStart = 16
intPool.intUpto = 20
p.intStart = 16
bytes[86] = 4 // position information is stored contiguously
The value of RawPostingList p is as follows:
byteStart = 80
docFreq = 2 // the in-document frequency becomes 2
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39
TermsHashPerField.writeVInt(1, 2)
Term "闻" ("Wen")
intUptoStart = 20
intPool.intUpto = 22
p.intStart = 20
intUptos[intUptoStart + 0] = 100
intUptos[intUptoStart + 1] = 105
p.byteStart = 100
bytes[105] = 6
The value of RawPostingList p is as follows:
byteStart = 100
docFreq = 1
intStart = 20
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 43
TermsHashPerField.writeVInt(1, 3)
Term "200910190147"
intUptoStart = 22
intPool.intUpto = 24
p.intStart = 22
intUptos[intUptoStart + 0] = 110
intUptos[intUptoStart + 1] = 115
p.byteStart = 110
bytes[115] = 0
byteStart = 110
docFreq = 1
intStart = 22
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 45
Term "D:\file\2.txt"
intUptoStart = 24
intPool.intUpto = 26
p.intStart = 24
intUptos[intUptoStart + 0] = 120
intUptos[intUptoStart + 1] = 125
p.byteStart = 120
bytes[125] = 0
byteStart = 120
docFreq = 1
intStart = 24
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 58
Adding D:\file\3.txt
Term "联" ("Connection")
intUptoStart = 26
intPool.intUpto = 28
p.intStart = 26
intUptos[intUptoStart + 0] = 130
intUptos[intUptoStart + 1] = 135
p.byteStart = 130
bytes[135] = 0
The value of RawPostingList p is as follows:
byteStart = 130
docFreq = 1
intStart = 26
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 72
Term "合" ("Integration")
intUptoStart = 28
intPool.intUpto = 30
p.intStart = 28
intUptos[intUptoStart + 0] = 140
intUptos[intUptoStart + 1] = 145
p.byteStart = 140
bytes[145] = 2
The value of RawPostingList p is as follows:
byteStart = 140
docFreq = 1
intStart = 28
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 74
Term "早" ("Early")
intUptoStart = 30
intPool.intUpto = 32
p.intStart = 30
intUptos[intUptoStart + 0] = 150
intUptos[intUptoStart + 1] = 155
p.byteStart = 150
bytes[155] = 4
The value of RawPostingList p is as follows:
byteStart = 150
docFreq = 1
intStart = 30
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 76
Term "报" ("Report")
intUptoStart = 32
intPool.intUpto = 34
p.intStart = 32
intUptos[intUptoStart + 0] = 160
intUptos[intUptoStart + 1] = 165
p.byteStart = 160
bytes[165] = 6
The value of RawPostingList p is as follows:
byteStart = 160
docFreq = 1
intStart = 32
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 78
Term "200910190148"
intUptoStart = 34
intPool.intUpto = 36
p.intStart = 34
intUptos[intUptoStart + 0] = 170
intUptos[intUptoStart + 1] = 175
p.byteStart = 170
bytes[175] = 0
Term "D:\file\3.txt"
intUptoStart = 36
intPool.intUpto = 38
p.intStart = 36
intUptos[intUptoStart + 0] = 180
intUptos[intUptoStart + 1] = 185
p.byteStart = 180
This article is from a CSDN blog; when reproducing it, please cite the source: http://blog.csdn.net/todaylxp/archive/2009/10/25/4726017.aspx