[Posting] Lucene 2.9.0 indexing process (I): TermsHashPerField

Source: Internet
Author: User

TermsHashPerField class

 

I. Class Function Overview

This class is responsible for indexing terms. Each field has a corresponding TermsHashPerField. When a field is indexed, the field's TermsHashPerField.add() function indexes one term at a time, storing the index content (term text, pointer information, and position information) in in-memory buffers.

 

II. Class Member Description

 

2.1 final int streamCount;

If both term frequency and positions need to be recorded, this value is 2 (two int slots per term record offsets into the byte pool); otherwise it is 1.

 

The code is as follows:

streamCount = consumer.getStreamCount(); // consumer is a FreqProxTermsWriterPerField

 

2.2 TermAttribute termAtt;

The tokenized terms are stored here, so terms are retrieved from termAtt during indexing.

 

2.3 RawPostingList[] postingsHash and RawPostingList p

The code is as follows:

private RawPostingList[] postingsHash = new RawPostingList[postingsHashSize];

private RawPostingList p;

 

Each p records a term's offset addresses, document number, document frequency (df), and position information, and is stored in postingsHash with the term's hash value as the subscript.

 

RawPostingList itself saves only the offset addresses:

abstract class RawPostingList {
    int textStart;
    int intStart;
    int byteStart;
}

 

But the actual runtime type is FreqProxTermsWriter.PostingList.

The code is as follows:

static final class PostingList extends RawPostingList {
    int docFreq;      // # times this term occurs in the current doc
    int lastDocID;    // last docID where this term occurred
    int lastDocCode;  // code for prior doc
    int lastPosition; // last position where this term occurred
}

 

Therefore, a postingsHash entry actually looks like:

[0] = FreqProxTermsWriter$PostingList (id = 23)
    byteStart = 0
    docFreq = 1
    intStart = 0
    lastDocCode = 0
    lastDocID = 0
    lastPosition = 0
    textStart = 0

 

postingsHash holds the index information of all terms in this field. A term is hashed into postingsHash using a code computed from its characters as the initial value.

The code is as follows:

int hashPos = code & postingsHashMask;

 

If an address conflict occurs, the code is advanced by a fixed step until an empty slot (or the slot already holding this term) is found. The step is:

final int inc = ((code >> 8) + code) | 1;

The code is as follows:

do {
    code += inc;
    hashPos = code & postingsHashMask;
    p = postingsHash[hashPos];
} while (p != null && !postingEquals(tokenText, tokenTextLen));

 

 

In addition, if postingsHash becomes more than half full (the threshold is adjustable), it is expanded.

The code is as follows:

if (numPostings == postingsHashHalfSize)
    rehashPostings(2 * postingsHashSize);
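The collision handling above amounts to open addressing over a power-of-two table with a probe step derived from the hash code itself. A minimal sketch (not the actual Lucene source; the table here stores plain strings rather than RawPostingList entries):

```java
// Illustrative sketch of TermsHashPerField's collision loop: open addressing
// with a code-derived probe increment.
public class ProbeSketch {
    static final int SIZE = 16;          // postingsHashSize: a power of two
    static final int MASK = SIZE - 1;    // postingsHashMask
    static String[] table = new String[SIZE];

    static void insert(String term) {
        int code = term.hashCode();      // stands in for the code*31+ch loop
        int pos = code & MASK;
        // ((code >> 8) + code) | 1 forces the step to be odd, so it is
        // coprime with the power-of-two table size and the probe sequence
        // eventually visits every slot.
        final int inc = ((code >> 8) + code) | 1;
        while (table[pos] != null && !table[pos].equals(term)) {
            code += inc;
            pos = code & MASK;
        }
        table[pos] = term;               // new term, or overwrite of same term
    }

    public static void main(String[] args) {
        insert("apple");
        insert("banana");
        insert("apple");                 // duplicate reuses its original slot
        int used = 0;
        for (String s : table) if (s != null) used++;
        System.out.println(used);        // prints 2
    }
}
```

Forcing the low bit of the increment to 1 is what guarantees termination: an odd step modulo a power-of-two table size cycles through all slots.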

 

2.4 Buffer Pools

 

The following three pools cache the index content; they are obtained from DocumentsWriter via the per-thread object. The index structure in these buffers is not the structure that is finally written to disk.

intPool = perThread.intPool;
charPool = perThread.charPool;
bytePool = perThread.bytePool;

 

2.4.1 final CharBlockPool charPool;

Stores term content (the string characters). Terms are stored contiguously, with adjacent terms separated by one char, i.e. textLen + 1 chars are used per term. The string pointer (textStart) records each term's start position; neither length nor end position is recorded, since the length can be derived from the start position of the adjacent string.

 

For example, the term "China South University of Science and Technology" packed next to the term "200910190146" (each | marks the one-char separator):

ChinaSouthUniversityOfScienceAndTechnology|200910190146|

 
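A minimal sketch of this packing scheme (an assumed simplification: the real CharBlockPool is split into fixed-size blocks, while this sketch uses a single flat array):

```java
// Sketch of the char-pool packing: terms back to back, each terminated by
// 0xFFFF, so only each term's start offset (textStart) must be recorded.
public class CharPoolSketch {
    static char[] pool = new char[1024];
    static int charUpto = 0;

    // Append a term and return its textStart offset in the pool.
    static int addTerm(String term) {
        int start = charUpto;
        term.getChars(0, term.length(), pool, charUpto);
        charUpto += term.length();
        pool[charUpto++] = 0xFFFF;       // separator: textLen + 1 chars total
        return start;
    }

    // Recover a term from textStart alone by scanning for the separator;
    // this is why no length or end position needs to be stored.
    static String termAt(int start) {
        int end = start;
        while (pool[end] != 0xFFFF) end++;
        return new String(pool, start, end - start);
    }

    public static void main(String[] args) {
        int a = addTerm("sina");
        int b = addTerm("news");
        System.out.println(termAt(a) + " " + termAt(b)); // prints "sina news"
    }
}
```

0xFFFF works as a separator because it is not a valid Unicode character, so it can never appear inside a term.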

2.4.2 final IntBlockPool intPool;

Stores, for each term, the current write offsets into the byte pool:

for (int i = 0; i < streamCount; i++) {
    // intUptos[intUptoStart + 0] records the freq-stream offset,
    // intUptos[intUptoStart + 1] records the position-stream offset
    final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
    intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}

Each term uses streamCount (here 2) int slots, i.e. each term records two pointers.

 

2.4.3 final ByteBlockPool bytePool;

Term position information (after encoding) is written here.

Each new term's slices start 10 bytes apart:

|----+----|----+----|
0    5    10   15   20

The interval of 10 is determined by:
1. ByteBlockPool.FIRST_LEVEL_SIZE = 5
2. As mentioned above, streamCount = 2 when both term frequency and positions are saved.

The code is as follows:

for (int i = 0; i < streamCount; i++) {
    final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
    intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
}

 

 

 

 

III. Class Member Functions

 

3.1 Constructor

public TermsHashPerField(DocInverterPerField docInverterPerField,
        final TermsHashPerThread perThread,
        final TermsHashPerThread nextPerThread,
        final FieldInfo fieldInfo)

Some index-wide state is passed in through the constructor; fieldInfo carries this field's information.

The buffer memory pools are also passed in:

intPool = perThread.intPool;
charPool = perThread.charPool;
bytePool = perThread.bytePool;

 

 

3.2 Indexing Process

The indexing of one term is completed by the add() function:

 

 

void add() throws IOException {

    // Get the term's characters and its length after tokenization
    final char[] tokenText = termAtt.termBuffer();
    final int tokenTextLen = termAtt.termLength();

    // Compute the hash code used to locate the term in postingsHash
    int downto = tokenTextLen;
    int code = 0;
    while (downto > 0) {
        // scan from the last character down
        char ch = tokenText[--downto];
        code = (code * 31) + ch; // fold each character into the code
    }

    // Initial slot for the term
    int hashPos = code & postingsHashMask;

    // Look up the RawPostingList in the hash table
    p = postingsHash[hashPos];

    // If the slot is occupied by a different term, probe for the next slot
    if (p != null && !postingEquals(tokenText, tokenTextLen)) {
        // Conflict: keep searching different locations in
        // the hash table.
        final int inc = ((code >> 8) + code) | 1;
        do {
            code += inc;
            hashPos = code & postingsHashMask;
            p = postingsHash[hashPos];
        } while (p != null && !postingEquals(tokenText, tokenTextLen));
    }

    // An empty slot was found: postingsHash has never indexed this term
    if (p == null) {

        final int textLen1 = 1 + tokenTextLen;
        // The buffer holding term characters is full; extend it
        if (textLen1 + charPool.charUpto > DocumentsWriter.CHAR_BLOCK_SIZE) { // 16384
            if (textLen1 > DocumentsWriter.CHAR_BLOCK_SIZE) {
                // Term is longer than a whole block: record a prefix and skip it
                if (docState.maxTermPrefix == null)
                    docState.maxTermPrefix = new String(tokenText, 0, 30);
                consumer.skippingLongTerm();
                return;
            }
            charPool.nextBuffer();
        }

        // Pull next free RawPostingList from free list
        p = perThread.freePostings[--perThread.freePostingsCount];

        final char[] text = charPool.buffer;
        final int textUpto = charPool.charUpto;

        // Record the offset address of the term's characters
        p.textStart = textUpto + charPool.charOffset;
        charPool.charUpto += textLen1; // term length + 1 for the separator

        // Copy the term's characters into the buffer, 0xffff-terminated
        System.arraycopy(tokenText, 0, text, textUpto, tokenTextLen);
        text[textUpto + tokenTextLen] = 0xffff;

        assert postingsHash[hashPos] == null;
        postingsHash[hashPos] = p;
        numPostings++;

        // postingsHash expands once it is half full
        if (numPostings == postingsHashHalfSize)
            rehashPostings(2 * postingsHashSize);

        // Init stream slices
        if (numPostingInt + intPool.intUpto > DocumentsWriter.INT_BLOCK_SIZE)
            intPool.nextBuffer(); // int pool exhausted; get a new buffer

        if (DocumentsWriter.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt * ByteBlockPool.FIRST_LEVEL_SIZE)
            bytePool.nextBuffer(); // byte pool exhausted; get a new buffer

        intUptos = intPool.buffer; // records offsets into the byte pool
        intUptoStart = intPool.intUpto;
        intPool.intUpto += streamCount; // streamCount int slots per term

        p.intStart = intUptoStart + intPool.intOffset;

        for (int i = 0; i < streamCount; i++) {
            final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
            intUptos[intUptoStart + i] = upto + bytePool.byteOffset;
        }
        p.byteStart = intUptos[intUptoStart];

        consumer.newTerm(p);

    } else {
        // The term has already been indexed (it occurs again in this document)
        intUptos = intPool.buffers[p.intStart >> DocumentsWriter.INT_BLOCK_SHIFT];
        intUptoStart = p.intStart & DocumentsWriter.INT_BLOCK_MASK;
        consumer.addTerm(p); // append to the term's posting data (compressed)
    }
}
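The hash computation at the top of add() can be tried in isolation. Because the characters are folded from the last index down, the result equals String.hashCode() of the reversed term (a sketch, not Lucene code):

```java
// Sketch of the hash code add() computes before probing.
public class TermHashSketch {
    // Fold the term's characters into an int code, scanning from the end,
    // exactly as the while-loop in add() does.
    static int termCode(char[] text, int len) {
        int code = 0;
        int downto = len;
        while (downto > 0) {
            char ch = text[--downto];
            code = (code * 31) + ch;
        }
        return code;
    }

    public static void main(String[] args) {
        char[] t = "news".toCharArray();
        // Folding runs last-to-first, so this matches the reversed string's
        // standard Java hash: "swen".hashCode().
        System.out.println(termCode(t, t.length) == "swen".hashCode()); // prints true
    }
}
```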

 

IV. Experiment

The following trace data, gathered while indexing two documents, supports the analysis above. The content of document 1 is "新浪新闻" (Sina News) and the content of document 2 is "联合早报" (Lianhe Zaobao).

 

Adding D:\file\2.txt

Term "新":
intUptoStart = 16
intPool.intUpto = 18
p.intStart = 16
intUptos[intUptoStart + 0] = 80
intUptos[intUptoStart + 1] = 85
p.byteStart = 80
bytes[85] = 0

The value of RawPostingList p is as follows:
byteStart = 80
docFreq = 1
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39

TermsHashPerField.writeVInt(1, 0)

 

Term "浪":
intUptoStart = 18
intPool.intUpto = 20
p.intStart = 18
intUptos[intUptoStart + 0] = 90
intUptos[intUptoStart + 1] = 95
p.byteStart = 90
bytes[95] = 2

The value of RawPostingList p is as follows:
byteStart = 90
docFreq = 1
intStart = 18
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 41

TermsHashPerField.writeVInt(1, 1)

 

Term "新" (note: second occurrence of this term):
intUptoStart = 16
intPool.intUpto = 20
p.intStart = 16
bytes[86] = 4 // position information is stored contiguously

The value of RawPostingList p is as follows:
byteStart = 80
docFreq = 2 // the term's frequency in this document is now 2
intStart = 16
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 39

TermsHashPerField.writeVInt(1, 2)

 

Term "闻":
intUptoStart = 20
intPool.intUpto = 22
p.intStart = 20
intUptos[intUptoStart + 0] = 100
intUptos[intUptoStart + 1] = 105
p.byteStart = 100
bytes[105] = 6

The value of RawPostingList p is as follows:
byteStart = 100
docFreq = 1
intStart = 20
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 43

TermsHashPerField.writeVInt(1, 3)

 

Term "200910190147":
intUptoStart = 22
intPool.intUpto = 24
p.intStart = 22
intUptos[intUptoStart + 0] = 110
intUptos[intUptoStart + 1] = 115
p.byteStart = 110
bytes[115] = 0

byteStart = 110
docFreq = 1
intStart = 22
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 45

 

 

Term "D:\file\2.txt":
intUptoStart = 24
intPool.intUpto = 26
p.intStart = 24
intUptos[intUptoStart + 0] = 120
intUptos[intUptoStart + 1] = 125
p.byteStart = 120
bytes[125] = 0

byteStart = 120
docFreq = 1
intStart = 24
lastDocCode = 2
lastDocID = 1
lastPosition = 0
textStart = 58

 

Adding D:\file\3.txt

Term "联":
intUptoStart = 26
intPool.intUpto = 28
p.intStart = 26
intUptos[intUptoStart + 0] = 130
intUptos[intUptoStart + 1] = 135
p.byteStart = 130
bytes[135] = 0

The value of RawPostingList p is as follows:
byteStart = 130
docFreq = 1
intStart = 26
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 72

 

 

Term "合":
intUptoStart = 28
intPool.intUpto = 30
p.intStart = 28
intUptos[intUptoStart + 0] = 140
intUptos[intUptoStart + 1] = 145
p.byteStart = 140
bytes[145] = 2

The value of RawPostingList p is as follows:
byteStart = 140
docFreq = 1
intStart = 28
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 74

 

 

Term "早":
intUptoStart = 30
intPool.intUpto = 32
p.intStart = 30
intUptos[intUptoStart + 0] = 150
intUptos[intUptoStart + 1] = 155
p.byteStart = 150
bytes[155] = 4

The value of RawPostingList p is as follows:
byteStart = 150
docFreq = 1
intStart = 30
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 76

 

 

Term "报":
intUptoStart = 32
intPool.intUpto = 34
p.intStart = 32
intUptos[intUptoStart + 0] = 160
intUptos[intUptoStart + 1] = 165
p.byteStart = 160
bytes[165] = 6

The value of RawPostingList p is as follows:
byteStart = 160
docFreq = 1
intStart = 32
lastDocCode = 4
lastDocID = 2
lastPosition = 0
textStart = 78

 

 

Term "200910190148":
intUptoStart = 34
intPool.intUpto = 36
p.intStart = 34
intUptos[intUptoStart + 0] = 170
intUptos[intUptoStart + 1] = 175
p.byteStart = 170
bytes[175] = 0

Term "D:\file\3.txt":
intUptoStart = 36
intPool.intUpto = 38
p.intStart = 36
intUptos[intUptoStart + 0] = 180
intUptos[intUptoStart + 1] = 185
p.byteStart = 180
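A sketch of the position encoding the trace suggests: the delta from the term's last position appears shifted left by one bit in the byte pool (positions 0, 1, 2, 3 stored as 0, 2, 4, 6), with the low bit presumably reserved to flag an attached payload. The payload-flag interpretation is an assumption inferred from the even stored values, not taken from the Lucene source:

```java
// Sketch of the inferred position encoding: delta-code the position, shift
// left one bit, low bit = payload flag (assumed).
public class ProxCodeSketch {
    static int encode(int position, int lastPosition, boolean hasPayload) {
        int delta = position - lastPosition; // positions are delta-coded
        return hasPayload ? (delta << 1) | 1 : delta << 1;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int pos = 0; pos <= 3; pos++)
            sb.append(encode(pos, 0, false)).append(' ');
        System.out.println(sb.toString().trim()); // prints "0 2 4 6"
    }
}
```

This matches the trace: bytes[85] = 0, bytes[95] = 2, bytes[86] = 4, bytes[105] = 6 for the four term positions of document 1.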

This article is from a CSDN blog; when reproducing it, please credit the source: http://blog.csdn.net/todaylxp/archive/2009/10/25/4726017.aspx
