Suppose there are two articles, 1 and 2.
The content of article 1 is:
Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is:
He once lived in Shanghai.
1. Because Lucene indexes and queries by keyword, we first need to obtain the keywords of the two articles. This usually takes the following steps:
A. What we have is the article content, that is, a string. First we need to find all the words in the string, that is, perform word segmentation. English words are easy to handle because they are separated by spaces. Chinese words run together, so they need special segmentation.
B. Words such as "in", "once" and "too" in the articles carry no practical meaning, and the same is true of many common Chinese function words. Words that do not represent a concept can be filtered out. These are the stop words (stop tokens) described in Lucene's analysis documentation.
C. Users usually want a query for "he" to also find articles containing "He" and "HE", so all words must be normalized to a single case.
D. Users usually want a query for "live" to also find articles containing "lives" and "lived", so "lives" and "lived" need to be reduced to "live".
E. Punctuation marks in the document usually do not represent a concept either, so they can also be filtered out.
In Lucene, the above steps are carried out by the Analyzer class. After processing:
All keywords in article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords in article 2 are: [he] [live] [shanghai]
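The analysis steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's Analyzer API: the names `STOP_WORDS`, `STEMS` and `analyze` are hypothetical, and the lookup table stands in for a real stemmer. Note that "I" survives because it is not among the stop words named above.

```python
import re

STOP_WORDS = {"in", "once", "too"}           # the stop words named in the text
STEMS = {"lives": "live", "lived": "live"}   # toy lookup table, not a real stemmer

def analyze(text):
    """Return the keyword list for one article."""
    tokens = re.findall(r"[A-Za-z]+", text)              # segmentation; drops punctuation
    tokens = [t.lower() for t in tokens]                 # normalize case
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [STEMS.get(t, t) for t in tokens]             # reduce word forms

article_1 = analyze("Tom lives in Guangzhou, I live in Guangzhou too.")
article_2 = analyze("He once lived in Shanghai.")
print(article_1)
print(article_2)
```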
2. With the keywords in hand, we can build the inverted index.
The correspondence above goes from "article number" to "all keywords in the article". The inverted index reverses this relationship so that it goes from "keyword" to "all article numbers that contain the keyword". Articles 1 and 2 become:
Keyword      Article No.
guangzhou    1
he           2
i            1
live         1, 2
shanghai     2
tom          1
Generally, it is not enough to know which articles a keyword appears in. We also need to know how many times and at which positions it appears in each article. There are usually two kinds of position: a) character position, which records that the word starts at the Nth character of the article (the advantage is fast positioning when the keyword is highlighted); b) keyword position, which records that the word is the Nth keyword of the article (the advantage is a smaller index and fast phrase queries). Lucene records the keyword position.
After the "occurrence frequency" and "location" information are added, our index structure becomes:
Keyword      Article No. [frequency]    Positions
guangzhou    1[2]                       3, 6
he           2[1]                       1
i            1[1]                       4
live         1[2], 2[1]                 2, 5, 2
shanghai     2[1]                       3
tom          1[1]                       1
Take the live row as an example of this structure: live appears twice in article 1 and once in article 2. What does its position list "2, 5, 2" mean? We need to read it together with the article numbers and frequencies. live appears twice in article 1, so "2, 5" are its two positions in article 1; it appears once in article 2, so the remaining "2" means that live is the 2nd keyword in article 2.
The above is the core of the Lucene index structure. Note that the keywords are arranged in character order (Lucene does not use a B-tree structure), so Lucene can use binary search to locate a keyword quickly.
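The structure described above can be sketched as follows. This is a minimal in-memory illustration, not how Lucene stores its index on disk; `build_index`, `lookup` and the dictionary layout are hypothetical names chosen for the example.

```python
from bisect import bisect_left

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}

def build_index(docs):
    """Map keyword -> {article number: [1-based keyword positions]}."""
    index = {}
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id], start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

index = build_index(docs)
terms = sorted(index)  # the dictionary column, kept in character order

def lookup(term):
    """Binary-search the sorted dictionary, then return the postings."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return index[term]
    return {}

print(lookup("live"))  # {1: [2, 5], 2: [2]}
```

The frequency of a keyword in an article is simply the length of its position list, which is why frequencies and positions can be read together.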
In its implementation, Lucene stores the three columns above in the term dictionary file, the frequency file (frequencies), and the position file (positions) respectively. The dictionary file stores not only each keyword but also pointers into the frequency file and the position file, through which the frequency and position information of the keyword can be found.
Lucene uses the concept of a field to express where a piece of information is located (for example in the title, the body, or the URL). When the index is built, this field information is also recorded in the dictionary file: every keyword carries field information, because every keyword must belong to one or more fields.
To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are compressed into <prefix length, suffix> pairs. For example, if the current word is "阿拉伯语" ("Arabic language") and the previous word is "阿拉伯" ("Arab"), then "阿拉伯语" is compressed into <3, 语>. Second, numbers are compressed heavily: only the difference from the previous value is saved, which shortens the number and therefore reduces the bytes needed to store it. For example, if the current article number is 16389 (which needs 3 bytes without compression) and the previous article number is 16382, only the difference 7 is saved after compression (a single byte).
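Both compression ideas can be sketched briefly. This is an illustration under simple assumptions (terms already sorted, numbers ascending); `prefix_compress` and `delta_encode` are hypothetical helper names, and the "bone"/"boy" pair is the example used later in the term dictionary section.

```python
def prefix_compress(sorted_terms):
    """Turn each term into (shared prefix length, remaining suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        n = 0
        while n < min(len(prev), len(term)) and prev[n] == term[n]:
            n += 1
        out.append((n, term[n:]))
        prev = term
    return out

def delta_encode(sorted_numbers):
    """Store each number as the difference from the previous one."""
    out, prev = [], 0
    for number in sorted_numbers:
        out.append(number - prev)
        prev = number
    return out

print(prefix_compress(["bone", "boy"]))  # [(0, 'bone'), (2, 'y')]
print(delta_encode([16382, 16389]))      # [16382, 7]
```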
Next, a query against the index shows why the index is worth building.
Suppose we query the word "live". Lucene first binary-searches the dictionary and finds the word, follows the pointer into the frequency file to read all the article numbers, and then returns the result. The dictionary is usually very small, so the whole process is very fast.
By contrast, using an ordinary sequential matching algorithm without an index means string matching against the content of every article. That process is quite slow, and when the number of articles is large the time is often intolerable.
3.2. Lucene relevance scoring formula
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
score_d: the score of document d
sum_t: the sum over all terms t
tf_q: the square root of the number of times term t appears in the query string q
tf_d: the square root of the number of times term t appears in document d
numDocs: the total number of documents in the index
docFreq_t: the total number of documents containing term t
idf_t: log(numDocs / (docFreq_t + 1)) + 1.0
norm_q: sqrt(sum_t((tf_q * idf_t)^2))
norm_d_t: the square root of the total number of terms in document d that are in the same field as term t
boost_t: the boost factor of term t, generally 1.0
coord_q_d: the number of query terms hit in document d divided by the total number of terms in query q
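To make the formula concrete, here is a small sketch that evaluates it for a single-field document. It is an illustration only, not Lucene's scoring code: the function names are hypothetical, every term gets the same boost, and idf is read as log(numDocs / (docFreq_t + 1)) + 1.0, which is an interpretation of the formula above.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def score(query_terms, doc_terms, doc_freqs, num_docs, boost=1.0):
    """Score one single-field document against a query."""
    q_counts, d_counts = Counter(query_terms), Counter(doc_terms)
    if not q_counts or not doc_terms:
        return 0.0
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in q_counts}
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum((math.sqrt(c) * idfs[t]) ** 2 for t, c in q_counts.items()))
    norm_d = math.sqrt(len(doc_terms))  # norm_d_t for the single field
    total, hits = 0.0, 0
    for t, c in q_counts.items():
        if d_counts[t] == 0:
            continue
        hits += 1
        tf_q, tf_d = math.sqrt(c), math.sqrt(d_counts[t])
        total += tf_q * idfs[t] / norm_q * tf_d * idfs[t] / norm_d * boost
    return total * hits / len(q_counts)  # multiply by coord_q_d

doc_1 = ["tom", "live", "guangzhou", "i", "live", "guangzhou"]
doc_2 = ["he", "live", "shanghai"]
doc_freqs = {"tom": 1, "live": 2, "guangzhou": 1, "i": 1, "he": 1, "shanghai": 1}
s1 = score(["live", "tom"], doc_1, doc_freqs, num_docs=2)
s2 = score(["live", "tom"], doc_2, doc_freqs, num_docs=2)
print(s1 > s2)  # article 1 matches both query terms, so it ranks first
```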
3.3. Other Lucene features
3.3.1. Boosting features
Lucene provides a configurable boosting parameter for Document and Field. Its purpose is to tell Lucene that some records are more important and should be preferred when searching. For example, when searching you may consider the pages of a few portal sites more important than those of spam sites.
Lucene's default boosting value is 1.0. If you think a field is important, you can set its boosting higher, for example to 1.5.
Setting boosting on a Document sets the base boosting for each of its fields. The actual boosting of a field is then (Document boosting * Field boosting), as if the same boosting had been set on every field.
There is a boosting parameter in Lucene's scoring formula, but I guess most people will not study the formula (it is complicated), and the formula cannot tell you the best value anyway. What we can do is change boosting little by little and observe in real tests how much it changes the search results.
In general there is no need to use boosting, because a bad setting can mess up the search results. Also, if you only want to boost a single field, you can get a similar effect by placing that field earlier.
3.3.2. Indexing dates
Dates are one of Lucene's special considerations, because we may need to run range searches on them. Field.Keyword(String, Date) supports this: Lucene converts the date to a string.
Note that the date here is accurate to the millisecond, which may cost performance unnecessarily. If we do not need the exact time, we can instead convert the date to a form like yyyymmdd and index it with Field.Keyword(String, String). A PrefixQuery on the yyyy part then gives a simplified date range search (a small tip).
The Lucene documentation mentions that it cannot handle times before 1970, which seems to be a problem inherited from earlier generations of computer systems.
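The yyyymmdd trick works because such strings sort in the same order as the dates they represent. The sketch below shows this with plain Python strings; it does not call Lucene, and `date_key` is a hypothetical helper. The prefix filter plays the role a PrefixQuery would play, and the string comparison plays the role of a range search.

```python
from datetime import date

def date_key(d):
    """Format a date as a sortable yyyymmdd string."""
    return d.strftime("%Y%m%d")

dates = [date(2005, 7, 1), date(2004, 12, 31), date(2006, 3, 9), date(2005, 1, 15)]
keys = sorted(date_key(d) for d in dates)

in_2005 = [k for k in keys if k.startswith("2005")]             # prefix search on yyyy
first_half = [k for k in keys if "20050101" <= k <= "20050630"]  # range search
print(keys, in_2005, first_half)
```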
3.3.3. Indexing numbers
If a number is just plain data, as in "China has 56 ethnic groups", you can simply treat it as a string.
If the number also carries a numeric meaning, such as a price, and we need range searches (goods between 20 and 30 yuan), we need a trick: for example, convert the three numbers 3, 34 and 100 to 003, 034 and 100. After this processing, sorting by character gives the same order as sorting by value. Lucene sorts by character internally, so the order is 003 -> 034 -> 100 rather than 100 -> 3 -> 34.
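The effect of zero padding can be checked with plain string sorting. This is a sketch of the idea only; `pad` is a hypothetical helper and the fixed width of 3 is an assumption that must cover the largest value to be indexed.

```python
def pad(number, width=3):
    """Left-pad a non-negative integer with zeros to a fixed width."""
    return str(number).zfill(width)

prices = [100, 3, 34, 25]
unpadded = sorted(str(p) for p in prices)  # character order, wrong numerically
padded = sorted(pad(p) for p in prices)    # character order equals numeric order
between_20_and_30 = [p for p in padded if pad(20) <= p <= pad(30)]
print(unpadded, padded, between_20_and_30)
```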
3.3.4. Sorting
Lucene sorts by relevance (score) by default. To support other sort orders, such as by date, the field we sort on must be indexed but not tokenized when it is added, and sorting is only possible on numbers, dates, and strings.
3.3.5. Tuning Lucene's IndexWriter
IndexWriter provides several parameters that can be set, listed below:
org.apache.lucene.mergeFactor
Controls the size of index segments and how often they are merged.
org.apache.lucene.maxMergeDocs
Limits the number of documents in a segment.
org.apache.lucene.minMergeDocs
The number of documents buffered in memory; once this number is exceeded they are written to disk.
org.apache.lucene.maxFieldLength
The maximum number of terms in a field. Terms beyond this limit are not indexed into the field and therefore cannot be searched.
The detailed behavior of these parameters is more complex. mergeFactor plays a double role:
1. A segment is written for every mergeFactor documents. For example, a segment is written for every 10 documents.
2. Every mergeFactor small segments are merged into one larger segment. For example, with a value of 10, ten 10-document segments are merged into one 100-document segment, ten 100-document segments are merged into one 1000-document segment, and so on.
After merging, the number of documents in a segment is therefore a power of mergeFactor.
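The two roles of mergeFactor can be illustrated with a small simulation. This is a deliberately simplified model, not Lucene's merge code: each added document is treated as a one-document segment, and whenever the newest merge_factor segments have the same size they are merged. The function name `add_documents` is hypothetical.

```python
def add_documents(n_docs, merge_factor):
    """Return the list of segment sizes after adding n_docs one at a time."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    segments = []
    for _ in range(n_docs):
        segments.append(1)
        # Merge while the newest merge_factor segments all have the same size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
    return segments

print(add_documents(1000, 10))  # [1000]
print(add_documents(1234, 10))  # [1000, 100, 100, 10, 10, 10, 1, 1, 1, 1]
```

In this model every segment size is a power of merge_factor, which matches the observation above.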
Simply put, the larger mergeFactor is, the more memory and the less disk I/O the system uses. If you want to index data in bulk, setting mergeFactor high is the right choice. When mergeFactor is small, the number of index segments grows and searching becomes less efficient. However, each small increase in mergeFactor raises memory consumption considerably (the relationship is exponential), so take care not to run out of memory.
Setting maxMergeDocs to a small value forces a segment to be written once it reaches a certain number of documents, which offsets some of the effect of mergeFactor.
minMergeDocs acts like a small cache: the first documents up to this number stay in memory and are not written to disk. None of these parameters has a single best value; they must be tuned to the actual situation.
maxFieldLength can be set at any time. After it is set, fields indexed from then on are truncated to the new length, while what was indexed earlier does not change. It can be set to Integer.MAX_VALUE.
3.3.6. Converting between RAMDirectory and FSDirectory
RAMDirectory (RAMD) is much more efficient than FSDirectory (FSD), so we can manually use RAMD as a buffer for FSD. That way we do not have to tune so many FSD parameters: we build the index in RAM and write it back to FSD periodically (or according to some other policy).
3.3.7. Optimizing the index for queries
The IndexWriter.optimize() method optimizes the index for querying. The parameter tuning described earlier optimizes the indexing process itself; here the optimization targets queries, mainly by reducing the number of index files.
During optimization, Lucene copies the old index, merges it, and deletes the old index once the merge is complete. Disk usage therefore increases during this period, and the I/O load rises as well. At the moment the optimization completes, disk usage is twice what it was before the optimization. Searches can still be run while optimize is in progress.
3.3.8. Concurrent operations on Lucene and the locking mechanism
· All read-only operations can run concurrently.
· While the index is being modified, all read-only operations can still run concurrently.
· Index-modifying operations cannot run concurrently: an index can be held by only one thread at a time.
· Index optimization, merging, and adding documents are all modifying operations.
· IndexWriter and IndexReader instances can be shared by multiple threads. They synchronize internally, so callers do not need to add external synchronization.
Locking is done with files. By default the lock files are stored in java.io.tmpdir; a different directory can be specified with -Dorg.apache.lucene.lockdir=xxx. There are two lock files, write.lock and commit.lock. The lock files prevent parallel modifications of the index; if a parallel modification is attempted, Lucene throws an exception. Locking can be disabled with -DdisableLuceneLocks=true. This is generally dangerous unless the operating system or the physical medium guarantees that the index is read-only, for example when the index files are burned onto a CD-ROM.
4. Lucene document structure
The most basic concepts in Lucene are the index, the document, the field (domain), and the term (item).
· An index contains a sequence of documents.
· A document is a sequence of fields.
· A field is a sequence of terms.
· A term is a string.
The same string appearing in two different fields is considered two different terms. A term is therefore really represented by a pair of strings: the first is the field name and the second is the text within the field.
4.1. Lucene concept details
4.1.1. Field types
In Lucene, the text of a field may be stored in the index literally, in a non-inverted way. A field that is inverted is called indexed. A field may be both stored and indexed.
The text of a field may be tokenized into several terms to be indexed, or it may be indexed as a single term. Most fields are tokenized, but sometimes it is useful to index certain identifier fields as a single term.
4.1.2. Segments
A Lucene index may consist of multiple sub-indexes, called segments. Each segment is a complete, independent index that can be searched on its own. An index evolves by:
1. Creating new segments for newly added documents.
2. Merging existing segments.
A search may involve multiple segments and/or multiple indexes, and each index may consist of several segments.
4.1.3. Document numbers
Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered 0, and each subsequent document gets a number one greater than the previous one.
Note that document numbers may change, so be careful when storing these numbers outside Lucene. In particular, numbers change in the following situations:
· The numbers stored in each segment are unique only within that segment, so they must be converted before they can be used in a wider context. The standard technique is to allocate each segment a range of values based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base value is added. To convert an external value back to a segment-specific number, the segment is identified by the range the external value falls in, and the segment's base value is subtracted. For example, when two five-document segments are combined, the first segment has a base value of 0 and the second a base value of 5. Document 3 of the second segment then has the external value 8.
· When documents are deleted, gaps appear in the numbering. These gaps are removed when the index is merged: deleted documents are dropped as segments are merged, so a freshly merged segment has no gaps in its numbering.
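The base-value conversion for the two five-document segments can be sketched as follows. The names `bases`, `to_external` and `to_segment` are hypothetical and chosen for this illustration only.

```python
from bisect import bisect_right

segment_sizes = [5, 5]

# The base value of a segment is the number of documents in earlier segments.
bases, total = [], 0
for size in segment_sizes:
    bases.append(total)
    total += size

def to_external(segment, local_doc):
    """Segment-local document number -> index-wide document number."""
    return bases[segment] + local_doc

def to_segment(external_doc):
    """Index-wide document number -> (segment, segment-local number)."""
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]

print(bases)              # [0, 5]
print(to_external(1, 3))  # 8
print(to_segment(8))      # (1, 3)
```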
4.1.4. Index information
Each index segment maintains the following information:
· Field names. The set of all field names used in the index.
· Stored field values. For each document, a list of attribute-value pairs in which the attributes are field names. This list stores auxiliary information about a document, such as its title, its URL, or an identifier for accessing a database. The set of stored fields is what is returned for each hit when searching. The table is keyed by document number.
· Term dictionary. A dictionary of all the terms used in all the indexed fields of all the documents. It also contains the number of documents that contain each term, and pointers to the term's frequency and position data.
· Term frequency data. For each term in the dictionary, the numbers of all the documents that contain the term, and the frequency of the term in each of those documents.
· Term position data. For each term in the dictionary, the positions at which the term occurs in each document.
· Normalization factors. For each field in each document, a value that is multiplied into the score for hits on that field.
· Deleted documents. An optional file indicating which documents have been deleted.
The following sections describe each of these in detail.
4.1.5. File naming
All files belonging to a segment have the same name and different extensions. The extensions correspond to the file formats described below.
Typically, an index is stored in one directory and all of its segments are stored in that directory. Doing otherwise is possible, but performance is lower.
4.2. Lucene basic data types (primitive types)
The most basic data type is the byte (8 bits). Files are accessed as sequences of bytes, and all other data types are defined as sequences of bytes, so the file format is byte-order independent.
UInt32: a 32-bit unsigned integer made up of four bytes, high-order bytes first. UInt32 --> <Byte>^4
UInt64: a 64-bit unsigned integer made up of eight bytes, high-order bytes first. UInt64 --> <Byte>^8
VInt: a variable-length positive integer. The high-order bit of each byte indicates whether more bytes follow. The low-order seven bits of each byte hold part of the integer value, least significant group first. Thus values from 0 to 127 are stored in a single byte, and values from 128 to 16,383 in two bytes.
Example of VInt encoding
... This encoding compresses the data while remaining efficient to decode.
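The VInt rule can be sketched in a few lines. This is an illustration of the encoding as described above, written from the description rather than taken from Lucene's source; `write_vint` and `read_vint` are hypothetical names.

```python
def write_vint(value):
    """Encode a non-negative integer as VInt bytes (low-order 7 bits first)."""
    if value < 0:
        raise ValueError("VInt values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)                      # high bit clear: last byte
    return bytes(out)

def read_vint(data, pos=0):
    """Decode one VInt starting at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

for n in (0, 127, 128, 16383, 16384):
    print(n, write_vint(n).hex())
```

With this encoding the article number 16389 needs three bytes, while the delta 7 needs only one, which is the saving described in the compression example earlier.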
4.2.2. Chars and String
Chars: Lucene writes Unicode character sequences using the standard UTF-8 encoding.
String: Lucene writes a string as a VInt followed by the characters. The VInt gives the length of the string.
String --> VInt, Chars
4.3. Index files (per-index files)
4.3.1. The segments file
The active segments of an index are recorded in the segments file. Each index contains exactly one such file, named "segments". It lists the name and the size of each segment in order.
Segments --> SegCount, <SegName, SegSize>^SegCount
SegCount, SegSize --> UInt32
SegName --> String
SegName is the name of the segment and is used as the file name prefix for all the other files of that segment.
SegSize is the number of documents contained in the segment.
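The grammar of the segments file can be exercised with a small writer and reader. This is a sketch under stated assumptions, not Lucene's own I/O code: UInt32 is written with the high-order byte first, segment names are restricted to ASCII so that the String length in characters equals the length in bytes, and all function names and the segment names "_1" and "_2" are hypothetical.

```python
import struct

def write_vint(value):
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(data, pos):
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7

def write_segments(segments):
    """segments: list of (name, size) -> bytes laid out as SegCount, <SegName, SegSize>*."""
    out = struct.pack(">I", len(segments))             # SegCount as UInt32
    for name, size in segments:
        raw = name.encode("ascii")                     # assumption: ASCII names only
        out += write_vint(len(raw)) + raw              # SegName as String
        out += struct.pack(">I", size)                 # SegSize as UInt32
    return out

def read_segments(data):
    (count,) = struct.unpack_from(">I", data, 0)
    pos, segments = 4, []
    for _ in range(count):
        length, pos = read_vint(data, pos)
        name = data[pos:pos + length].decode("ascii")
        pos += length
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        segments.append((name, size))
    return segments

encoded = write_segments([("_1", 5), ("_2", 5)])
print(encoded.hex(), read_segments(encoded))
```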
4.3.2. Lock files
Several files are used to indicate that another process is using an index.
· If the "commit.lock" file exists, a process is either rewriting the "segments" file and deleting obsolete segment files, or reading the "segments" file and opening the files of the segments it names. The lock file prevents another process from deleting files after a process has read the "segments" file but before it has opened all the files of the segments named there.
· If the "index.lock" file exists, a process is adding documents to the index or removing documents from it. This file prevents several processes from modifying an index at the same time.
4.3.3. The deletable file
A file named "deletable" contains the names of files that are no longer used by the index but could not yet be deleted. This only happens on the Win32 platform, where a file cannot be deleted while it is still open.
Deletable --> DelableCount, <DelableName>^DelableCount
DelableCount --> UInt32
DelableName --> String
4.3.4. Files contained in segments (per-segment files)
The remaining files are all per-segment, and are therefore distinguished by suffix.
All field names are stored in the field info file, which has the suffix .fnm.
FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte
Currently only the low-order bit of FieldBits is used. It is 1 for indexed fields and 0 for non-indexed fields.
Fields are numbered by their order in this file. Thus field 0 is the first field in the file, field 1 is the next, and so on. This works in the same way as document numbers.
4.3.5. Stored field values (stored fields)
Stored field values are kept in two files:
1. The field index (.fdx file).
For each document, this file contains a pointer to its field data:
FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
FieldValuesPosition --> UInt64
FieldValuesPosition is the position of a document's field data within the field data file. Because the field index contains fixed-length entries, it can easily be accessed randomly: the position of document n's field data is the UInt64 at byte offset n * 8 in this file.
2. The field data (.fdt file).
The field data of each document is stored as follows:
FieldData (.fdt) --> <DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt
Bits --> Byte
Value --> String
Currently only the low-order bit of Bits is used. It is 1 for tokenized fields and 0 for non-tokenized fields.
4.3.6. Term dictionary
The term dictionary is represented by the following two files:
1. The term infos (.tis file).
TermInfoFile (.tis) --> TermCount, TermInfos
TermCount --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta --> VInt
The term infos are ordered by term. Terms are ordered first by the text of their field name and then by the text of the term itself.
Term text prefixes are shared: PrefixLength is the number of initial characters shared with the previous term's text, and Suffix holds the rest. Thus, if the previous term's text is "bone" and the current term is "boy", PrefixLength is 2 and Suffix is "y".
FieldNum is the term's field number; the field names are stored in the .fnm file.
DocFreq is the number of documents that contain the term.
FreqDelta determines the position of this term's TermFreqs within the .frq file. In detail, it is the offset from the previous term's data (or 0 for the first term in the file).
ProxDelta determines the position of this term's TermPositions within the .prx file. In detail, it is the offset from the previous term's data (or 0 for the first term in the file).
2. The term info index (.tii file).
This file contains every 128th entry from the .tis file, in the same order as the entries appear in the .tis file. It is designed to be read entirely into memory and then used to provide random access to the .tis file.
The structure of this file is very similar to that of the .tis file, with the addition of one variable per entry, IndexDelta.
TermInfoIndex (.tii) --> IndexTermCount, TermIndices
IndexTermCount --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VInt
IndexDelta determines the position of this term's TermInfo within the .tis file. In detail, it is the offset from the previous entry (or 0 for the first entry in the file).
4.3.7. Term frequencies (frequencies)
The .frq file contains, for each term, the list of documents that contain the term together with the term's frequency in each document.
FreqFile (.frq) --> <TermFreqs>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
DocDelta, Freq --> VInt
TermFreqs are ordered by term (the term is implicit, taken from the .tis file).
TermFreq entries are ordered by increasing document number.
DocDelta determines both the document number and the frequency. In detail, DocDelta / 2 is the difference between this document number and the previous document number (the previous number is taken to be 0 for the first entry in TermFreqs). When DocDelta is odd, the frequency is 1. When DocDelta is even, the frequency is read as a further VInt, Freq.
For example, the TermFreqs for a term that occurs once in document 7 and three times in document 11 is the following sequence of VInts:
15, 8, 3
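The DocDelta rule can be checked with a short encoder and decoder. This sketch works on lists of integers rather than on VInt bytes, and the function names are hypothetical. Applying the rule as stated to the example gives 15 for document 7 (7 * 2 + 1), then 8 for document 11 ((11 - 7) * 2), followed by the frequency 3.

```python
def encode_term_freqs(postings):
    """postings: list of (document number, frequency) in ascending document order."""
    out, prev = [], 0
    for doc, freq in postings:
        delta = doc - prev
        if freq == 1:
            out.append(delta * 2 + 1)      # odd DocDelta: frequency is 1
        else:
            out.extend([delta * 2, freq])  # even DocDelta: Freq follows
        prev = doc
    return out

def decode_term_freqs(values):
    postings, doc, i = [], 0, 0
    while i < len(values):
        doc += values[i] >> 1              # DocDelta / 2 is the document gap
        if values[i] & 1:
            freq = 1
        else:
            i += 1
            freq = values[i]
        postings.append((doc, freq))
        i += 1
    return postings

print(encode_term_freqs([(7, 1), (11, 3)]))  # [15, 8, 3]
print(decode_term_freqs([15, 8, 3]))         # [(7, 1), (11, 3)]
```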
4.3.8. Term positions (positions)
The .prx file contains, for each term, the list of positions at which the term occurs in each document.
ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta>^Freq
PositionDelta --> VInt
TermPositions are ordered by term (the term is implicit, taken from the .tis file).
Positions entries are ordered by increasing document number.
PositionDelta is the difference between this position and the previous position in the same document (the previous position is taken to be 0 for the first occurrence in the document).
For example, the TermPositions for a term that occurs as the 4th term in one document, and as the 5th and 9th terms in a later document, is the following sequence of VInts:
4, 5, 4
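The position deltas in this example can be reproduced with the following sketch. It works on lists of integers rather than VInt bytes; the function names are hypothetical, and the decoder takes the per-document frequencies as an argument because they come from the frequency data rather than from the position data itself.

```python
def encode_positions(per_doc_positions):
    """per_doc_positions: one ascending position list per document."""
    out = []
    for positions in per_doc_positions:
        prev = 0                       # deltas restart in every document
        for p in positions:
            out.append(p - prev)
            prev = p
    return out

def decode_positions(values, freqs):
    """freqs: number of occurrences in each document, in document order."""
    result, i = [], 0
    for freq in freqs:
        pos, positions = 0, []
        for _ in range(freq):
            pos += values[i]
            positions.append(pos)
            i += 1
        result.append(positions)
    return result

print(encode_positions([[4], [5, 9]]))      # [4, 5, 4]
print(decode_positions([4, 5, 4], [1, 2]))  # [[4], [5, 9]]
```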
4.3.9. Normalization factors
The .nrm file contains the normalization factors for each document. The normalization factor is multiplied into the score for hits on that field.
Norms (.nrm) --> <Byte>^SegSize
Each byte encodes a floating-point value. Bits 0-2 contain the 3-bit mantissa and bits 3-7 contain the 5-bit exponent.
These bytes are converted to IEEE standard single-precision floating-point values as follows:
1. If the byte is 0, the value is floating-point 0;
2. Otherwise, set the sign bit of the new float to 0;
3. Add 48 to the exponent from the byte and use the result as the exponent of the new float;
4. Map the mantissa from the byte to the high-order 3 bits of the new float's mantissa;
5. Set the low-order 21 bits of the new float's mantissa to 0.
4.3.10. Deleted documents
The .del file is optional and exists only when a segment contains deletions:
Deletions (.del) --> ByteCount, BitCount, Bits
ByteCount, BitCount --> UInt32
Bits --> <Byte>^ByteCount
ByteCount is the number of bytes in Bits. It is typically (SegSize / 8) + 1.
BitCount is the number of bits that are currently set in Bits.
Bits contains one bit for each document, in order. A set bit means that the document with the corresponding number has been deleted. Bits are ordered from least to most significant. Thus, if Bits contains the two bytes 0x00 and 0x02, document 9 has been deleted.
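The bit lookup in this example can be sketched as follows; `is_deleted` is a hypothetical helper, and the bit order (least significant bit first within each byte) follows the description above.

```python
def is_deleted(bits, doc):
    """Return True if the bit for document number doc is set."""
    return bool(bits[doc >> 3] & (1 << (doc & 7)))

bits = bytes([0x00, 0x02])
deleted = [doc for doc in range(len(bits) * 8) if is_deleted(bits, doc)]
print(deleted)  # [9]
```

Document 9 lives in byte 1 (9 // 8) at bit 1 (9 % 8), and 0x02 is exactly the byte with bit 1 set.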
The file formats above contain several limitations. The maximum number of documents is limited to 32 bits, that is, close to 4 billion. This is not a problem today, but it may become one in the long run. These values should therefore be replaced with UInt64 values or, better still, with VInt values, which have no upper limit.
There are only two places where the code requires a fixed-length value:
1. The FieldValuesPosition variable (stored in the field index file, .fdx). It is already a UInt64, so it poses no problem.
2. The TermCount variable (stored in the .tis file). It is the last thing written to the file but the first thing read, so it sits at the front of the file. The indexing code first writes a 0 there and then overwrites it once the rest of the file has been written. So wherever it is stored it must be a fixed-length value, and it should be changed to UInt64.
Apart from these, all UInt values could be changed to VInt to remove the limits.