Java Search Engine: Lucene Study Notes 2

Source: Internet
Author: User
Contents
  • Boosting features
  • Indexing dates
  • Indexing numbers
  • Sort
  • IndexWriter tuning
  • RAMDirectory and FSDirectory
  • Optimizing indexes for queries
  • Concurrency and locking in Lucene
  • Locking
  • Debugging IndexWriter
Boosting features

Lucene provides a configurable boost parameter for documents and fields. Its purpose is to tell Lucene that some records are more important than others and should be given priority when searching. For example, you might want pages from major portal sites to rank above pages from spam sites.

The default boost in Lucene is 1.0. If you consider a field important, you can raise its boost to 1.5, 1.2, and so on. Setting a boost on a document sets a baseline for all of its fields: the effective boost of a field is the document boost multiplied by the field boost (document-boost * field-boost).
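The multiplication rule above can be sketched in plain Java (the 1.5 and 1.2 values are arbitrary illustrations from the text, not recommended settings):

```java
public class BoostDemo {
    // Effective boost of a field = document boost * field boost,
    // per the combination rule described above.
    static double effectiveBoost(double documentBoost, double fieldBoost) {
        return documentBoost * fieldBoost;
    }

    public static void main(String[] args) {
        // A document boosted to 1.5 whose field is boosted to 1.2
        // scores as if that field had a boost of 1.8.
        System.out.println(effectiveBoost(1.5, 1.2));
    }
}
```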

The boost appears as a factor in Lucene's scoring formula, but few people study that formula in depth (it is complicated), and it offers no way to derive an optimal value. In practice, all you can do is adjust the boost a little at a time and observe, in real testing, how much it affects the search results.

In general there is no need to use boosting: used badly, it will scramble your search results, and if the goal is to favor a particular field, querying that field directly can achieve a similar effect.

Indexing dates

Dates need special consideration in Lucene because we may want to run range searches over them. Field.Keyword(String, Date) supports this by converting the Date into a string. Note that this conversion is accurate to the millisecond, which may cause unnecessary performance loss. If you do not need that precision, you can format the date as yyyyMMdd yourself and index it with Field.Keyword(String, String); as a bonus (tip), a PrefixQuery on the year, e.g. "1970", then gives a simplified form of date-range search. Lucene's documentation mentions that it cannot handle dates before 1970, which appears to be a legacy of older computer systems (the Unix epoch).
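A minimal sketch of the yyyyMMdd reduction, using only the JDK (the Lucene indexing call itself is omitted):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateKeyDemo {
    // Reduce a Date to day precision so the indexed term is short and
    // lexicographic order matches chronological order.
    static String toDayKey(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        // 0L is the epoch: 1970-01-01 UTC.
        System.out.println(toDayKey(new Date(0L)));  // 19700101
        // The resulting string would be indexed with
        // Field.Keyword(name, value), and a PrefixQuery on "1970"
        // would match the whole year.
    }
}
```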

Indexing numbers

  1. If the number is just a label rather than a quantity, for example the 56 ethnic groups of China, you can simply index it as an ordinary string.
  2. If the number carries a value that you want to range-search over, such as a price (goods between 20 and 30 yuan), a trick is needed: pad the numbers to a fixed width, converting 3, 34, 100 into 003, 034, 100. After this padding, sorting by string gives the same order as sorting by value; Lucene sorts strings internally, so you get 003 -> 034 -> 100 instead of the wrong 100 -> 3 -> 34.
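The padding trick can be sketched with plain JDK calls; lexicographic sorting of the padded strings then agrees with numeric order:

```java
import java.util.Arrays;

public class PadDemo {
    // Zero-pad to a fixed width so string order equals numeric order.
    // Width 3 is just the example from the text; pick a width wide
    // enough for the largest value you expect to index.
    static String pad(int n) {
        return String.format("%03d", n);
    }

    public static void main(String[] args) {
        String[] keys = { pad(3), pad(34), pad(100) };
        Arrays.sort(keys);  // plain string sort, as Lucene does internally
        System.out.println(Arrays.toString(keys));  // [003, 034, 100]

        // Without padding, string order is wrong: "100" < "3" < "34".
        String[] raw = { "3", "34", "100" };
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));   // [100, 3, 34]
    }
}
```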
Sort

Lucene sorts by relevance (score) by default. To sort on another field, such as a date, the field must be indexed but must not be tokenized (no word segmentation), and only numbers, dates, and strings can be sorted on.

IndexWriter tuning

IndexWriter provides several tunable parameters, listed below:

  Parameter        System property                 Default            Description
  mergeFactor      org.apache.lucene.mergeFactor   10                 Controls segment size and merge frequency.
  maxMergeDocs     org.apache.lucene.maxMergeDocs  Integer.MAX_VALUE  Limits the number of documents per segment.
  minMergeDocs     org.apache.lucene.minMergeDocs  10                 Number of documents buffered in memory; once exceeded, they are written to disk.
  maxFieldLength   (none)                          1000               Maximum number of terms per field; terms beyond the limit are not indexed and cannot be searched.

The details behind these parameters are more involved. mergeFactor plays a dual role:

  1. A new segment is written for every mergeFactor documents; for example, one segment per 10 documents.
  2. Whenever mergeFactor segments of the same size have accumulated, they are merged into one larger segment: 10 documents merge into one segment, 10 such segments merge into a larger one, those merge again in turn, and so on, so segment sizes grow as powers of mergeFactor.
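Under these two rules, the segment structure after adding N documents mirrors the base-mergeFactor digits of N. A small idealized simulation (not Lucene code, and ignoring minMergeDocs and maxMergeDocs) of the cascade logic:

```java
import java.util.Arrays;

public class MergeCascadeDemo {
    // Idealized model of the merge cascade: one segment per mergeFactor
    // documents, and every mergeFactor same-sized segments merge into
    // one larger segment. The state after docCount documents is then the
    // base-mergeFactor digits of docCount:
    // result[i] = number of segments holding mergeFactor^i documents
    // (result[0] counts documents not yet grouped into a full segment).
    static int[] levels(int docCount, int mergeFactor) {
        int n = 0;
        for (int d = docCount; d > 0; d /= mergeFactor) n++;
        int[] result = new int[Math.max(n, 1)];
        for (int i = 0; docCount > 0; i++, docCount /= mergeFactor) {
            result[i] = docCount % mergeFactor;
        }
        return result;
    }

    public static void main(String[] args) {
        // 735 documents with mergeFactor 10: 5 loose documents,
        // 3 segments of 10 docs, 7 segments of 100 docs.
        System.out.println(Arrays.toString(levels(735, 10)));  // [5, 3, 7]
    }
}
```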

To put it simply, the larger the mergeFactor, the more memory the system uses and the less disk work it does, so for batch index building you can set mergeFactor to a large value. But a larger mergeFactor also leaves more index files on disk, which lowers search efficiency, and even a small increase in mergeFactor can raise memory consumption sharply (the relationship is exponential), so be careful not to run out of memory.
Setting maxMergeDocs to a small value caps the number of documents per segment, which offsets part of mergeFactor's effect.
minMergeDocs is effectively a small cache: up to that many documents are kept in memory before being written to disk. None of these parameters has a universally optimal value; they must be tuned to the actual workload.
maxFieldLength can be changed at any time. After a change, fields indexed from then on are truncated to the new length; already-indexed content is unaffected. It can be set to Integer.MAX_VALUE.

RAMDirectory and FSDirectory

RAMDirectory (RAMDir) is much faster than FSDirectory (FSDir), so you can use a RAMDirectory as a manual buffer in front of an FSDirectory: instead of tuning all the FSDirectory parameters, build the index in RAM and periodically flush it to the FSDirectory on disk.
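The buffering pattern itself (batch in fast memory, flush to slow storage every N items) can be sketched without Lucene. This is only an illustration of the pattern, with an arbitrary FLUSH_EVERY value; in Lucene the in-memory side would be a RAMDirectory and the flush would merge it into the disk-based index:

```java
import java.util.ArrayList;
import java.util.List;

public class RamBufferDemo {
    // Arbitrary example batch size, not a recommended Lucene setting.
    static final int FLUSH_EVERY = 100;

    final List<String> buffer = new ArrayList<>();  // stands in for a RAMDirectory
    int flushes = 0;                                // counts writes to "disk"

    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= FLUSH_EVERY) {
            flush();
        }
    }

    void flush() {
        // Real code would merge the in-memory index into the
        // disk-based index here, then start a fresh RAM buffer.
        buffer.clear();
        flushes++;
    }

    public static void main(String[] args) {
        RamBufferDemo demo = new RamBufferDemo();
        for (int i = 0; i < 250; i++) {
            demo.add("doc-" + i);
        }
        // 250 adds -> 2 full flushes, 50 documents still buffered in RAM.
        System.out.println(demo.flushes + " flushes, "
                + demo.buffer.size() + " buffered");
    }
}
```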

Optimizing indexes for queries

The IndexWriter.optimize() method optimizes an index for querying. The parameter tuning discussed earlier optimizes the indexing process; optimize() optimizes the query side, mainly by reducing the number of index files so that fewer files have to be opened per query. During optimization Lucene copies the old segments and merges them, deleting the old segments once the merge completes, so disk usage and I/O load rise while it runs; disk usage can reach roughly twice its pre-optimization size. Searches can still be performed while optimize() is running.

Concurrency and locking in Lucene

  • All read-only operations can be concurrently performed.
  • Read-only operations can still proceed concurrently while the index is being modified.
  • You cannot modify the index concurrently. One index can only be occupied by one thread.
  • Index optimization, merging, and addition are all modification operations.

Instances of IndexWriter and IndexReader can be shared by multiple threads. They synchronize internally, so no external synchronization is required.

Locking

Lucene uses files for locking. By default the lock files are stored in java.io.tmpdir; you can specify a different directory with -Dorg.apache.lucene.lockDir=xxx. There are two lock files, write.lock and commit.lock, which prevent concurrent modification of an index: if a conflicting operation is attempted, Lucene throws an exception. You can disable locking with -DdisableLuceneLocks=true, but this is generally dangerous unless the index is guaranteed read-only at the operating-system or physical level, for example when it has been burned onto a CD-ROM.

Debugging IndexWriter

IndexWriter has an infoStream variable to which debugging information is written; you can set it to System.out.
