Does Lucene support metadata search?

Source: Internet
Author: User
Tags idf

Lucene is not a complete full-text search application but a full-text indexing engine toolkit written in Java. It can easily be embedded into applications to give them full-text indexing and retrieval. Lucene aims to add full-text retrieval to small and medium-sized applications. (Reference: http://www.chedong.com/tech/lucene.html)

Lucene consists of several modules, including analysis (word segmentation), indexing, and search. It supports single-keyword, range, and phrase queries, providing powerful support for building a full-text search engine. Lucene can analyze and index not only file content but also file metadata. For example, Lucene can be used to build a library search engine, in which case the author, publisher, publication date, and number of each book can be indexed as fields.

On Linux, find can search by file metadata, but it is slow: it has to traverse the directory tree and test every file against the search conditions one by one. Can Lucene be used to index file metadata and so speed up metadata search?

Differences between full-text search and metadata search:

1. The purpose of a full-text search is usually to find the locations where a keyword occurs (for example, when browsing source code you look for the places where a function appears and may edit them, such as by replacing the name). The first occurrence may already satisfy the user, or the user may step through several occurrences. Lucene caches the results, but only a portion of them at a time; if the user needs later results, Lucene expands the cache, doubling its size. So if the first few results meet the user's needs, the remaining results never have to be read from the index on disk.

For example, if you search for "China" in a PDF file, the search engine jumps to the first place where "China" appears. If that is not the place you want, you click Next. The information for the next location is already cached in memory by Lucene (the initial cache size can be set) and can be read directly. If the user still has not found the desired location after many clicks and the requested position exceeds the initial cache size, Lucene expands the cache to twice the accessed position and reads the content that is not yet in memory into the cache.

For retrieval across multiple documents, Lucene scores the results and sorts them by score, so that even when there are many matches the user is not overwhelmed by a huge result set. Lucene scores result documents as follows: score = TF * IDF * boost * lengthnorm

1) TF: the square root of the number of times the keyword appears in the document. The more often the keyword appears, the higher the document's score.

2) IDF: the inverse document frequency. The default value is 1 + log(numdocs/docfreq). That is, the fewer documents contain the keyword, the higher the score of a document that does contain it.

3) boost: the boost factor (the only term that the user can set). The higher the boost, the higher the document's score.

4) lengthnorm: determined by the length of the field being searched. The longer the field content, the lower the document's score. For example, a match in the full text scores lower than a match in a short attribute field.
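The four factors above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this article, not Lucene's actual code: the function names are made up, the IDF follows the formula quoted above (Lucene's real default similarity differs in small details, such as adding 1 to docfreq), and the 1/sqrt(length) form of lengthnorm is an assumption, since the text only says that longer fields score lower.

```python
import math

def tf(term_freq):
    # Square root of the number of occurrences of the keyword in the document.
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    # Formula as quoted in the article: 1 + log(numdocs / docfreq).
    return 1.0 + math.log(num_docs / doc_freq)

def length_norm(num_terms):
    # ASSUMPTION: 1 / sqrt(field length in terms); the article only states
    # that a longer field yields a lower score.
    return 1.0 / math.sqrt(num_terms)

def score(term_freq, num_docs, doc_freq, num_terms, boost=1.0):
    # score = TF * IDF * boost * lengthnorm
    return tf(term_freq) * idf(num_docs, doc_freq) * boost * length_norm(num_terms)

# A rare keyword outranks a common one, all else being equal.
rare = score(term_freq=1, num_docs=1000, doc_freq=10, num_terms=10)
common = score(term_freq=1, num_docs=1000, doc_freq=500, num_terms=10)

# A match in a short attribute field outranks a match in a long body field.
short_field = score(term_freq=1, num_docs=1000, doc_freq=10, num_terms=4)
long_field = score(term_freq=1, num_docs=1000, doc_freq=10, num_terms=400)
```

With these definitions, rare > common and short_field > long_field, which matches the qualitative behavior described in points 2) and 4).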

2. The purpose of metadata retrieval is to find a file (or a batch of files) that meets certain conditions. First, because the number of distinct metadata values (UID, GID, access time, etc.) is limited, a query may return a great many results. For example, a search for files owned by Jack may match a large number of files. Moreover, for metadata a document either matches or does not match; there is no meaningful score order (computing one with the formula above would not be reasonable). Assume that an index is created for 100,000 files and that each attribute filters out 90% of them (for example, there are 10 users in the system and file ownership is evenly distributed). A search on three attributes at once still leaves 100 files in the result, and the user is still at a loss when faced with these 100 files. In addition, the conditions on those three attributes may include range queries (equivalent to specifying multiple values for an attribute), and users usually do not constrain several attributes in a single search. In my personal opinion, the most common search is simply find searchdir -name xxx.

Another major problem lies in word segmentation. The standard word segmentation mechanism clearly cannot meet the requirements of metadata search. The key points are the segmentation of file path names and the segmentation of time and size information (size values can overlap with time values, although they can be converted, for example by writing the actual size followed by K, M, or G as the unit).

For range queries, Lucene by default determines the range by string comparison. This makes range queries on file-size values hard to support, yet for the size attribute range queries are indispensable.
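To see why string comparison breaks numeric range queries, and one common workaround, consider the following sketch. It is plain Python rather than Lucene code, and the helper names and the padding width of 12 are assumptions chosen for illustration. The idea is to store each non-negative number as a fixed-width, zero-padded string so that string order and numeric order agree.

```python
def in_range_lexicographic(value, low, high):
    # Compares the decimal strings, which is how a string-based range behaves.
    return str(low) <= str(value) <= str(high)

def pad(n, width=12):
    # Fixed-width, zero-padded form; only valid for non-negative integers
    # that fit in `width` digits (hypothetical width, pick one large enough).
    return str(n).zfill(width)

def in_range_padded(value, low, high, width=12):
    # String comparison on padded values agrees with numeric comparison.
    return pad(low, width) <= pad(value, width) <= pad(high, width)

# File sizes in bytes, looking for sizes between 10 and 200.
wrongly_excluded = in_range_lexicographic(50, 10, 200)    # "50" sorts after "200"
wrongly_included = in_range_lexicographic(1500, 10, 200)  # "1500" sorts before "200"
fixed_50 = in_range_padded(50, 10, 200)
fixed_1500 = in_range_padded(1500, 10, 200)
```

Here the unpadded comparison rejects 50 and accepts 1500, while the padded comparison gives the numerically correct answers. The cost is that every value must be padded in the same way both when it is indexed and when the query bounds are built.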


After all this analysis, I still cannot determine whether Lucene can support metadata search well. Only time will tell. What is certain is that, for Lucene to handle metadata search well, improvements are needed in the following areas:

1. Word segmentation: analyze the characteristics of metadata and segment it appropriately.

2. Range queries: mainly support for time ranges and size ranges.
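As a rough illustration of what metadata-aware segmentation could look like, the sketch below splits a file path into searchable tokens and turns a timestamp into a fixed-width string that sorts chronologically. It is a hypothetical sketch in plain Python, not a Lucene analyzer; the function names, the choice of tokens, and the UTC timestamp format are all assumptions.

```python
import os
from datetime import datetime, timezone

def path_tokens(path):
    # Split a POSIX-style path into its components, then add the base name
    # without its extension and the extension itself as extra tokens.
    parts = [p for p in path.split("/") if p]
    tokens = list(parts)
    if parts:
        stem, ext = os.path.splitext(parts[-1])
        if ext:
            tokens.append(stem)
            tokens.append(ext[1:])
    return tokens

def sortable_time(epoch_seconds):
    # Fixed-width UTC timestamp (YYYYMMDDHHMMSS); for four-digit years,
    # string order equals chronological order, so string ranges work.
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")

tokens = path_tokens("/home/jack/src/main.c")
# tokens == ["home", "jack", "src", "main.c", "main", "c"]

earlier = sortable_time(0)       # "19700101000000"
later = sortable_time(86400)     # "19700102000000"
```

With tokens like these, a search for "jack" or "main" would match the file through its path, and a time range can be expressed as a string range between two formatted timestamps.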
