Basic rules and algorithms you must know to read the Lucene source code

Last Update:2017-04-25 Source: Internet

Author: User


When I wrote compositions in high school, winter was my favorite season to write about. That was partly because few people wrote about winter, but it is also true that I had no special preference for the other seasons, and in winter my skin turns especially fair. But in winter all there is to see are evergreen potted plants: the Malabar chestnut (also called the "rich tree"; fine, I am being sentimental, but I do not like that name), pothos, Brazilian iron plants that never flower, lucky bamboo, areca palm ... At Chinese New Year the azaleas at home bloom, but a few scattered small flowers only make the season look bleaker. The anthurium and the butterfly orchid always look pretty where they stand, yet show no vitality. Forget-me-nots kept in water only remind me that they will wither in a few days.

Spring comes: first winter jasmine, then peach blossom and magnolia. In April come red-leaf peach, bauhinia, cherry blossom, purple-leaf plum, begonia ... I like the scent of lilac best. In a few days the tulips and peonies should open too. "The peach tree is young and elegant; brilliant are its flowers." Sure enough, these flowers overflow with color in the sun.

The sorrow of life is not endless pain after brief happiness, but never having been happy at all. It reminds me of the film I watched with my little one, "Bear Haunt: Snow Ridge Bear Wind". Before meeting Dumpling again, Kumaji was listless; after their wonderful adventure, even though nobody else remembers it and everything returns to how it was, Kumaji's heart is content and calm. It is like these flowers: their bloom is short, but a youth in full bloom is better than the holly, whose days all look the same (and in my middle school compositions I always praised it for staying green in winter). Perhaps the biggest difference between now and middle school is that my outlook on life back then was more influenced by my parents. My parents are doctors; a secure job (the "rice bowl") and stability were their constant pursuit.

Having moved further and further from my parents and living more and more as myself, I found that my own life needs the anticipation and reflection of winter, the enchanting flowers of spring, the lush leaves of summer, and the heavy fruit of autumn. Who says the first season must be spring? The first season of my life is not

Here are some of the basic rules and algorithms that Lucene uses. They were chosen because Lucene is built around an inverted index that must be able to handle terabytes of data.

Prefix-suffix rule (prefix + suffix): Lucene's inverted index has to store the term dictionary, and all the terms in it are kept in dictionary (lexicographic) order. The dictionary contains almost every word that occurs in the documents, and some of those words are very long, so the index file would become very large. Under the prefix-suffix rule, when a term shares a prefix with the term before it, only the length of the shared prefix (an offset into the previous term) and the remaining part (the suffix) are stored.

For example, for the phrase "Beijing Tiananmen" the dictionary usually contains three terms: "Beijing", "Tiananmen", and "Beijing Tiananmen". Because "Beijing" and "Beijing Tiananmen" share a prefix, they are stored next to each other in the dictionary, and the second term is written as a prefix length plus the suffix "Tiananmen". In the original Chinese, "Beijing" is two characters long, so the pair is stored as "Beijing" followed by "2 Tiananmen", which takes less space than storing "Beijing Tiananmen" in full.
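The rule above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Lucene's actual file format: the class `PrefixSuffixCoding` and its `Entry` type are hypothetical names, and the prefix length is counted in Java `char` units, whereas a real index format may count bytes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of prefix-suffix coding over a sorted term list. */
final class PrefixSuffixCoding {

    /** One dictionary entry: length of the prefix shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }

        @Override
        public String toString() {
            return prefixLength + "|" + suffix;
        }
    }

    /** Encodes terms that are already sorted in dictionary order. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            // Count how many leading chars this term shares with the previous one.
            int shared = 0;
            int limit = Math.min(previous.length(), term.length());
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    /** Rebuilds the full terms by reusing the prefix of the previously decoded term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (Entry e : entries) {
            String term = previous.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            previous = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("beijing", "beijingtiananmen", "tiananmen");
        List<Entry> encoded = encode(terms);
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

With the romanized terms used here, the second entry is stored as the prefix length 7 plus the suffix "tiananmen", because "beijing" is seven letters long; the third term shares nothing with its predecessor and is stored whole.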

Difference rule (delta): Lucene's inverted index has to store a great many integers, such as document IDs and the positions of terms within a document. These integers are stored in a variable-length format, so the larger the value, the more bytes it occupies. Under the difference rule, when a sequence of integers is saved, each integer after the first is stored only as its difference from the integer before it. A short digression: I have seen colleagues define database fields as VARCHAR without thinking, even for MD5 results. The length of an MD5 result is fixed, so VARCHAR saves no space, and a fixed-length CHAR is more efficient.
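A minimal sketch of the difference rule combined with a variable-length integer format follows. The class `DeltaVInt` and its method names are hypothetical; the byte layout (seven data bits per byte, with the high bit set when more bytes follow) is the layout assumed for this illustration. The input must be a sorted list of non-negative IDs.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: delta-encode sorted IDs, then write each gap as a variable-length int. */
final class DeltaVInt {

    /** Writes value using 7 data bits per byte; the high bit marks "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Stores the first ID as-is and every later ID as its difference from the previous one. */
    static byte[] encode(int[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedIds) {
            writeVInt(out, id - previous);
            previous = id;
        }
        return out.toByteArray();
    }

    /** Reads count variable-length gaps and adds them up to recover the original IDs. */
    static int[] decode(byte[] bytes, int count) {
        int[] ids = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += value;
            ids[i] = previous;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] ids = {100000, 100001, 100005, 100130};
        byte[] encoded = encode(ids);
        System.out.println(encoded.length + " bytes");
        System.out.println(java.util.Arrays.toString(decode(encoded, ids.length)));
    }
}
```

For the four IDs in `main`, the first value needs three bytes and each of the gaps 1, 4, and 125 fits in one byte, so the list takes 6 bytes instead of the 16 bytes that four fixed-width 32-bit integers would need.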

LZ4 algorithm (real-time compression algorithm): LZ4 shows up in operating systems (Linux/FreeBSD), file systems (OpenZFS), big data (Hadoop), search engines (Lucene/Solr), databases (HBase), and elsewhere; it is very widely used. Its strength is fast compression and decompression.

Skip list rule (skip list): A skip list is a data structure. Well, I had better be able to explain it clearly in a few sentences; otherwise I would be embarrassed to say that I hold so many algorithm patents. Skip lists are worth using here in the first place because a search engine's index data is highly ordered.

An analogy: to go home from Beijing to Qingzhou, I can take either an ordinary train or a high-speed train from Beijing South to Qingdao. Both follow the same route, but the latter costs 100 more. What does the extra money buy? The high-speed train stops at fewer stations; it simply skips stops. Some high-speed trains do not stop at Qingzhou at all, so I have to get off one station earlier at Zibo or one station later at Weifang, and then take a slow train to Qingzhou.

A skip list works on the same principle. All the searchable data sits in one linked list, which is the slow train (the traditional green-painted train). Then a second linked list is added that stores only elements taken at intervals (the K-series train). At this point I have to state a principle: for any algorithm whose original time complexity is delta n, the expected end result of optimization is basically delta log n; in the usual big-O notation these are O(n) and O(log n). (I found the symbol too much trouble to type, so I spell it out in English. The name is easy to remember: whenever I go to the United States I try to avoid dealing with the airline of that name.) So with only two levels, the time complexity still falls short. How can the requirement be met? Eventually a tree has to form. How? Just add more levels: increase the skip interval with T-series, D-series, and G-series trains. At the top, the stations where every train stops form the root, and a tree-shaped structure results. The time complexity becomes delta log n.

Lucene 3.0 uses this data structure in many places to speed up search. However, because it does not support fuzzy queries well, Lucene has since switched to FST.
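The train analogy can be sketched as a multi-level skip search over a sorted array. This is a simplified, hypothetical model (the class `SkipSearch` and the method `advance` are illustrative names): instead of building linked lists, it treats every `skipInterval`-th element as a stop on the next faster level and computes the stops by index arithmetic.

```java
/** Hypothetical sketch of a deterministic multi-level skip search over sorted data. */
final class SkipSearch {

    /**
     * Returns the index of the first element that is >= target,
     * or postings.length if every element is smaller.
     */
    static int advance(int[] postings, int skipInterval, int target) {
        if (skipInterval < 2) {
            throw new IllegalArgumentException("skipInterval must be at least 2");
        }
        if (postings.length == 0) {
            return 0;
        }
        // Find the stride of the fastest "train" that still has a stop inside the array.
        int stride = 1;
        while ((long) stride * skipInterval < postings.length) {
            stride *= skipInterval;
        }
        int pos = 0;
        // Start on the fastest level and change to slower levels as the target gets close.
        for (; stride >= 1; stride /= skipInterval) {
            // Keep riding while the next stop on this level is still before the target.
            while (pos + stride < postings.length && postings[pos + stride] < target) {
                pos += stride;
            }
        }
        // pos now holds the last element below the target, unless nothing is below it.
        return postings[pos] < target ? pos + 1 : pos;
    }

    public static void main(String[] args) {
        int[] postings = {2, 5, 8, 13, 21, 34, 55, 89, 144, 233};
        System.out.println(advance(postings, 3, 50));
    }
}
```

Each level needs at most about `skipInterval` comparisons before dropping to the next slower level, and the number of levels grows with the logarithm of the list length, which is where the logarithmic search cost comes from.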

One more word about delta: it is the fourth letter of the Greek alphabet, and its capital form is Δ. I was too lazy to copy the lowercase letter (δ) here; even the capital I got only by switching to a Japanese input method and typing "triangle". Delta is used above for both the difference rule and time complexity. Uppercase Δ denotes an increment in mathematics and physics, while lowercase δ usually denotes a variable or symbol in higher mathematics. So the difference rule uses the meaning of the capital letter, and the time complexity discussion uses the lowercase one. (Standard texts write time complexity with big O or with theta, Θ, rather than delta.)

Finite automaton algorithm (FST, finite state transducer): builds a minimal directed acyclic graph from input strings supplied in sorted order. Space is saved by sharing prefixes: memory holds the prefix index, and the disk holds the blocks of term suffixes. The details of Lucene's implementation can be found in its source code.

Lucene is upgraded frequently, and I follow these upgrades closely, almost as a hobby, because each one reflects problems that arose and how they were solved. For example, in my experience a folder on Windows can hold only a little over 20,000 files, and write speed drops sharply once it holds more than 10,000. A system like Lucene that processes terabytes of data has to consider the relationship and trade-off between data volume and performance.

The finite automaton is Lucene's core search algorithm, and it takes some time to understand. Lucene's scoring rules, described next, are easier to follow.

Document weight (document boost): the weight value set for a document at index time.

Field weight (field boost): the weight value set for a field at query time.

Coordination factor (coord): an adjustment factor calculated from the number of query terms that a document contains. In general, the more query terms a document matches compared with other documents, the larger the value.

Inverse document frequency (idf): a term-based factor whose purpose is to tell the scoring formula how rare a term is. The lower the value, the rarer the term (the value here means the plain document frequency, that is, the number of documents in which the term appears, not the result of Lucene's IDF formula). The scoring formula uses this factor to increase the weight of documents that contain rare terms.

Length normalization (norm): a field-based normalization factor. Its value is determined by the number of terms in the given field (it is calculated when the document is indexed and stored in the index). The longer the text, the smaller the factor. This means the Lucene scoring formula favors documents whose matching field contains fewer terms.

Term frequency (tf): a term-based factor that describes the number of times a given term occurs in a document. The higher the term frequency, the higher the document's score.

Query normalization factor (queryNorm): a normalization factor based on the query. Its value is derived from the sum of the squared weights of the terms in the query. The query normalization factor is meant to make scores from different queries comparable, although comparing scores across queries is not always easy or even feasible.
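To see how the factors above fit together, here is a rough sketch modelled on the default formulas of Lucene's classic TF-IDF similarity as I understand them; treat every formula as an assumption of this example, since the defaults differ between Lucene versions. The class `ClassicScoreSketch`, its methods, and the single `boost` parameter (standing in for the document and field weights) are hypothetical and are not Lucene API.

```java
/** Hypothetical sketch of a classic TF-IDF style score; not Lucene's API. */
final class ClassicScoreSketch {

    /** Term frequency factor: grows with the number of occurrences, but sub-linearly. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: larger for terms that appear in fewer documents. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Length normalization: smaller for fields that contain more terms. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Coordination factor: fraction of the query terms that the document matches. */
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    /**
     * freqs[i] is how often query term i occurs in the document's field (0 if absent);
     * docFreqs[i] is the number of documents that contain query term i.
     */
    static double score(int[] freqs, int[] docFreqs, int numDocs, int fieldLength, double boost) {
        // Query normalization: 1 / sqrt(sum of squared term weights), here the idf values.
        double sumOfSquaredWeights = 0.0;
        for (int docFreq : docFreqs) {
            double weight = idf(docFreq, numDocs);
            sumOfSquaredWeights += weight * weight;
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);

        double norm = boost * lengthNorm(fieldLength);
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] == 0) {
                continue; // this query term does not occur in the document
            }
            matched++;
            double termIdf = idf(docFreqs[i], numDocs);
            sum += tf(freqs[i]) * termIdf * termIdf * norm;
        }
        return coord(matched, freqs.length) * queryNorm * sum;
    }

    public static void main(String[] args) {
        // Two-term query against a 10-term field in a 100-document index.
        System.out.println(score(new int[] {1, 1}, new int[] {5, 5}, 100, 10, 1.0));
        System.out.println(score(new int[] {1, 0}, new int[] {5, 5}, 100, 10, 1.0));
    }
}
```

In this sketch, a document that matches both query terms scores higher than one that matches only one of them, a rarer term contributes more through `idf`, and a longer field lowers the score through `lengthNorm`, which mirrors the behavior described for each factor.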


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
