Real-time search engine Zoie implemented by LinkedIn

I. Overall Architecture

Zoie is a real-time search engine system that LinkedIn implemented on top of Lucene. Its official wiki describes it as follows:

http://snaprojects.jira.com/wiki/display/ZOIE/Overview

Zoie is a realtime indexing and search system, and as such needs to have relatively close coupling between the logically distinct Indexing and Searching subsystems: as soon as a document is made available to be indexed, it must be immediately searchable.

The ZoieSystem is the primary component of Zoie, that incorporates both Indexing (via implementing DataConsumer<V>) and Search (via implementing IndexReaderFactory<ZoieIndexReader<R extends IndexReader>>).

In other words, Zoie is a real-time indexing and search system, so its logically independent indexing and search subsystems must work closely together: once a document is submitted for indexing, it must be searchable immediately.

ZoieSystem is the core component of Zoie. It provides indexing by implementing the DataConsumer interface and search by implementing IndexReaderFactory<ZoieIndexReader<R extends IndexReader>>, and it ties the two closely together.

The following figure shows the overall architecture of ZoieSystem:

  • For the indexing system, ZoieSystem is a DataConsumer, that is, a consumer. Its consume function consumes DataEvent objects to perform indexing.
  • A consumer needs a producer to feed it data, namely a DataProvider. To build a real-time search system with Zoie, you must supply your own producer.
  • For the search system, ZoieSystem is an IndexReaderFactory, that is, a factory of IndexReader objects that can read the index. Its getIndexReaders function returns the list of all IndexReader objects, through which the index data can be read.
  • Readers familiar with Lucene know that searching a Lucene index means first obtaining an IndexReader and then building an IndexSearcher on top of it to search, collect results, score, and sort. Zoie's factory only hands you the IndexReader objects, so you still need to implement your own search logic.
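
The two roles described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Zoie's code: Consumer and ReaderFactory are simplified, hypothetical stand-ins for DataConsumer and IndexReaderFactory, and ToyReader, ToySystem, and the string "documents" are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class ArchitectureSketch {

    // Hypothetical, simplified stand-ins for Zoie's two interfaces.
    interface Consumer<V> {
        void consume(Collection<V> events);
    }

    interface ReaderFactory<R> {
        List<R> getIndexReaders();
    }

    // Toy reader: an immutable snapshot of the documents indexed so far.
    static final class ToyReader {
        private final List<String> docs;

        ToyReader(List<String> docs) {
            this.docs = new ArrayList<>(docs);
        }

        boolean matches(String term) {
            for (String doc : docs) {
                if (doc.contains(term)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Plays both roles at once, as the text says ZoieSystem does.
    static final class ToySystem implements Consumer<String>, ReaderFactory<ToyReader> {
        private final List<String> indexed = new ArrayList<>();

        @Override
        public synchronized void consume(Collection<String> events) {
            indexed.addAll(events);
        }

        @Override
        public synchronized List<ToyReader> getIndexReaders() {
            return Collections.singletonList(new ToyReader(indexed));
        }
    }

    // The caller's own search logic, written against the factory only.
    static boolean search(ReaderFactory<ToyReader> factory, String term) {
        for (ToyReader reader : factory.getIndexReaders()) {
            if (reader.matches(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToySystem system = new ToySystem();
        system.consume(Arrays.asList("hello zoie"));  // the producer pushes data in
        System.out.println(search(system, "zoie"));    // the search side pulls readers out
    }
}
```

The point of the split is that the producer only ever sees the consumer side and the search code only ever sees the factory side, even though one object implements both.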

II. Configuring a ZoieSystem

ZoieSystem can be configured using Spring. A typical configuration is as follows:

<!-- An instance of a DataProvider:
     FileDataProvider recurses through a given directory and provides the DataConsumer
     with indexing requests built from the gathered files.
     In this example, the provider needs to be started manually, which is done via JMX.
-->
<bean id="dataprovider" class="proj.zoie.impl.indexing.FileDataProvider">
  <constructor-arg value="file:${source.directory}"/>
  <property name="dataConsumer" ref="indexingSystem"/>
</bean>

<!-- An instance of an IndexableInterpreter:
     FileIndexableInterpreter converts a text file into a Lucene document, for example
     purposes only.

     As described above, the DataProvider, as the producer, produces DataEvent objects for the
     consumer DataConsumer. Zoie is ultimately built on Lucene, however, and Lucene cannot index
     DataEvent objects directly. Something must convert each DataEvent into a Lucene Document
     and decide, according to application requirements, which fields to add and of what type.
     This is the job of the interpreter.
-->
<bean id="fileInterpreter" class="proj.zoie.impl.indexing.FileIndexableInterpreter"/>

<!-- A decorator for an IndexReader instance:
     The default decorator is just a pass-through: the input IndexReader is returned unchanged.

     An important design pattern is used here. The wrapped IndexReader is the one that directly
     opens the Lucene index. After obtaining it, the IndexReaderFactory wraps it and returns the
     wrapper to the user. Lucene loads and initializes some basic data when an IndexReader is
     opened, but sometimes you also need to load data of your own at that moment, and the
     decorator gives users the chance to do so. For example, Bobo, which comes from the same
     project as Zoie and implements faceted search (familiar to those who have used Solr),
     provides BoboIndexReaderDecorator. It loads the facet information into memory as a data
     structure when the IndexReader is opened, so that facets can be collected quickly.
-->
<bean id="idxDecorator" class="proj.zoie.impl.indexing.DefaultIndexReaderDecorator"/>

<!-- A zoie system declaration, passed as a DataConsumer to the DataProvider declared above -->
<bean id="indexingSystem" class="proj.zoie.impl.indexing.ZoieSystem" init-method="start" destroy-method="shutdown">
  <!-- Disk index directory -->
  <constructor-arg index="0" value="file:${index.directory}"/>
  <!-- Sets the interpreter -->
  <constructor-arg index="1" ref="fileInterpreter"/>
  <!-- Sets the decorator -->
  <constructor-arg index="2">
    <ref bean="idxDecorator"/>
  </constructor-arg>
  <!-- Sets the Analyzer; if null is passed, Lucene's StandardAnalyzer is used -->
  <constructor-arg index="3">
    <null/>
  </constructor-arg>
  <!-- Sets the Similarity; if null is passed, Lucene's DefaultSimilarity is used -->
  <constructor-arg index="4">
    <null/>
  </constructor-arg>
  <!-- The following two parameters indicate how often batched indexing is triggered;
       whichever of the two events happens first triggers indexing -->
  <!-- Batch size: how many items to put on the queue before indexing is triggered -->
  <constructor-arg index="5" value="1000"/>
  <!-- Batch delay: how long to wait before indexing is triggered -->
  <constructor-arg index="6" value="300000"/>
  <!-- Flag turning real-time indexing on or off -->
  <constructor-arg index="7" value="true"/>
</bean>

 

<!-- A search service -->
<bean id="mySearchService" class="com.mycompany.search.SearchService">
  <!-- IndexReader factory that produces index readers to build Searchers from:
       ZoieSystem, acting as the IndexReaderFactory, provides the search service with the
       list of IndexReader objects from which it can construct a Searcher -->
  <constructor-arg ref="indexingSystem"/>
</bean>

 

Having seen the ZoieSystem configuration, let's look at how the ZoieSystem constructor uses these parameters during initialization:

(1) It creates a DefaultDirectoryManager _dirMgr from the index folder ${index.directory}, which manages the index folder and the index version number, IndexSignature.

(2) It creates a SearchIndexManager _searchIdxMgr, a key class for real-time search, with the following member variables:

  • The DefaultDirectoryManager created in step (1).
  • IndexReaderDecorator _indexReaderDecorator.
  • DefaultDocIDMapperFactory _docIDMapperFactory maintains the correspondence between Zoie's document ID and Lucene's document ID.
  • DiskSearchIndex _diskIndex operates on the index on the hard disk. At this point an IndexReader pointing to the hard disk index is obtained.
  • Status _diskIndexerStatus is the current status of the index. There are two statuses, Sleeping and Working. Sleeping means that newly added documents only go into the memory index. Working means that one of the memory indexes is being merged with the index on the hard disk. The next section on the real-time mechanism discusses this in detail.
  • Mem _mem is the structure that makes real-time indexing possible with two memory indexes and one hard disk index. The detailed mechanism is discussed in the next section. The Mem structure includes the following parts:
    • RAMSearchIndex<R> _memIndexA operates on memory index A.
    • RAMSearchIndex<R> _memIndexB operates on memory index B.
    • RAMSearchIndex<R> _currentWritable is the memory index that new documents are added to. Depending on the index status, it is sometimes A and sometimes B.
    • RAMSearchIndex<R> _currentReadOnly is the opposite: the memory index that no longer receives new documents. As the discussion below shows, this is the memory index being merged with the index on the hard disk.
    • ZoieIndexReader<R> _diskIndexReader is the IndexReader of the hard disk index.

(3) It assigns the parameter values to the member variables ZoieIndexableInterpreter _interpreter, Analyzer _analyzer, and Similarity _similarity.

(4) It creates a DiskLuceneIndexDataLoader _diskLoader object, which indexes into the hard disk index.

(5) If real-time indexing (_realtimeIndexing) is set to true, it creates a RealtimeIndexDataLoader _rtdc, which holds the _diskLoader from step (4) as a member variable, and installs it as the consumer of AsyncDataConsumer, the parent class of ZoieSystem, by calling setDataConsumer(_rtdc).

III. How Zoie implements real-time search

3.1 Using two memory indexes and one hard disk index to implement real-time search

(1) When the system starts, the index is in Sleeping state. The Mem structure contains only index A: index B is null, index A is _currentWritable, _currentReadOnly is null, and _diskIndexReader is the IndexReader of the hard disk index. The IndexReader of the memory index is reopened immediately after each document is added, which is fast, whereas the IndexReader of the hard disk index, once opened, is used until the next merge. Newly added documents can therefore be searched immediately.

(2) When the number of documents in A reaches a certain level, A must be merged into the hard disk index, so the index enters Working state. Merging is a relatively long process, so memory index B is created and all newly added documents are indexed into B. The Mem structure now contains memory index A and memory index B: index A is _currentReadOnly, index B is _currentWritable, and _diskIndexReader is the IndexReader of the hard disk index. A request for ZoieSystem's IndexReader objects returns all three. Because the IndexReader of index B is reopened immediately after each document is added, newly added documents can be searched immediately. And although index A is being merged into the hard disk index, the data in A is not returned twice, because the IndexReader of the hard disk index has not been reopened yet.

(3) After the data in index A has been fully merged into the hard disk index, the IndexReader of the hard disk index must be reopened. Once it is open, a new Mem structure is created: the original index B becomes index A and is _currentWritable, the original index A is discarded, index B and _currentReadOnly are set to null, and _diskIndexReader is the newly opened IndexReader of the hard disk index. The old Mem structure is then seamlessly swapped for the new one, and the index returns to Sleeping state.
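
The three steps above can be sketched as a small state machine. This is a toy model under stated assumptions rather than Zoie's implementation: ToyIndex, beginMerge, finishMerge, and hits are hypothetical names, an "index" is just a list of strings, and the merge is simulated by copying lists.

```java
import java.util.ArrayList;
import java.util.List;

final class MemSwapSketch {

    enum Status { SLEEPING, WORKING }

    // Toy index: just the list of documents it holds.
    static final class ToyIndex {
        final List<String> docs = new ArrayList<>();
    }

    private ToyIndex memA = new ToyIndex();        // exists from startup
    private ToyIndex memB = null;                  // exists only during a merge
    private ToyIndex writable = memA;              // receives new documents
    private ToyIndex readOnly = null;              // being merged into the disk index
    private ToyIndex diskReader = new ToyIndex();  // snapshot seen by the disk reader
    private Status status = Status.SLEEPING;

    synchronized void add(String doc) {
        writable.docs.add(doc);
    }

    // Sleeping -> Working: freeze A and create B for new documents.
    synchronized void beginMerge() {
        if (status != Status.SLEEPING) {
            throw new IllegalStateException("merge already running");
        }
        memB = new ToyIndex();
        readOnly = memA;
        writable = memB;
        status = Status.WORKING;
    }

    // Working -> Sleeping: reopen the disk reader, promote B to A, drop the old A.
    synchronized void finishMerge() {
        if (status != Status.WORKING) {
            throw new IllegalStateException("no merge running");
        }
        ToyIndex reopened = new ToyIndex();
        reopened.docs.addAll(diskReader.docs);
        reopened.docs.addAll(readOnly.docs);
        diskReader = reopened;
        memA = memB;
        memB = null;
        writable = memA;
        readOnly = null;
        status = Status.SLEEPING;
    }

    // Two readers while Sleeping, three while Working.
    synchronized List<ToyIndex> getIndexReaders() {
        List<ToyIndex> readers = new ArrayList<>();
        readers.add(memA);
        if (memB != null) {
            readers.add(memB);
        }
        readers.add(diskReader);
        return readers;
    }

    // How many times a document is visible across all readers.
    synchronized int hits(String doc) {
        int count = 0;
        for (ToyIndex reader : getIndexReaders()) {
            for (String d : reader.docs) {
                if (d.equals(doc)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        MemSwapSketch index = new MemSwapSketch();
        index.add("doc1");
        index.beginMerge();
        index.add("doc2");
        System.out.println(index.getIndexReaders().size() + " readers, doc1 hits: " + index.hits("doc1"));
        index.finishMerge();
        System.out.println(index.getIndexReaders().size() + " readers, doc1 hits: " + index.hits("doc1"));
    }
}
```

In this model a document is visible exactly once at every stage, because the disk snapshot only changes at the same moment the merged memory index is dropped.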

 

 

3.2 Updating documents

As the previous section shows, real-time search for newly added documents is fairly simple. Things get more complicated when documents are updated.

How to delete, in real time, documents that have already been indexed on the hard disk is a big problem. To solve it, Zoie implements ZoieSegmentReader:

  • ZoieSegmentReader adds one more wrapping layer around Lucene's IndexReader after it has been decorated by the user-specified decorator. The decorated reader is kept in the member variable _decoratedReader.
  • long[] _uidArray is a mapping from Lucene's document ID to Zoie's document ID: the Lucene document ID is the array index, and the Zoie document ID is the value stored at that index.
  • IntRBTreeSet _delDocIdSet holds the Lucene document IDs that are marked as deleted in this index.
  • In the index, Zoie's document ID is stored as payload information attached to each Lucene document number in the posting list of a special Term("_ID", "_UID"). The fillDocumentID function puts Zoie's document ID into the payload.
  • To delete documents from a ZoieSegmentReader, call the markDeletes function. It converts the Zoie document ID of each document to be deleted into a Lucene document number through DocIDMapper and adds that number to _delDocIdSet.
  • Readers familiar with Lucene know that IndexReader reads posting lists from the index through the TermDocs interface. Zoie implements its own ZoieSegmentTermDocs, which has a DocIdSetIterator member variable that ZoieSegmentReader creates as an iterator over its _delDocIdSet. Whenever the next document number is fetched, the document numbers that appear in this DocIdSetIterator are filtered out. ZoieSegmentTermPositions is implemented likewise for TermPositions.
  • ZoieSegmentReader turns the slow operation of deleting documents from the hard disk index into a fast in-memory marking operation, and the deletion takes effect without reopening the IndexReader. It also keeps updates consistent. An update is a delete plus an add, and the newly added document initially lives only in the memory index, so the delete must also be marked only in memory. Otherwise, if the system crashed, the new version would be lost while the old version had already been deleted, which is hard to handle even with a redo mechanism.
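
The mark-and-filter idea in the list above can be sketched as follows. This is a simplified illustration, not ZoieSegmentReader itself: a HashMap stands in for DocIDMapper, a TreeSet stands in for IntRBTreeSet, a plain int array stands in for a posting list, and liveDocs and uidOf are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class SegmentDeleteSketch {

    private final long[] uidArray;                                  // index: Lucene doc ID, value: Zoie UID
    private final Map<Long, Integer> uidToDocId = new HashMap<>();  // stand-in for DocIDMapper
    private final TreeSet<Integer> delDocIdSet = new TreeSet<>();   // stand-in for IntRBTreeSet

    SegmentDeleteSketch(long[] uidArray) {
        this.uidArray = uidArray.clone();
        for (int docId = 0; docId < this.uidArray.length; docId++) {
            uidToDocId.put(this.uidArray[docId], docId);
        }
    }

    // Mark documents as deleted in memory only; nothing is written to disk here.
    void markDeletes(long... uids) {
        for (long uid : uids) {
            Integer docId = uidToDocId.get(uid);
            if (docId != null) {
                delDocIdSet.add(docId);
            }
        }
    }

    // Filter a posting list of Lucene doc IDs, skipping the marked ones.
    List<Integer> liveDocs(int[] postings) {
        List<Integer> live = new ArrayList<>();
        for (int docId : postings) {
            if (!delDocIdSet.contains(docId)) {
                live.add(docId);
            }
        }
        return live;
    }

    long uidOf(int docId) {
        return uidArray[docId];
    }

    public static void main(String[] args) {
        SegmentDeleteSketch reader = new SegmentDeleteSketch(new long[] {100L, 200L, 300L});
        reader.markDeletes(200L);
        System.out.println(reader.liveDocs(new int[] {0, 1, 2}));  // prints [0, 2]
    }
}
```

Marking costs one map lookup and one set insertion per document, which is why it can take effect immediately without touching the index files.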

With ZoieSegmentReader in place, let's look at how real-time search works when documents are updated.

(1) When the system has just started, it is in Sleeping state. The memory index is empty and the hard disk index contains documents A, B, and C.

(2) When document B is updated in Sleeping state, the new version of B goes into the memory index, and B in the hard disk index is marked as deleted.

 

(3) When the memory index is large enough, the index enters Working state and the merge begins. During the merge, the documents marked as deleted in the hard disk index are deleted first, and then the memory index is merged into the hard disk index. If a new update arrives in the meantime, for example an update to document A, A is marked as deleted in the other memory index and in the hard disk index, and the new version is added to the memory index that is currently writable.

(4) After the merge, the hard disk index takes over the deletion marks recorded in the merged memory index, the merged memory index and its documents marked as deleted are discarded, and the index returns to Sleeping state.
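
The update rule in steps (2) and (3) can be sketched as "mark the old version as deleted everywhere else, then add the new version to the writable memory index". The model below is hypothetical throughout: ToyIndex, update, and search are invented names, documents are plain strings keyed by a UID, and the merge itself is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UpdateFlowSketch {

    // Toy index: UID -> document text, plus the UIDs marked as deleted.
    static final class ToyIndex {
        final Map<Long, String> docs = new HashMap<>();
        final Set<Long> deleted = new HashSet<>();

        void markDeleted(long uid) {
            if (docs.containsKey(uid)) {
                deleted.add(uid);
            }
        }

        void collect(long uid, List<String> out) {
            if (docs.containsKey(uid) && !deleted.contains(uid)) {
                out.add(docs.get(uid));
            }
        }
    }

    final ToyIndex writable = new ToyIndex();  // memory index receiving new documents
    final ToyIndex readOnly = new ToyIndex();  // memory index being merged (empty while Sleeping)
    final ToyIndex disk = new ToyIndex();      // hard disk index

    // An update is a delete plus an add; both happen in memory only.
    void update(long uid, String text) {
        readOnly.markDeleted(uid);
        disk.markDeleted(uid);
        writable.docs.put(uid, text);  // replaces any older version in the writable index
    }

    // Collect every visible version of a document across the three indexes.
    List<String> search(long uid) {
        List<String> hits = new ArrayList<>();
        writable.collect(uid, hits);
        readOnly.collect(uid, hits);
        disk.collect(uid, hits);
        return hits;
    }

    public static void main(String[] args) {
        UpdateFlowSketch index = new UpdateFlowSketch();
        index.disk.docs.put(1L, "A v1");
        index.disk.docs.put(2L, "B v1");
        index.disk.docs.put(3L, "C v1");
        index.update(2L, "B v2");
        System.out.println(index.search(2L));  // prints [B v2]
    }
}
```

Because the old version is only hidden, not removed, a crash before the merge loses neither version on disk, which is the consistency argument made above.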

 

IV. Zoie's indexing process

4.1 Adding documents to the memory index

(1) Zoie's indexing process starts when the DataProvider calls the consume function of ZoieSystem. This is actually the consume(Collection<DataEvent<V>> data) function of AsyncDataConsumer, which only places the DataEvent objects in the list LinkedList<DataEvent<V>> _batch.

(2) AsyncDataConsumer has a background thread, ConsumerThread _consumerThread, which calls _consumer.consume(currentBatch). As step (5) of the ZoieSystem constructor shows, the _consumer here is the RealtimeIndexDataLoader _rtdc.

(3) The RealtimeIndexDataLoader.consume function involves the following steps:

  • Call the convertAndInterpret function of _interpreter to convert every DataEvent into a ZoieIndexable and put the results in the list ArrayList<DataEvent<ZoieIndexable>> indexableList. ZoieIndexable wraps a Lucene Document.
  • When the RealtimeIndexDataLoader is created, in addition to storing the DiskLuceneIndexDataLoader passed in as the member variable _luceneDataLoader, it also creates the member variable RAMLuceneIndexDataLoader _ramConsumer, which indexes into the memory index. After the preceding step, _ramConsumer.consume(indexableList) is called to index the ZoieIndexable objects into memory.

(4) The consume function of RAMLuceneIndexDataLoader calls the consume function of LuceneIndexDataLoader, which includes the following steps:

  • Obtain the RAMSearchIndex idx.
  • Zoie treats every incoming document as an update: it puts the document ID into LongOpenHashSet delSet and the IndexingReq that wraps the Lucene Document into List<IndexingReq> docList.
  • For each document, use ZoieSegmentReader.fillDocumentID(doc, uid) to add the Zoie document ID to the payload.
  • Update the memory index with idx.updateIndex(delSet, docList, _analyzer, _similarity). Deletions are performed with an IndexReader and additions with an IndexWriter.
  • Besides deleting from this memory index, the documents to be deleted must also be filtered out of the other memory index and the hard disk index. This is done by calling the propagateDeletes(LongSet delDocs) function of RAMLuceneIndexDataLoader:
    • First, get the other memory index, which is read-only and is being merged with the hard disk index: RAMSearchIndex<R> readOnlyMemoryIdx = _idxMgr.getCurrentReadOnlyMemoryIndex()
    • Mark the deletions in the read-only memory index so that they are filtered out during search: readOnlyMemoryIdx.markDeletes(delDocs)
    • Then obtain the hard disk index: DiskSearchIndex<R> diskIdx = _idxMgr.getDiskIndex()
    • Mark the deletions in the hard disk index so that they are filtered out during search: diskIdx.markDeletes(delDocs)
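
The hand-off described in steps (1) and (2) above, where consume only enqueues and a background thread does the actual work, can be sketched like this. It is a minimal model and not AsyncDataConsumer's real code: the Consumer interface, drainLoop, and shutdown are hypothetical names, and events are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;

final class AsyncConsumerSketch {

    // Hypothetical, simplified consumer interface.
    interface Consumer<V> {
        void consume(Collection<V> batch);
    }

    private final LinkedList<String> batch = new LinkedList<>();  // pending events
    private final Consumer<String> delegate;                      // does the real indexing
    private final Thread worker;
    private boolean stopped = false;                              // guarded by the batch lock

    AsyncConsumerSketch(Consumer<String> delegate) {
        this.delegate = delegate;
        this.worker = new Thread(this::drainLoop, "consumer-thread");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Called by the producer: only enqueues, never indexes.
    void consume(Collection<String> events) {
        synchronized (batch) {
            batch.addAll(events);
            batch.notifyAll();
        }
    }

    // Background loop: take everything queued so far and pass it on.
    private void drainLoop() {
        while (true) {
            List<String> current;
            synchronized (batch) {
                while (batch.isEmpty() && !stopped) {
                    try {
                        batch.wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                if (batch.isEmpty()) {
                    return;  // stopped and fully drained
                }
                current = new ArrayList<>(batch);
                batch.clear();
            }
            delegate.consume(current);  // slow work happens outside the lock
        }
    }

    // Let the worker finish what is queued, then wait for it to exit.
    void shutdown() {
        synchronized (batch) {
            stopped = true;
            batch.notifyAll();
        }
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AsyncConsumerSketch consumer =
                new AsyncConsumerSketch(events -> System.out.println("indexing " + events));
        consumer.consume(Arrays.asList("doc1", "doc2"));
        consumer.shutdown();
    }
}
```

Keeping the delegate call outside the lock means a slow indexing step never blocks the producer, which only ever pays for an append to the list.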

4.2 Merging memory indexes into the hard disk index

The parent class of RealtimeIndexDataLoader is BatchedIndexDataLoader. It has a thread named LoaderThread that calls the processBatch function.

The processBatch function of RealtimeIndexDataLoader proceeds as follows:

(1) When the number of documents in the memory index exceeds the configured batch size, or the elapsed time exceeds the configured _delay, the memory index is merged into the hard disk index.

(2) Set the index status from Sleeping to Working with _idxMgr.setDiskIndexerStatus(SearchIndexManager.Status.Working):

  • This rebuilds the Mem<R> _mem structure.
  • memIndexA, which received new documents in Sleeping state, becomes _currentReadOnly.
  • memIndexB is created to receive new documents in Working state and becomes _currentWritable.
  • During the merge, the IndexReader of the hard disk index is still the old one.
  • The code also shows that memory indexes A and B swap roles: Mem<R> mem = new Mem<R>(memIndexA, memIndexB, memIndexB, memIndexA, oldMem.get_diskIndexReader());

(3) Obtain the memory index to be merged: readOnlyMemIndex = _idxMgr.getCurrentReadOnlyMemoryIndex()

(4) Merge the memory index into the hard disk index with _luceneDataLoader.loadFromIndex(readOnlyMemIndex). The loadFromIndex function of DiskLuceneIndexDataLoader does the following:

  • Obtain DiskSearchIndex<R> idx = getSearchIndex().
  • Call idx.loadFromIndex(ramIndex), which first uses an IndexReader to delete the documents marked as deleted and then calls the addIndexesNoOptimize function of IndexWriter to merge the memory index into the hard disk index.
  • Refresh the IndexReader of the hard disk index with idx.refresh().
  • Call idx.markDeletes(ramIndex.getDelDocs()) so that the hard disk index inherits the documents marked as deleted in the memory index.

(5) Set the index status from Working back to Sleeping with _idxMgr.setDiskIndexerStatus(Status.Sleep):

  • This rebuilds the Mem<R> _mem structure again.
  • The memIndexB used in Working state is assigned to memIndexA and _currentWritable, and memIndexB is set to null. In other words, B becomes A and there is no B any more.
  • Mem<R> mem = new Mem<R>(oldMem.get_memIndexB(), null, oldMem.get_memIndexB(), null, diskIndexReader)
  • lockAndSwapMem seamlessly swaps in the new Mem structure.
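
The trigger in step (1) above, "batch size or delay, whichever comes first", can be sketched as a small helper. The class and method names are hypothetical, the sketch assumes the delay is measured in milliseconds since the last flush, and it uses >= as the threshold test and skips a flush when nothing is pending. None of these details are confirmed by the text.

```java
final class BatchTriggerSketch {

    private final int batchSize;      // e.g. 1000 in the configuration above
    private final long delayMillis;   // e.g. 300000 in the configuration above (unit assumed)
    private int pending = 0;          // documents added since the last flush
    private long lastFlushMillis;     // time of the last flush

    BatchTriggerSketch(int batchSize, long delayMillis, long nowMillis) {
        this.batchSize = batchSize;
        this.delayMillis = delayMillis;
        this.lastFlushMillis = nowMillis;
    }

    void onEvent() {
        pending++;
    }

    // True when there is something to flush and either threshold has been reached.
    boolean shouldFlush(long nowMillis) {
        if (pending == 0) {
            return false;
        }
        boolean sizeReached = pending >= batchSize;
        boolean delayReached = nowMillis - lastFlushMillis >= delayMillis;
        return sizeReached || delayReached;
    }

    // Reset both counters after the memory index has been merged to disk.
    void flushed(long nowMillis) {
        pending = 0;
        lastFlushMillis = nowMillis;
    }

    public static void main(String[] args) {
        BatchTriggerSketch trigger = new BatchTriggerSketch(1000, 300000L, 0L);
        trigger.onEvent();
        System.out.println(trigger.shouldFlush(1000L));    // prints false
        System.out.println(trigger.shouldFlush(300000L));  // prints true
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the rule easy to check by hand.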

V. Zoie's search process

To search with Zoie, call the getIndexReaders() function of ZoieSystem, which in turn calls _searchIdxMgr.getIndexReaders().

The getIndexReaders function of SearchIndexManager collects the IndexReader of RAMSearchIndex<R> memIndexA, the IndexReader of RAMSearchIndex<R> memIndexB, and the IndexReader of the hard disk index. In Sleeping state it returns two IndexReader objects, and in Working state it returns three.
