Lucene in a cluster
Lucene is a highly optimized inverted index search engine. It stores a number of inverted indexes in a custom file format that is highly optimized to ensure that the indexes can be loaded by searchers quickly and searched efficiently. These structures are created so that they are almost completely pre-computed.
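To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. This is purely illustrative and is not Lucene's implementation: the point is that the mapping from term to documents is pre-computed at index time, so a query is a lookup rather than a scan.

```java
import java.util.*;

// Illustrative sketch of an inverted index: term -> sorted set of doc ids.
// This is NOT Lucene's file format, just the underlying idea: the posting
// lists are fully pre-computed when documents are added.
public class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up the documents containing a term; the work was done at index time.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.addDocument(1, "Lucene is an inverted index engine");
        idx.addDocument(2, "An index stored on disk");
        System.out.println(idx.search("index")); // prints [1, 2]
    }
}
```

Lucene's on-disk segment files serve the same role as the `postings` map here, but in a compressed, seek-friendly layout.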
To store the index, Lucene uses an implementation of a Directory interface (not to be confused with anything in java.io). The standard implementation is FSDirectory, which stores the search index on a file system. There are a number of other implementations that can be used, including ones that split the index on the filesystem into smaller chunks, and ones that distribute the index throughout a cluster using map-reduce (see Google). There is additionally a database implementation that stores the index as blocks in a database.
Lucene derives its speed from this index structure, and to work really well it needs to be able to seek efficiently into the blocks of the segments that make up the index. This is trivial where the underlying storage mechanism supports seek, but less trivial where it does not. FSDirectory is based on files and is efficient in this area. If the files are on a local file system, pure seeks can be used. If the index is on a shared file system, there will always be some latency and potentially increased IO traffic. The database implementation is highly dependent on the BLOB implementation in the target database and will nearly always be slower than FSDirectory. Some databases support seekable BLOBs (Oracle), some emulate this behavior (MySQL with emulateLocators=true), and others just don't support it and so are really slow (and I mean really slow).
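The seek requirement can be shown with plain java.io (an illustrative sketch; the file name and offsets are invented). On a seekable store, reading one block from the middle of a large segment file is a single positioned read; a non-seekable store has to stream every byte before the offset, which is exactly the slow BLOB case described above.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Illustrates why seekable storage matters: FSDirectory-style access reads a
// small block from the middle of a segment file with one positioned read.
// The file name, offset, and contents here are invented for the example.
public class SeekDemo {
    public static byte[] readBlock(File file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);             // jump straight to the block
            byte[] block = new byte[length];
            raf.readFully(block);         // read only what is needed
            return block;
        }
    }

    public static void main(String[] args) throws IOException {
        File segment = File.createTempFile("segment", ".dat");
        segment.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(segment)) {
            out.write(new byte[4096]);    // pretend this is other index data
            out.write("POSTINGS".getBytes(StandardCharsets.UTF_8));
        }
        // One seek + one short read, instead of streaming 4096 leading bytes.
        byte[] block = readBlock(segment, 4096, 8);
        System.out.println(new String(block, StandardCharsets.UTF_8)); // prints POSTINGS
    }
}
```

A BLOB store without seek support must effectively replace `raf.seek(offset)` with reading and discarding `offset` bytes over the network, which is where the "really slow" cases come from.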
All of this impacts how Lucene works in a cluster. Each node performing the search needs access to the index, so to make search work in a clustered environment we must provide it. There are five ways of doing this.
- Use a shared file system between all nodes, and use FSDirectory.
- Use indexes on each node's local file system with a synchronization strategy.
- Use a database via JdbcDirectory.
- Use a distributed file system (e.g. Google File System, Nutch Distributed File System).
- Use a local cache with a backup in the database.
Shared filesystem
There are a number of issues with a shared file system. Performance is lower than a local file system (obviously) unless a SAN is used, but a SAN shared file system must be a true SAN file system (e.g. RedHat Global File System, Apple Xsan), as modifications to the file system blocks must be mirrored instantly in the block cache of all connected nodes; otherwise they will see a corrupted file system. Remember that a SAN is just a networked block device which, without additional help, cannot be shared by multiple compute nodes at the same time. Provided the performance of the shared file system is sufficient, Lucene works well like this with no modifications, using the FSDirectory implementation. The lock manager implemented in the Sakai search component eliminates the problems with locks reported by the Lucene community.
This mechanism is available now in Sakai search.
Synchronized local indexes.
Where the cluster uses a shared-nothing architecture, the Lucene indexes can be written to local disk and synchronized at the end of each index cycle. This is an optimal deployment of Lucene in a cluster, as it ensures that all the IO is from the local disk and is hence fast. To ensure that there is always a backup copy of the index, the synchronization would also target a backup location.
The difficulty with this approach is that, without support in the implementation of the search engine, it requires some deployment support. This may include making hard-link mirrors to speed up the synchronization process. Lucene indexes are well suited to synchronizing with rsync, which is a block-based synchronization mechanism.
The main drawback of this approach is that the full index is present on the local machine. In large search environments this duplication would be wasteful; however, in search-engine terms, a single deployment of Sakai will probably never get into the large space (large > 100M documents, 2TB index).
This mechanism is available, but requires local configuration
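The hard-link mirror mentioned above can be sketched with java.nio (an illustrative sketch; the directory layout is hypothetical). Because Lucene segment files are written once and never modified in place, a snapshot can hard-link the existing files at essentially zero cost, and a follow-up rsync to another node only needs to transfer new segments.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of a hard-link mirror: snapshot an index directory cheaply by
// hard-linking its files (the equivalent of "cp -al"), so a subsequent
// rsync only transfers new segments. Lucene segment files are write-once,
// which is what makes this safe.
public class HardLinkMirror {
    public static void mirror(Path indexDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    Path target = snapshotDir.resolve(file.getFileName());
                    if (!Files.exists(target)) {
                        // Same inode as the original: no data is copied.
                        Files.createLink(target, file);
                    }
                }
            }
        }
    }
}
```

Hard links require that the snapshot live on the same filesystem as the index; the cross-machine transfer is then handled by rsync against the snapshot.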
Database hosted search index.
Where a simple cluster setup is required, a database-hosted search index is a straightforward option. There are, however, significant drawbacks with this approach, the most notable being the drop in performance. The index is stored as blocks in BLOBs inside the database. These BLOBs are stored in a block structure to eliminate most of the unnecessary loading; however, each BLOB bypasses any disk block cache on the local machine and has to be streamed over the network. If the database supports seekable BLOBs within the database itself, it is possible to minimize unnecessary network traffic. Oracle has this support. Where the database only emulates this behavior (MySQL), the performance is poor, as the complete BLOB needs to be streamed over the network. In addition, the speed of access is slower since a SQL statement has to be executed for each data access.
The net result is slower performance.
This mechanism is available, but the performance is probably unacceptable.
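The block structure described above can be sketched as follows (purely illustrative, stdlib only; in a real JdbcDirectory-style store each block would be a row or BLOB and each fetch a SQL round-trip). Splitting an index file into fixed-size blocks means a positioned read fetches only the blocks it touches, rather than streaming the whole BLOB, but every block fetched is still a network round-trip.

```java
import java.util.*;
import java.nio.charset.StandardCharsets;

// Sketch of a block-structured index store: a file is split into fixed-size
// blocks so a positioned read fetches only the blocks it covers. The in-memory
// map stands in for database rows; the fetch counter stands in for the SQL
// round-trips that make the real thing slow.
public class BlockStore {
    static final int BLOCK_SIZE = 4;          // tiny, for illustration only
    private final Map<Long, byte[]> blocks = new HashMap<>();
    long fetches = 0;                         // simulated round-trips

    public void write(byte[] data) {
        for (long i = 0; i * BLOCK_SIZE < data.length; i++) {
            int from = (int) (i * BLOCK_SIZE);
            int to = Math.min(from + BLOCK_SIZE, data.length);
            blocks.put(i, Arrays.copyOfRange(data, from, to));
        }
    }

    public byte[] read(long offset, int len) {
        long first = offset / BLOCK_SIZE;
        long last = (offset + len - 1) / BLOCK_SIZE;
        fetches += last - first + 1;          // one simulated round-trip per block
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            long pos = offset + i;
            out[i] = blocks.get(pos / BLOCK_SIZE)[(int) (pos % BLOCK_SIZE)];
        }
        return out;
    }

    public static void main(String[] args) {
        BlockStore store = new BlockStore();
        store.write("abcdefghij".getBytes(StandardCharsets.UTF_8));
        // Reading 3 bytes in the middle touches only one 4-byte block.
        System.out.println(new String(store.read(4, 3), StandardCharsets.UTF_8)); // prints efg
    }
}
```

A database without seekable BLOBs cannot skip to `first`; it has to stream every block from the start of the BLOB, which is the MySQL case described above.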
Distributed File System
Real search engines use a distributed file system that provides a self-healing file system, where the data itself is distributed across multiple nodes in such a way that the file system can recover from the loss of one or more nodes. The original file system of this form is the Google File System, and the Nutch Distributed File System is modeled on it. Both implementations use a scatter-gather algorithm detailed by Google in map-reduce (see Google Labs).
This approach results in every node containing a part of the file system. Where the index has grown so large that storing the complete index on every node in the cluster is impractical, this approach becomes more attractive.
At the moment there are no plans to provide an implementation of a distributed file system within Sakai.
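The scatter-gather idea can be shown in miniature (purely illustrative; a real map-reduce over a distributed file system runs each map task on the node that already holds that chunk of data, and the "chunks" here are just in-memory strings):

```java
import java.util.*;
import java.util.stream.*;

// Miniature scatter-gather in the map-reduce style: the input is split into
// chunks ("scattered" across workers), each chunk is mapped independently,
// and the partial results are gathered and reduced into one answer.
public class ScatterGather {
    public static Map<String, Long> wordCount(List<String> chunks) {
        return chunks.parallelStream()                      // scatter: one task per chunk
            .flatMap(chunk -> Arrays.stream(chunk.toLowerCase().split("\\W+")))
            .filter(word -> !word.isEmpty())                // map: emit one record per word
            .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // gather + reduce
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("the index the node", "the cluster");
        System.out.println(wordCount(chunks).get("the")); // prints 3
    }
}
```

In GFS-style systems the same pattern applies to index building and merging: each node reduces its local chunks, and only the small partial results cross the network.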
Database clustered local search
In this approach, indexes are used from local disk but backed up to the database as Lucene segments. When a cluster app node starts, it synchronizes its local copy of the search index with the database. When new content is added by one of the cluster app nodes, it updates the backup copy in the database. On receipt of index reload events, all cluster app nodes resynchronize with the database, downloading changed and new search segments.
This mechanism is in the process of being tested; it exhibits the same performance as a local-disk search for a 200MB index with 80,000 documents.
Once this mechanism is completely tested it will become the default OOTB mechanism, as it works both with a single cluster node and with more than one. The added advantage of this mechanism is that the index is stored in the database.
It will also be possible to implement this mechanism with a shared filestore acting as the backup location.
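The resynchronization step described above can be sketched like this (an illustrative sketch; the class and method names are invented, segments are modeled as name-to-version entries, and the map stands in for the database or shared-filestore backup):

```java
import java.util.*;

// Sketch of the resync step: on an index-reload event each node compares its
// local segment list with the backup copy and downloads only changed or new
// segments. Segment files are write-once, so a version (or checksum) per
// segment name is enough to decide what to fetch.
public class SegmentSync {
    // Returns the segment names this node must download from the backup.
    public static List<String> segmentsToFetch(Map<String, Long> local,
                                               Map<String, Long> backup) {
        List<String> toFetch = new ArrayList<>();
        for (Map.Entry<String, Long> entry : backup.entrySet()) {
            Long localVersion = local.get(entry.getKey());
            if (localVersion == null || localVersion < entry.getValue()) {
                toFetch.add(entry.getKey()); // new segment, or changed since last sync
            }
        }
        Collections.sort(toFetch);
        return toFetch;
    }
}
```

Because only the segments in the returned list cross the network, a node that is already mostly up to date does very little IO on each reload event.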