Katta: Lucene-based Scalable Distributed Real-time search solution

Last Update:2018-12-06 Source: Internet

Author: User



Katta-Lucene & more in the cloud.

Introduction
Katta is a distributed application that runs on many commodity hardware servers, much like Hadoop MapReduce, Hadoop DFS, HBase, Bigtable, or Hypertable.

Overview
The master server manages the slave node servers and the assignment of index shards to them. The slave node servers serve the index shards. The client sends a search to all connected nodes, merges the partial results into a single result list, and returns it to the caller.

Data Structure
A Katta index is a folder that contains a set of so-called index shards, each stored as a subfolder. Each subfolder contains a Lucene index.
Index shards are easy to create with Lucene's IndexWriter. Creating a Katta index amounts to copying a group of Lucene indexes into one folder. A Katta index can therefore be built with Hadoop MapReduce (Katta provides some tools for this), on a single server, or in any other way that meets your requirements.
This lets you choose the index structure that best suits your application. For example, you can place documents that share related terms in the same shard.
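
The folder layout described above can be illustrated with a short Python sketch. It is not part of Katta: the function names `build_katta_index` and `list_shards` and the shard folder names are hypothetical, and a placeholder file stands in for the real Lucene index files.

```python
import shutil
import tempfile
from pathlib import Path


def build_katta_index(shard_dirs, index_dir):
    """Copy a group of existing shard folders into one index folder.

    Each entry in shard_dirs is assumed to be a folder that already holds
    a Lucene index, however it was built.
    """
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    for shard in map(Path, shard_dirs):
        shutil.copytree(shard, index_dir / shard.name)
    return index_dir


def list_shards(index_dir):
    """Return the shard names of an index: its immediate subfolders."""
    return sorted(p.name for p in Path(index_dir).iterdir() if p.is_dir())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        shards = []
        for name in ("shard-0", "shard-1", "shard-2"):
            shard = tmp / "build" / name
            shard.mkdir(parents=True)
            # Placeholder for the real Lucene index files.
            (shard / "segments.placeholder").write_text("lucene index files")
            shards.append(shard)
        index = build_katta_index(shards, tmp / "my-index")
        return list_shards(index)
```

Calling `demo()` builds three placeholder shards, copies them into one index folder, and returns the shard names found there.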

Communication between master and slave nodes
Communication between master and slave nodes is critical in a distributed system. The master must learn as quickly as possible when a slave node has failed. This kind of communication is usually implemented with heartbeat messages between master and slave nodes, but Katta takes a different approach. It uses Zookeeper, a distributed configuration and locking system that originated as a Yahoo Research project. Zookeeper lets you read and write data in a kind of distributed virtual file system, although it is not a real file system. On startup, each slave node announces itself by writing an ephemeral file into a folder named "/nodes". The master subscribes to changes in this folder. If a slave node fails, Zookeeper removes its ephemeral file and notifies the master. A similar procedure handles failure of the master: only the active master writes the "/master" file, and the secondary masters subscribe to notifications on this file.
In Katta, all communication between master and slave nodes goes through Zookeeper, using the following entries:
"/Index" --- this entry is written when a new index is deployed.
"/Nodes-to-shards" --- this directory holds one folder per slave node; each folder contains the list of index shards assigned to that node.
"/Shards-to-nodes" --- this directory holds one folder per shard; each folder lists the slave nodes on which that shard is deployed.
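
The "/nodes" mechanism can be illustrated with a toy, in-memory Python model. This is only a sketch of the idea of ephemeral entries plus change notifications. `ToyCoordinator` and its methods are hypothetical names and do not mirror the Zookeeper client API, and a real deployment would detect failures through session timeouts rather than an explicit call.

```python
class ToyCoordinator:
    """In-memory stand-in for the coordination service (not the Zookeeper API)."""

    def __init__(self):
        self._entries = {}   # path -> id of the session that owns it
        self._watchers = {}  # folder -> list of callbacks

    def subscribe(self, folder, callback):
        """Register a callback that receives the folder's children on each change."""
        self._watchers.setdefault(folder, []).append(callback)

    def create_ephemeral(self, path, session):
        """Create an entry that lives only as long as its session."""
        self._entries[path] = session
        self._notify(path)

    def expire_session(self, session):
        """Simulate a crashed process: drop every entry its session owned."""
        for path in [p for p, s in self._entries.items() if s == session]:
            del self._entries[path]
            self._notify(path)

    def children(self, folder):
        prefix = folder.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._entries if p.startswith(prefix))

    def _notify(self, path):
        folder = path.rsplit("/", 1)[0] or "/"
        for callback in self._watchers.get(folder, []):
            callback(self.children(folder))


def demo():
    coordinator = ToyCoordinator()
    seen = []  # what the master observes after each change
    coordinator.subscribe("/nodes", seen.append)
    coordinator.create_ephemeral("/nodes/node-a", session="s1")
    coordinator.create_ephemeral("/nodes/node-b", session="s2")
    coordinator.expire_session("s1")  # node-a fails
    return seen
```

In `demo()`, the master's callback fires three times: once for each node that starts, and once more when the failed node's entry disappears, with no heartbeat traffic between master and nodes.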

Client node communication
When a search request arrives, the client communicates with the slave nodes directly. For communication between client and nodes we chose Hadoop RPC, a fast and easy-to-use Java implementation of synchronous communication (Apache MINA is also fast, but it is asynchronous). For each search request, the client determines which nodes serve shards of the requested index and sends the request to all of those nodes.
All requests are issued from multiple threads, and Hadoop RPC keeps the TCP/IP connections open.
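
The fan-out and merge pattern can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Katta's client code: `search_node` is a hypothetical stand-in for one synchronous RPC call, each node is modelled as a dictionary of documents, and the score is just a term count.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_node(node, query, limit):
    """Stand-in for one synchronous RPC call to a node.

    `node` is a dict {doc_id: text}; the score is the number of times
    the query term occurs in the text.
    """
    hits = [(text.split().count(query), doc_id) for doc_id, text in node.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(limit, hits)


def search_all(nodes, query, limit):
    """Send the query to every node in parallel and merge the hit lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node = list(pool.map(lambda n: search_node(n, query, limit), nodes))
    merged = [hit for hits in per_node for hit in hits]
    return heapq.nlargest(limit, merged)


NODES = [
    {"a1": "lucene lucene search", "a2": "hadoop"},
    {"b1": "lucene index"},
    {"c1": "lucene lucene lucene"},
]
```

Each node returns at most `limit` hits, so the client only has to merge a small list per node to obtain the overall top results.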

Load shards to a node
Because performance is crucial in search, Katta first copies the shards to the local hard disk of the node.
Any URL that the Hadoop file system understands can be used as a source. For example, "file:" deploys an index from a local file system that all nodes can access, and "hdfs:" deploys an index from Hadoop's distributed file system. Amazon S3 is also supported; see the Hadoop file system documentation for details.
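
As a rough Python sketch of the copy-before-serve step, the function below picks a handler from the URL scheme and copies a shard folder to local disk. It handles only "file:" URLs. The name `fetch_shard` is hypothetical, and other schemes are deliberately left unimplemented because they would need a Hadoop or cloud storage client.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def fetch_shard(source_uri, local_dir):
    """Copy one shard folder to local disk before it is served."""
    parsed = urlparse(source_uri)
    if parsed.scheme != "file":
        # hdfs:, s3: and others would need a Hadoop or cloud client here.
        raise NotImplementedError(f"scheme {parsed.scheme!r} not handled in this sketch")
    source = Path(url2pathname(parsed.path))
    target = Path(local_dir) / source.name
    shutil.copytree(source, target)
    return target


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        shard = Path(tmp) / "shared" / "shard-0"
        shard.mkdir(parents=True)
        # Placeholder for the real Lucene index files.
        (shard / "segments.placeholder").write_text("index data")
        local = fetch_shard(shard.as_uri(), Path(tmp) / "node-local")
        return local.name, sorted(p.name for p in local.iterdir())
```

Searches would then run against the local copy rather than the shared or remote source.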

Distributed scoring
Katta provides distributed scoring, because we cannot expect terms to be evenly distributed across all shards.
Each search query therefore takes two network round-trips in Katta: first we retrieve the document frequencies for the query from all nodes, and then we send the search to all nodes together with these values. Note that we also provide a simple count method, which only returns the number of documents matching a query and does not need both round-trips.
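
The reason for the first round-trip can be shown with a small Python sketch. It assumes a classic tf-idf style score with idf = 1 + ln(N / (df + 1)); this formula and all function names are illustrative assumptions, not Katta's actual scoring code. Shards are modelled as dictionaries of documents.

```python
import math


def doc_freqs(shard, terms):
    """Round-trip 1: a shard reports how many of its documents contain each term."""
    return {t: sum(1 for text in shard.values() if t in text.split()) for t in terms}


def global_idf(shards, terms):
    """Client side: sum the per-shard counts and derive one idf per term."""
    total_docs = sum(len(shard) for shard in shards)
    df = {t: sum(doc_freqs(shard, [t])[t] for shard in shards) for t in terms}
    return {t: 1.0 + math.log(total_docs / (df[t] + 1)) for t in terms}


def search_shard(shard, terms, idf):
    """Round-trip 2: score documents using the idf values sent by the client."""
    hits = []
    for doc_id, text in shard.items():
        words = text.split()
        score = sum(words.count(t) * idf[t] for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def local_idf(shard, term):
    """What a shard would compute if it only knew its own documents."""
    return 1.0 + math.log(len(shard) / (doc_freqs(shard, [term])[term] + 1))


# The term "katta" is common in shard A and rare in shard B.
SHARD_A = {"a1": "katta", "a2": "katta", "a3": "katta"}
SHARD_B = {"b1": "katta", "b2": "lucene", "b3": "hadoop"}
```

With purely local statistics, the identical documents `a1` and `b1` would receive different scores because the term is rarer in shard B. With the shared idf values they score the same, so hits from different shards can be merged fairly.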

Integration
Katta provides a Java API for managing the system, which can be integrated into your own management and monitoring applications (for details, see Katta's Java API documentation).
Katta also provides a Java API for searching indexes (Client.java); this is the integration point for connecting search to your website or application.
Finally, Katta provides a command-line tool for system-level functions such as deploying and undeploying index shards.

Official introduction:

Katta-Lucene & more in the cloud.
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and hadoop mapfiles.

Makes serving large or high load indices easy

Serves very large Lucene or hadoop mapfile indices as index shards on serving servers

Replicate shards on different servers for performance and fault-tolerance

Supports pluggable network topologies

Master fail-over

Fast, lightweight, easy to integrate

Plays well with hadoop Clusters

Apache version 2 license

Official homepage: http://katta.sourceforge.net/

Official documents: http://katta.sourceforge.net/documentation

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
