Introduction to SolrCloud Principles

Source: Internet
Author: User
Tags create index solr zookeeper
One. Introduction

SolrCloud is a distributed search solution based on Solr and ZooKeeper, available since Solr 4.0. Put another way, SolrCloud is one way of deploying Solr, built on ZooKeeper. Solr can also be deployed in other ways, such as standalone mode or multi-machine master-slave mode.

Two. Key Features

SolrCloud has several features:

Centralized configuration: Configuration is managed centrally in ZooKeeper (ZK). At startup, you can specify that Solr's configuration files be uploaded to ZooKeeper, where they are shared by multiple machines. The configuration stored in ZK is not cached locally; Solr reads it directly from ZK, so changes to a configuration file are seen by all machines. In addition, some of Solr's tasks are published through ZK as the medium. The purpose is fault tolerance: if a machine receives a task but crashes while performing it, the unfinished task can be performed again after the machine restarts or after the cluster elects a replacement.

Automatic fault tolerance: SolrCloud shards the index and creates multiple replicas of each shard. Every replica can serve external requests, so one replica going down does not affect the indexing service. Better still, SolrCloud can automatically rebuild the index replicas of a failed machine on other machines and put them into service.

Near-real-time search: Updates are pushed to the replicas immediately (slower push is also supported). Newly added documents become searchable within seconds.

Automatic query load balancing: The replicas of a SolrCloud index can be distributed across multiple machines, which spreads the query load. If the query load is heavy, it can be relieved by adding machines and increasing the number of replicas.

Automatic index distribution and sharding: A document can be sent to any node, and it is forwarded to the correct node.

Transaction log: The transaction log ensures that updates are not lost, even if the documents have not yet been indexed to disk.
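The "send a document to any node" behavior described above can be illustrated with a minimal Python sketch. It assumes each shard owns an equal slice of a 32-bit hash space and uses an MD5-based hash purely as a stand-in; the hash function, the node names, and the helpers `shard_for` and `route` are hypothetical and are not Solr's actual implementation.

```python
import hashlib
from typing import List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Return the 0-based index of the shard that owns doc_id.

    Assumption: an MD5-derived 32-bit value stands in for the real
    hash function; only the range-partitioning idea matters here.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:4], "big")  # unsigned 32-bit value
    range_size = 2**32 // num_shards                # equal slice per shard
    # Clamp so the few values past the last full slice go to the last shard.
    return min(hash_value // range_size, num_shards - 1)


def route(doc_id: str, receiving_node: str, shard_leaders: List[str]) -> str:
    """Return the node that should index the document.

    receiving_node is deliberately unused: the owner depends only on
    the document ID, so every node forwards to the same place.
    """
    return shard_leaders[shard_for(doc_id, len(shard_leaders))]


if __name__ == "__main__":
    leaders = ["node-a", "node-b", "node-c"]  # hypothetical shard leaders
    for entry_point in leaders:
        print(entry_point, "->", route("doc-42", entry_point, leaders))
```

Because the owner is derived from the document ID alone, the client does not need to know the shard layout.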

Other features that are worth mentioning are:

Index storage on HDFS: An index is usually a few GB to tens of GB in size and rarely reaches hundreds of GB, so this feature may see little practical use. However, if you have billions of records to index, it is worth considering. I think the best use of this feature may be in combination with the next one, "Creating indexes in bulk with MapReduce".

Creating indexes in bulk with MapReduce (MR): With this feature, there is little reason left to worry about slow index creation.

Powerful RESTful API: Almost every management function you can think of can be invoked through this API, which makes it much easier to write maintenance and management scripts.

Excellent admin interface: The key information is visible at a glance, the deployment layout of SolrCloud is shown graphically, and of course the essential debugging features are there.
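As an illustration of scripting against the management API, here is a minimal Python sketch that builds a Collections API `CREATE` request URL. The base URL (`http://localhost:8983/solr`), the collection name `test`, and the helper `create_collection_url` are assumptions for the example. The sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor,
                          config_name=None):
    """Build a Collections API CREATE URL (nothing is sent over the network)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    if config_name is not None:
        # Optional: name of a config set already stored in ZooKeeper.
        params["collection.configName"] = config_name
    return "{}/admin/collections?{}".format(base_url, urlencode(params))


if __name__ == "__main__":
    # Assumed local Solr address; adjust for a real cluster.
    print(create_collection_url("http://localhost:8983/solr", "test", 1, 2))
```

A maintenance script could pass the resulting URL to any HTTP client; keeping URL construction separate makes the script easy to check without a running cluster.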

Three. Concepts

Collection: A complete logical index in a SolrCloud cluster. It is usually divided into one or more shards, all of which use the same config set. If there is more than one shard, the index is distributed, and SolrCloud lets you refer to it by collection name without worrying about the shard-related parameters that distributed search would otherwise require.

Config set: The set of configuration files that a Solr core needs in order to provide service. Each config set has a name. At minimum it includes solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); other files may be needed depending on what these two files reference. Config sets are stored in ZooKeeper. They can be uploaded or updated with the upconfig command, or initialized or updated by specifying Solr's startup parameter bootstrap_confdir.

Core: A Solr core. A Solr instance contains one or more Solr cores. Each core can provide indexing and query functionality independently, and each corresponds to one index or to one shard of a collection. Cores were introduced to increase management flexibility and to share resources. What is different in SolrCloud is that a core uses configuration stored in ZooKeeper, whereas a traditional Solr core keeps its configuration files in a configuration directory on disk.

Leader: The shard replica that won the election. Each shard has multiple replicas, and these replicas hold an election to determine a leader. Elections can occur at any time, but usually they are triggered only when a Solr instance fails. When documents are indexed, SolrCloud passes them to the leader of the corresponding shard, and the leader distributes them to all replicas of that shard.

Replica: A copy of a shard. Each replica lives in one Solr core. For example, creating a collection named "test" with numShards=1 and replicationFactor=2 produces 2 replicas, that is, 2 cores, each on a different machine or Solr instance. One is named test_shard1_replica1 and the other test_shard1_replica2. One of them is elected leader.

Shard: A logical slice of a collection. Each shard consists of one or more replicas, and an election determines which replica is the leader.

ZooKeeper: ZooKeeper provides distributed locking, which is essential for SolrCloud. It handles leader election. Solr can run with an embedded ZooKeeper, but a standalone ZooKeeper is recommended, preferably with 3 or more hosts.
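The relationship between shards, replicas, and leader election described above can be modeled with a short Python sketch. It assumes the common ZooKeeper election recipe, in which each participant receives an increasing sequence number and the live participant with the lowest number is the leader. The source does not say whether Solr follows this recipe in detail, so treat the `Shard` class as a hypothetical toy model rather than Solr's implementation.

```python
class Shard:
    """Toy model of one shard whose replicas elect a leader."""

    def __init__(self, name):
        self.name = name
        self._next_seq = 0
        self._live = {}  # replica name -> sequence number assigned on join

    def join(self, replica):
        """A replica comes up and registers for the election."""
        self._live[replica] = self._next_seq
        self._next_seq += 1

    def fail(self, replica):
        """A replica goes down; its registration disappears."""
        self._live.pop(replica, None)

    def leader(self):
        """Return the live replica with the lowest sequence number, or None."""
        if not self._live:
            return None
        return min(self._live, key=self._live.get)


if __name__ == "__main__":
    shard = Shard("shard1")
    shard.join("test_shard1_replica1")
    shard.join("test_shard1_replica2")
    print(shard.leader())               # first replica to register leads
    shard.fail("test_shard1_replica1")  # simulated instance failure
    print(shard.leader())               # the remaining replica takes over
```

In this model a new election outcome appears only when the current leader fails, which matches the observation that elections are usually triggered by a failed Solr instance.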

Four. Architecture Diagrams

Logical diagram of an index (collection)

Mapping between Solr instances and the index



Index creation process



Distributed queries



Shard splitting



Five. Other

Near-real-time (NRT) search: Solr writes index data to disk at commit time. This is a hard commit, and it ensures that no data is lost even in a power outage. To provide more real-time search, Solr also offers a soft commit. With a soft commit, the data is committed only to memory: the index changes become visible to searches, but nothing is written to the on-disk index files at that point.

A common configuration is to trigger a hard commit automatically every 1-10 minutes and a soft commit automatically every second.
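The difference between the two commit types can be sketched with a toy Python model. The `ToyIndex` class and its method names are hypothetical. The sketch also assumes that a hard commit makes documents visible as well as durable, which the text above does not state.

```python
class ToyIndex:
    """Toy model of visibility (soft commit) versus durability (hard commit)."""

    def __init__(self):
        self.added = []         # every document received so far, in order
        self.visible_count = 0  # how many of them searches can see
        self.durable_count = 0  # how many are in the on-disk index files

    def add(self, doc):
        self.added.append(doc)

    def soft_commit(self):
        # Cheap: makes new documents searchable, writes nothing to disk.
        self.visible_count = len(self.added)

    def hard_commit(self):
        # Expensive: writes everything received so far to disk.
        self.durable_count = len(self.added)
        # Assumption: a hard commit also makes the documents visible.
        self.visible_count = len(self.added)

    def search(self):
        return self.added[:self.visible_count]

    def on_disk(self):
        return self.added[:self.durable_count]


if __name__ == "__main__":
    index = ToyIndex()
    index.add("doc-1")
    print(index.search(), index.on_disk())  # nothing visible or durable yet
    index.soft_commit()
    print(index.search(), index.on_disk())  # visible, but not yet on disk
    index.hard_commit()
    print(index.search(), index.on_disk())  # visible and on disk
```

Running soft commits frequently and hard commits rarely, as in the configuration above, keeps search results fresh while limiting the number of expensive disk writes.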

Realtime Get: Realtime get lets you retrieve the latest version of any document by its unique key, without reopening a searcher. It is mainly used when Solr serves as a NoSQL data store rather than only as a search engine. Realtime get currently depends on the transaction log and is enabled by default. In addition, even for updates made with a soft commit or commitWithin, get returns the real, latest data. Note: commitWithin is a commit option that does not commit immediately but requires the data to be committed within a specified period of time.
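The following toy Python model shows why realtime get can return a document before any searcher is reopened: the lookup consults the not-yet-committed updates first and only then falls back to what the current searcher sees. The `ToyStore` class, its dictionary-based "transaction log", and its method names are hypothetical simplifications, not Solr's internals.

```python
class ToyStore:
    """Toy model of realtime get backed by a transaction log."""

    def __init__(self):
        self.tlog = {}      # unique key -> latest uncommitted version
        self.searcher = {}  # unique key -> version the current searcher sees

    def update(self, key, doc):
        # Every update is recorded in the log before any commit happens.
        self.tlog[key] = doc

    def commit(self):
        # Reopen the "searcher": committed data now includes logged updates.
        self.searcher.update(self.tlog)
        self.tlog.clear()

    def search(self, key):
        # Ordinary search only sees what the current searcher sees.
        return self.searcher.get(key)

    def realtime_get(self, key):
        # Check uncommitted updates first, then fall back to the searcher.
        if key in self.tlog:
            return self.tlog[key]
        return self.searcher.get(key)


if __name__ == "__main__":
    store = ToyStore()
    store.update("1", {"title": "first version"})
    print(store.search("1"))        # not visible to search yet
    print(store.realtime_get("1"))  # already retrievable by key
```

The lookup is by unique key only, which is what makes it suitable for key-value style access rather than full-text queries.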
