Solr in Action, Chapter 1: Introduction to Solr


Solr is an enterprise-grade search engine that scales out, deploys quickly, searches massive volumes of text-centric data, and sorts the returned results by relevance.

Scalability: Solr can distribute indexing and query operations across multiple servers in a cluster.

Quick deployment: Solr is open-source software that is easy to install and configure. You can run Solr directly with the sample configuration shipped in the installation package.

Optimized for search: Solr is fast; even complex queries are typically processed in sub-second time, often within tens of milliseconds.

Massive text volumes: Solr is designed to handle very large text collections, from a million records upward.

Text-centric data: Solr is optimized for searching natural-language text content, such as emails, web pages, resumes, and PDF documents, as well as social media content such as tweets, Weibo posts, and blogs.

Results sorted by relevance: Solr orders results by how relevant each document is to the user's query, so the most relevant results are returned first.

Why do I need a search engine?

Data well suited to a search engine such as Solr has five main characteristics:

    Text-centric data

    Reads far outnumber writes

    Document-oriented data

    Flexible Schema

    Massive data volume, that is, "Big Data"

    Text-centric data

    A search engine is designed to extract the implicit structure of text data and build indexes from it to make queries efficient. "Text-centric" means that the text in the documents contains the content users want to query. Search engines also support non-text data, such as numbers, but their main strength is processing natural-language text.

    Both words matter. "Text" is obvious, but "centric" is just as important: if your users are not interested in the content of the text itself, a search engine may not be the best tool for your problem. Consider an application employees use to file travel expense reports. Each report contains structured data such as date, expense type, exchange rate, and amount, plus perhaps a short note describing each charge. The application contains text, but it is not "text-centric": when the accounting department generates monthly expense reports from this data, nobody searches the note text. The text is not the primary content of interest. In short, not all data that contains text is a good fit for a search engine.

    Reads far outnumber writes

    First, to be clear: Solr does let you update existing documents in the index. "Reads far outnumber writes" means that documents are read far more often than they are created or updated; it does not mean you cannot write at all, or that you may only write at some fixed rate. In fact, a key feature of Solr 4 is near-real-time search, which lets you index thousands of documents per second and query those newly added documents almost immediately.

    The point behind "reads far outnumber writes" is that once data is written to Solr, it should be read many times over its lifetime. Put another way, a search engine is not primarily for storing data; it is primarily for querying the data that has been stored (every query request is a read operation).

    Document-oriented data

    A document is a self-contained collection of fields. Each field holds only data values and cannot nest other fields. In other words, in a search engine such as Solr, documents are flat, and there are no dependencies between documents. "Flat" is interpreted somewhat loosely in Solr: a field may hold multiple values, but it may not contain child fields. That is, you can store several values in one field, but you cannot nest one field inside another.
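As a sketch of what "flat" means, the hypothetical document below (field names are illustrative, not from the book) stores several values in one field, which is allowed, while nesting a field inside a field is not:

```python
# A sketch of a flat, Solr-style document: field names mapped to values.
doc = {
    "id": "resume-42",
    "name": "Jane Doe",
    # A multi-valued field: one field, several plain values -- allowed.
    "skills": ["java", "lucene", "solr"],
}

def is_flat(document):
    """Return True if no field value contains a nested field mapping."""
    for value in document.values():
        if isinstance(value, dict):
            return False
        if isinstance(value, list) and any(isinstance(v, dict) for v in value):
            return False
    return True

print(is_flat(doc))                              # legal flat document
print(is_flat({"skills": {"primary": "java"}}))  # nested field: not flat
```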

    Flexible Schema

    In a relational database, every row of a table must have the same structure. In Solr, documents may have different sets of fields. Documents in the same index should share at least some common fields so they can be searched together, but they do not all need the same field structure.

      Massive data volume, that is, "Big Data"

      Search engines such as Solr are built to grow with very large document collections; the sharding and replication features discussed later exist precisely so the index can scale with your data.

      Common search engine use cases

      1. Basic keyword query

      Relevant results must come back quickly; in most cases, a user's query string should be answered within one second. Beyond raw speed, a good keyword search should:

      • Automatically correct spelling errors in the user's input
      • Auto-complete the user's input to reduce typing effort, which is especially common in mobile applications
      • Handle synonyms of the query terms
      • Match documents containing linguistic variations of the query terms (words that are related but not identical to those in the query)
      • Handle phrases: should all words of a phrase match, or only some of them?
      • Handle common stop words such as "a", "an", "of", and "the"
      • Return further results when the user is not satisfied with the top hits
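A basic keyword query reaches Solr as a plain HTTP request. The sketch below only assembles the request URL and does not contact a server; the host, collection name ("books"), and field names are assumptions, while `q`, `fl`, `rows`, and `wt` are standard Solr request parameters:

```python
from urllib.parse import urlencode

# Hypothetical collection and fields; q/fl/rows/wt are standard Solr params.
params = {
    "q": "solr in action",   # the user's keywords
    "fl": "id,title,score",  # fields to return for each hit
    "rows": 10,              # return only the top 10 documents
    "wt": "json",            # response format
}
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(url)
```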

      2. Sorted search results

      A search engine returns result documents in descending order of score, a number expressing how well each document matches the query. The score is computed from a combination of factors, but in general, the higher the score, the more relevant the document is to the query. If you want to influence the ordering, you can boost particular documents, fields, or query terms, or raise a document's relevance score directly.
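Boosting can be sketched as follows (the field names are hypothetical; the `^` boost syntax is standard Lucene/Solr query syntax). Here a title match is weighted twice as heavily as a body match:

```python
from urllib.parse import urlencode

# "^2.0" doubles the weight of title matches relative to body matches.
q = "title:(solr in action)^2.0 OR body:(solr in action)"
print(urlencode({"q": q}))  # ^ is URL-encoded as %5E
```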

      3. Beyond keyword queries

      Along with the documents matching the user's initial query, a search engine should give the user tools for refining the query and learning more. In other words, besides returning matching documents, you should return tools that show the user what to do next. For example, you can group query results by their attributes so users can browse and narrow them as needed. This feature is called faceted search, and it is one of Solr's highlights.
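Mechanically, a faceted query is just extra request parameters. In this sketch the field names are hypothetical, while `facet=true` and `facet.field` are standard Solr faceting parameters (repeated once per facet):

```python
from urllib.parse import urlencode

# Ask Solr for per-category counts alongside the normal results.
params = [
    ("q", "camera"),
    ("facet", "true"),
    ("facet.field", "manufacturer"),  # one bucket of counts per manufacturer
    ("facet.field", "price_range"),   # repeat the parameter for each facet
]
query_string = urlencode(params)
print(query_string)
```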

      4. What search engines are not suited for

      Suppose a query matches millions of documents and all of them must be returned at once: you will wait a long time. The query itself executes quickly, but reconstructing millions of documents from the index structure is slow. Solr and other search engines store fields on disk in a way that is optimized for materializing a small number of result documents quickly; generating a huge result set in one go from this storage layout takes a great deal of time.

      Another poor fit is deep analysis work that must read a large subset of the index. Even if result paging avoids the problem just described, any analysis that reads large amounts of data from the index will still hit severe performance problems, because the index's underlying data structures are not designed for reading large volumes at once.

      Search engine technology is also unsuited to queries across documents. Solr does support parent-child queries, but it does not support queries over complex relational data structures.

      What is Solr?

      2.1 Information retrieval engine

      Information retrieval (IR) is the process of finding, from within large collections of data (usually stored on computer systems), the material (usually documents) that satisfies an information need, based on unstructured attributes (usually text content).

      Under the hood, Solr uses Lucene, a Java library that builds and manages an inverted index, a specialized data structure for matching text queries.
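To make the idea concrete, here is a toy sketch (not Lucene's actual implementation) of an inverted index: each term maps to the set of documents containing it, so a multi-term query reduces to set intersection:

```python
from collections import defaultdict

docs = {
    1: "solr is a search engine",
    2: "lucene is a search library",
    3: "solr builds on lucene",
}

# Invert: term -> set of ids of documents containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():  # a real analyzer does far more than split()
        index[term].add(doc_id)

def search(*terms):
    """Return ids of documents containing all the given terms."""
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return sorted(result or [])

print(search("search"))          # documents 1 and 2
print(search("solr", "lucene"))  # only document 3 has both terms
```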

      2.2 Flexible schema management

      Although Lucene provides the core machinery for indexing documents and executing queries, it offers no convenient way to configure how the index should be built: to use Lucene directly, you write Java code that defines the fields and how those fields are analyzed. Solr adds a simple, declarative way to define the index structure, and lets you specify how each field should be analyzed, all through an XML configuration file named schema.xml. In its underlying implementation, Solr translates the schema.xml configuration into a Lucene index.

      In addition, Solr layers some nice extras on top of Lucene's core indexing. Specifically, Solr adds two field types: copy fields and dynamic fields. A copy field populates a new field with the raw text content of one or more other fields. A dynamic field applies the same field type to many fields without each being declared explicitly in schema.xml.
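A hypothetical schema.xml fragment illustrates all three declarations (the field names and types here are illustrative, though `field`, `dynamicField`, and `copyField` are the actual schema.xml elements):

```xml
<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <!-- Dynamic field: any field whose name ends in _txt gets this type. -->
  <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
  <!-- Catch-all field populated by the copyField rule below. -->
  <field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
</fields>
<!-- Copy field: feed the raw content of title into the catch-all field. -->
<copyField source="title" dest="catchall"/>
```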

      2.3 Java web application

      As a Java web application, Solr runs on any modern Java servlet engine, such as Jetty or Tomcat, and on full J2EE application servers such as JBoss or Oracle. To be easy to integrate, Solr's core services must be reachable from different applications and programming languages, so Solr exposes simple REST-like services built on standards such as XML, JSON, and HTTP. Strictly speaking, we avoid calling Solr's HTTP-based API "RESTful", because it does not follow every principle of REST (Representational State Transfer). Client libraries are available for many popular languages, including Python, Java, .NET, and Ruby.

      2.4 Multiple indexes on one server

      Solr supports running multiple Solr cores in a single Solr engine. Each core has its own index and configuration, and multiple cores can live in one Solr instance. One server can therefore manage several cores, making it easy to share server resources and to monitor and maintain the service. Solr provides dedicated APIs for creating and managing cores. One application of multi-core support is data partitioning: for example, one core handles the most recent documents while another handles older documents, a setup called chronological sharding.

      2.5 Extensibility (plug-ins)

      Solr has three major subsystems: document management, query processing, and text analysis. These are, of course, macro-level abstractions over much more complex machinery, which later chapters examine one by one. Each subsystem is built as a pipeline of functional modules, and you can splice new modules into the pipeline. This means that adding a feature to Solr does not require rewriting the whole query processing engine; you simply plug your new module in at the appropriate point. Solr's core functionality can thus be extended and customized to match your application's specific needs.

      2.6 Scalability

      The first card Solr plays for scalability is flexible cache management, which avoids repeating resource-intensive operations on the server. Specifically, Solr caches the results of expensive computations so they need not be recomputed; for example, Solr caches the computed results of query filters.

      Caching only goes so far. To handle more documents and achieve higher query throughput, you need to scale the system horizontally by adding servers.

      The two most common dimensions of Solr scaling are:

      The first is query throughput, addressed by adding replicas. To serve more queries, you add copies of the index on additional query servers, so that more servers can handle more requests at the same time.

      The other dimension is the number of indexed documents, addressed by sharding: the index is split into smaller pieces called "shards", and query requests are distributed across those shards.
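The idea of sharding can be sketched as hash-based document routing (a toy model; SolrCloud's real router also hashes the unique key, but the details differ):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to one of NUM_SHARDS shards."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same id always lands on the same shard, so queries can be fanned out
# to every shard and the per-shard results merged.
print(shard_for("doc-1") == shard_for("doc-1"))  # deterministic routing
print(0 <= shard_for("doc-2") < NUM_SHARDS)      # always a valid shard
```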

      2.7 Fault Tolerance

      Add a replica to each shard; when a shard crashes, requests for it are sent to its replica instead.

      Features overview

      This overview groups Solr's features into three areas: user experience, data modeling, and new Solr 4 features.

        3.1 User experience

        Solr provides a series of important features for building an easy-to-use, intuitive, and powerful search experience. Note, however, that Solr provides only a REST-like HTTP API; it does not provide UI components or a search-interface framework.

        Paging and sorting

        Solr does not return every result that matches a query; it optimizes for paged requests. When the first page of results is requested, only the top N documents are returned. If users do not find what they want on the first page, they can fetch subsequent pages through simple API calls and request parameters. Paging delivers two key benefits: 1) results come back faster, because each request returns only a small slice of the full result set; 2) you can track how many requests ask for further pages, a metric that indicates whether your relevance scoring is off.
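Paging is driven by two standard Solr parameters: `start` (the offset of the first hit) and `rows` (the page size). A sketch, using a hypothetical helper:

```python
from urllib.parse import urlencode

def page_params(query: str, page: int, page_size: int = 10) -> str:
    """Build the query string for one page of results (pages are 1-based)."""
    return urlencode({
        "q": query,
        "start": (page - 1) * page_size,  # offset of the first hit
        "rows": page_size,                # number of hits to return
    })

print(page_params("solr", page=1))  # q=solr&start=0&rows=10
print(page_params("solr", page=3))  # q=solr&start=20&rows=10
```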

        Faceted search

        Faceted search groups results by their attributes, giving users a tool for continuously refining their search terms and browsing the results. It is, in effect, a navigation feature.

        Auto-suggest

        Auto-suggest proposes query terms drawn from the documents indexed in the system: after typing just a few characters, the user receives a list of suggested query terms beginning with those characters. This greatly reduces the chance of mistyped queries, which matters all the more now that many users type searches on mobile device keypads.
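The core of auto-suggest can be sketched as prefix lookup over a sorted term list (a toy model; Solr's suggesters are more sophisticated):

```python
import bisect

# A tiny, hypothetical term dictionary; keeping it sorted makes
# prefix lookup a binary search followed by a short scan.
terms = sorted(["highlands", "highlight", "history", "solr", "sorting"])

def suggest(prefix: str, limit: int = 5):
    """Return up to `limit` indexed terms starting with `prefix`."""
    start = bisect.bisect_left(terms, prefix)
    out = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        out.append(term)
        if len(out) == limit:
            break
    return out

print(suggest("hi"))  # ['highlands', 'highlight', 'history']
print(suggest("so"))  # ['solr', 'sorting']
```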

        Spelling check

        Users who mistype a query term still expect the search engine to handle such small errors gracefully and return the right results. Solr supports two basic spell-checking modes:

        Auto-correction: Solr can automatically correct a misspelling based on whether the word exists in the index.

        "Are you looking... "Function: Solr can also suggest a better input scheme based on the user input. For example, when the user inputs "hilands", solr will suggest the user. "Are you looking for highlands? "

        Hit highlighting

        When searching documents containing large amounts of text, Solr's hit highlighting marks the matched content in the results. This is very useful for long documents: it lets the user locate the matched passages quickly within the text.

        Geospatial search

        Geospatial search is a great feature of Solr 4, which indexes latitude and longitude values so that documents can be sorted by geographic distance. Solr can find matching document records and rank the results by their distance from a given point (a specific latitude/longitude). Even more exciting, Solr 4 lets you draw geometric shapes on a map, such as polygons, and query locations by the intersection of different shapes.

        3.2 Data Modeling

        Field grouping

        Although Solr wants the documents it handles to be flat and denormalized as far as possible, it still lets you group documents that share a common attribute. Field grouping, also known as field collapsing, lets you return groups of related documents in addition to individual documents.

        Flexible query support

        Solr provides a series of powerful query functions, including:

        • Conditional logic with AND, OR, and NOT
        • Wildcard matching
        • Range queries over dates and numbers
        • Fuzzy search for approximate string matching
        • Regular expression matching
        • Function queries

          Join support

          In Solr, a join works more like an SQL subquery than a relational join: you do not create new documents by joining data across documents. Instead, for example, Solr's join can return child documents whose parent documents match the query conditions. This is useful when you need all of the comments on a post or tweet, since the comments are child documents of the original article.
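The parent/child join can be sketched in miniature (the data is hypothetical; this models the semantics only, not Solr's join syntax):

```python
# Parent documents (posts) and child documents (comments).
posts = [
    {"id": "p1", "text": "solr release announcement"},
    {"id": "p2", "text": "gardening tips"},
]
comments = [
    {"id": "c1", "parent": "p1", "text": "great news"},
    {"id": "c2", "parent": "p1", "text": "finally"},
    {"id": "c3", "parent": "p2", "text": "thanks"},
]

def comments_of_matching_posts(keyword):
    """Return ids of comments whose parent post matches the keyword."""
    matching = {p["id"] for p in posts if keyword in p["text"]}
    return [c["id"] for c in comments if c["parent"] in matching]

print(comments_of_matching_posts("solr"))  # children of the matching post
```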

          Document clustering

          Document clustering groups similar documents based on their content, which helps avoid returning many near-duplicate results for a query. For example, if your search engine backs a news application that pulls in articles through multiple RSS feeds, you may receive many reports of the same story at once. Returning all of those similar reports to users is clearly not a good idea; with clustering you can fold the similar reports into one group, pick a representative report, and return that to the user.

          Importing rich documents such as PDF and Word

          Sometimes you need to make existing general-purpose documents, such as PDF and Microsoft Word files, searchable as well. This is easy in Solr because Solr integrates directly with the Apache Tika project, which supports nearly all popular document formats.

          Data import from a relational database

          If the data you want to search lives in a traditional relational database, you can configure Solr to build its documents from SQL queries.

          Multi-language support

          Solr and Lucene have supported multilingual content for a long time. Solr includes built-in automatic language detection and provides text analysis chains for many languages.

          3.3 New features of Solr 4

          Near-real-time search

          Solr's near-real-time (NRT) search lets applications query newly added content within seconds of its being indexed. With NRT, Solr can handle fast-changing content, such as a news portal like toutiao.com or a social network.

          Atomic updates with optimistic concurrency

          Atomic updates let client applications add, update, or delete individual fields of an existing document without sending the entire document to Solr.

          What if two different clients try to update the same document at the same time? Solr uses optimistic concurrency control to avoid conflicting updates. In short, Solr carries a special field named _version_ to make document updates safe. When two users attempt to update the same document concurrently, the one who submits last is holding a stale version number, so that update request fails.
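The _version_ rule can be sketched as follows (a toy model of the semantics only; Solr enforces this server-side):

```python
# An update must carry the version it read, or it is rejected.
class ConflictError(Exception):
    pass

store = {"doc-1": {"title": "Solr", "_version_": 1}}

def update(doc_id, fields, expected_version):
    doc = store[doc_id]
    if doc["_version_"] != expected_version:
        raise ConflictError("stale version")  # the slower writer loses
    doc.update(fields)
    doc["_version_"] += 1
    return doc["_version_"]

# Two clients read version 1; the first update wins, the second fails.
v = update("doc-1", {"title": "Solr in Action"}, expected_version=1)
print(v)  # the document is now at version 2
try:
    update("doc-1", {"title": "Other"}, expected_version=1)
except ConflictError as e:
    print("rejected:", e)
```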

          Real-time get

          Solr also qualifies as a NoSQL technology, and its real-time get feature is squarely in the NoSQL style: it lets you fetch the latest version of a document by its unique identifier, without regard to whether that version has yet been committed to the index. Before Solr 4, content had to be committed to the Lucene index before it could be read. With Solr 4's real-time get, fetching a document by unique identifier is safely decoupled from building the Lucene index. This is useful when document content is updated after indexing: you do not have to commit again just to read the new content.

          Durable writes via a transaction log

          When a document is sent to Solr for indexing, it is written to a transaction log to prevent data loss in the event of a server failure. Solr's transaction log sits between the client application sending a document and that document being committed to the Lucene index. It also underpins real-time get, since a document's content can be fetched by its unique identifier whether or not it has been committed to the index.

          The transaction log lets Solr separate the durability of an update from its visibility. A document may exist in durable storage yet not appear in search results. Your application can therefore control exactly when new content is committed to the index, and hence when it becomes searchable, without worrying that uncommitted content will be lost if the server crashes before the commit.

          Easy sharding and replication with ZooKeeper

          With SolrCloud, horizontal scaling becomes simple and automated, because Solr uses Apache ZooKeeper to synchronize configuration and to manage the leader and replica copies of each shard. Apache's website describes ZooKeeper as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services".

          In Solr, ZooKeeper assigns each shard's leader and replicas and monitors whether servers are responding to queries. SolrCloud ships with an embedded ZooKeeper service, so you can start SolrCloud without extra configuration.
