Recently our team scheduled a new task. The project uses full-text search built on Solr, but the Solr cloud service has been unstable: queries often fail and require a manual full re-synchronization, and the service is maintained by another team, so the dependency is too tight. Whenever Solr has a problem, our project is essentially paralyzed, because every dependent query returns no data. So we are considering building an adaptation layer that automatically switches to a new search backend, Elasticsearch, when Solr runs into trouble.
This could actually be solved by redesigning the Solr cluster or adding fault tolerance at the service level. But setting aside the soundness of the original design, leadership wants this built, so I started down the road of standing up an Elasticsearch service from scratch. Since I had never worked with ES before, I am using this series to record my own development process.
What is full-text search
What is a full-text search engine?
The definition from Baidu Baike (the Baidu encyclopedia):
A full-text search engine is the mainstream type of search engine in wide use today. It works by scanning every word in a document and building an index entry for each word, recording how many times and where the word occurs in the document. When a user issues a query, the retrieval program searches against this pre-built index and returns the matching results. The process is similar to looking up a character in a dictionary by way of its lookup table.
The definition gives a rough idea of full-text search. To explain in more detail, let's start from the data in our daily lives.
The data in our lives generally falls into two categories: structured data and unstructured data.
- Structured data: data with a fixed format or limited length, such as database rows and metadata.
- Unstructured data: also called full-text data; data with no fixed length or format, such as emails and Word documents.
Of course, there is also a third category, semi-structured data, such as XML and HTML, which can be processed as structured data when needed, or have its plain text extracted and handled as unstructured data.
Following this classification of data, search likewise divides into two kinds: structured-data search and unstructured-data search.
For structured data, we can generally store and search it in the tables of a relational database (MySQL, Oracle, etc.), optionally building indexes to speed things up.
For unstructured (full-text) data, there are two main search methods: sequential scanning and full-text retrieval.
Sequential scanning: as the name suggests, this means scanning the text from start to finish looking for specific keywords.
For example, given a newspaper, find everywhere the text "RNG" appears in it. You would have to scan the paper from beginning to end and mark each section where the keyword occurs and the position where it appears.
This is obviously the most time-consuming and least efficient approach; if the typeface is small and the paper has many sections, your eyes will give out before you finish scanning.
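The sequential-scan idea can be sketched in a few lines of Python; the document list and keyword below are made up purely for illustration:

```python
# Sequential scan: check every document, character by character,
# for the keyword -- no index, so cost grows with total text size.
documents = [
    "RNG advanced to the knockout stage of the S8 finals.",
    "The stock market closed higher today.",
    "Fans celebrated as RNG won the group match.",
]

def sequential_scan(docs, keyword):
    """Return (doc_index, char_offset) for every occurrence of keyword."""
    hits = []
    for i, text in enumerate(docs):
        start = 0
        while (pos := text.find(keyword, start)) != -1:
            hits.append((i, pos))
            start = pos + 1
    return hits

print(sequential_scan(documents, "RNG"))  # occurrences in docs 0 and 2
```

Every document is inspected in full on every query, which is exactly why this approach degrades as the corpus grows.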
Full-text retrieval: sequential scanning over unstructured data is slow, so can we optimize it? Can we impose some structure on the unstructured data? The idea is to extract part of the information from the unstructured data, reorganize it so that it has some structure, and then search this now-structured data, achieving relatively fast search. This is the basic idea of full-text retrieval. The information extracted from the unstructured data and reorganized is what we call the index.
Take the newspaper example again: suppose we want to follow news of the recent League of Legends S8 World Finals and, as RNG fans, quickly find the papers and sections covering RNG. The full-text approach extracts keywords from all the sections of all the papers, such as "EDG", "RNG", "FW", "team", "League of Legends", and then indexes these keywords, so that each keyword maps back to the papers and sections where it appears. (Note that this differs from a directory-style search engine.)
Why use full-text search engines
Some colleagues once asked me: why use a search engine at all? All our data lives in a database, and Oracle, SQL Server, and the like also offer query, retrieval, and even clustering analysis; can't we just query the database directly? Indeed, most query needs can be met by the database, and if a query is slow we can speed it up by adding indexes, optimizing the SQL, or introducing a cache to accelerate data return. If the data volume grows larger still, we can split databases and tables to spread the query load.
So why do we want a full-text search engine? Let's look at the main reasons:
Full-text indexing supports search over unstructured data, quickly finding any word or phrase that exists in unstructured text.
For example, Google's and Baidu's web search builds indexes from the keywords on each page; when we search for a keyword, every page matching that keyword in the index is returned. The same goes for common tasks such as application log search. For this kind of unstructured text, relational-database search support is poor.
Maintenance of indexes
In a traditional database, full-text retrieval is awkward, since nobody really stores large bodies of text in database fields for searching. Full-text retrieval requires scanning the whole table; even with optimized SQL, the gains are small once the data volume grows. You can build indexes, but maintaining them is cumbersome: both inserts and updates force the index to be rebuilt.
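As a rough illustration of why the relational route struggles here: a LIKE query with a leading wildcard forces the database into a full table scan even when an index exists on the column. A minimal sketch using Python's built-in SQLite (table name and data invented for the example):

```python
# A LIKE '%word%' query cannot use an ordinary B-tree index, so the
# database falls back to scanning every row. Table and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE INDEX idx_body ON articles(body)")  # unusable below
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("RNG won the match",), ("market news",), ("RNG fans cheered",)],
)

# The leading wildcard defeats the index: the query plan reports a SCAN.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM articles WHERE body LIKE '%RNG%'"
).fetchall()
print(plan)

rows = conn.execute(
    "SELECT id FROM articles WHERE body LIKE '%RNG%' ORDER BY id"
).fetchall()
print(rows)  # ids of the matching rows
```

The scan cost grows linearly with the table, which is exactly the sequential-scan problem described earlier, now inside the database.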
When to use full-text search engines:
- The search target is a large volume of unstructured text data.
- The number of records reaches hundreds of thousands, millions, or more.
- A large volume of interactive, text-based queries must be supported.
- Very flexible full-text query capabilities are required.
- There are special requirements for highly relevant search results that no relational database can satisfy.
- There are relatively few requirements for different record types, non-text data operations, or secure transaction processing.
The mainstream full-text search engines today are roughly Lucene, Solr, and Elasticsearch.
All of them build their indexes on an inverted index. So what is an inverted index?
An inverted index (also often called a postings file or inverted file) is an indexing method that, under full-text search, stores the mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document-retrieval systems.
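A toy inverted index can be sketched in a few lines of Python. The tokenization here (lowercasing and splitting on whitespace) is deliberately naive compared with what Lucene's analyzers do, and the sample documents are invented:

```python
from collections import defaultdict

# Minimal inverted index: map each word to the documents (and word
# positions) where it occurs, so a lookup skips non-matching documents.
def build_inverted_index(docs):
    index = defaultdict(list)  # word -> [(doc_id, position), ...]
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

documents = [
    "RNG beat FW in the group stage",
    "EDG and RNG both reached the quarterfinals",
    "the market rallied on Friday",
]

index = build_inverted_index(documents)
print(index["rng"])  # documents and word positions containing "rng"
```

A query now costs one dictionary lookup instead of a scan over every document; the price is the up-front work of extracting and organizing the index, which is the trade-off full-text retrieval makes.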
Lucene is a full-text search engine written entirely in Java. Lucene is not a complete application but a code library and API that can easily be used to add search functionality to applications.
Lucene provides powerful features through a simple API:
Scalable, high-performance indexing
- Over 150 GB/hour on modern hardware
- Small RAM requirements: as little as 1 MB of heap
- Incremental indexing as fast as batch indexing
- Index size roughly 20-30% of the size of the indexed text
Powerful, accurate, and efficient search algorithms
- Ranked searching: best results returned first
- Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, and more
- Fielded searching (e.g., title, author, contents)
- Sorting by any field
- Multiple-index searching with merged results
- Allows simultaneous update and searching
- Flexible faceting, highlighting, joins, and result grouping
- Fast, memory-efficient, and typo-tolerant suggesters
- Pluggable ranking models, including the vector space model and Okapi BM25
- Configurable storage engine (codecs)
- Available as open-source software under the Apache License, allowing you to use Lucene in both commercial and open-source programs
- 100%-pure Java
- Index-compatible implementations available in other programming languages
Apache Software Foundation
As an Apache Software Foundation open-source project, Lucene enjoys the support of the Apache community.
But Lucene is only a framework; to take full advantage of its functionality you must use Java and integrate Lucene into your program. That requires a great deal of study to understand how it works, and using Lucene proficiently is genuinely complex.
Apache Solr is an open-source search platform built on the Java library Lucene. It exposes Apache Lucene's search functionality in a user-friendly way. With nearly a decade in the industry, it is a mature product with a strong and broad user community. It provides distributed indexing, replication, load-balanced querying, and automatic failover and recovery. Properly deployed and well managed, it can be a highly reliable, scalable, and fault-tolerant search engine. Many internet giants, such as Netflix, eBay, Instagram, and Amazon (CloudSearch), use Solr because it can index and search across multiple sites.
The main features list includes:
- Full-Text Search
- Faceted Search
- Real-time indexing
- Dynamic clustering
- Database integration
- NoSQL features and rich document processing (such as Word and PDF files)
Elasticsearch is an open-source (Apache 2 licensed), RESTful search engine built on top of the Apache Lucene library.
As a distributed search engine, its indexes can be divided into shards, and each shard can have multiple replicas. Each Elasticsearch node can host one or more shards, and the engine acts as a coordinator, delegating operations to the correct shard.
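As a sketch of how shards and replicas are configured, an index can be created with explicit settings via Elasticsearch's REST API (a `PUT` to the index name). The index name below is invented, and defaults vary across Elasticsearch versions:

```python
import json

# Request body for creating an Elasticsearch index with 3 primary
# shards, each carrying 1 replica. It would be sent as: PUT /articles
# (the index name "articles" is made up for this example).
settings = {
    "settings": {
        "number_of_shards": 3,    # the index is partitioned into 3 shards
        "number_of_replicas": 1,  # one copy of each shard for failover
    }
}
print(json.dumps(settings, indent=2))
```

With this layout, each of the 3 primaries and its replica can live on different nodes, which is what lets the cluster survive the loss of a node and spread query load.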
Elasticsearch scales with near-real-time search. One of its key features is multi-tenancy.
The main features list includes:
- Distributed search
- Analytical search
- Grouping and aggregation
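As a sketch of the grouping-and-aggregation side, a single Elasticsearch request body can combine a full-text query with a `terms` aggregation. The field names used here are hypothetical, not taken from any real schema:

```python
import json

# Example Elasticsearch request body combining a full-text "match"
# query with a grouping ("terms") aggregation. The field names
# "title" and "team.keyword" are illustrative assumptions.
query = {
    "size": 0,                                # skip raw hits, keep aggs
    "query": {"match": {"title": "finals"}},  # full-text match
    "aggs": {
        "by_team": {                          # bucket matching docs by team
            "terms": {"field": "team.keyword"}
        }
    },
}
print(json.dumps(query))
```

Sent as the body of a search request, this returns one bucket per distinct team value with a document count, i.e. a group-by computed at search time.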
Elasticsearch vs. Solr: how to choose
Because of Lucene's complexity, it is rarely the first choice for search on its own, leaving aside companies that need to build their own search framework on top of it. So here we focus on comparing Elasticsearch and Solr.
Elasticsearch vs. Solr. Which one is better? What's the difference between them? Which one should you use?
Apache Solr is a mature project with a large, active developer and user community, as well as the Apache brand. First released as open source in 2006, Solr was long the default search engine for anyone needing search. Its maturity translates into rich functionality beyond simple text indexing and searching: faceting, grouping, powerful filtering, pluggable document processing, pluggable search-chain components, language detection, and more.
Solr dominated the search space for years. Then, around 2010, Elasticsearch appeared as another option on the market. At the time it was far less stable than Solr, lacking Solr's functional depth, mindshare, and brand.
Although Elasticsearch was young, it had advantages of its own: it was built on more modern principles, for more modern use cases, and designed to make large indexes and high query rates easier to handle. Moreover, because it was so young and had no community to answer to, it was free to move forward without seeking consensus or collaboration with others (users or developers), without preserving backward compatibility, or any of the other obligations more mature software usually carries.
As a result, it exposed some highly sought-after features before Solr did (for example, near-real-time search, or NRT). Technically, NRT search really comes from Lucene, the underlying search library used by both Solr and Elasticsearch. The irony is that because Elasticsearch exposed NRT search first, people came to associate NRT with Elasticsearch, even though Solr and Lucene are parts of the same Apache project and one would therefore have expected Solr to get such in-demand features first.
Comparison of feature differences
Both are popular, advanced open-source search engines. Both are built around the same core search library, Lucene, yet they differ. As with everything, each has advantages and disadvantages, and depending on your needs and expectations either may be the better or the worse fit. Both Solr and Elasticsearch evolve quickly, so without further ado, here is a list of their differences:
| Feature | Solr | Elasticsearch |
| --- | --- | --- |
| Community and developers | Apache Software Foundation and community support | A single commercial entity and its employees |
| Node discovery | Apache ZooKeeper, proven and battle-tested in many projects | Zen, built into Elasticsearch itself; requires dedicated master nodes for split-brain protection |
| Shard placement | Essentially static; requires manual work to migrate shards (from Solr 7 on, the Autoscaling API allows some dynamic operations) | Dynamic; shards can be moved on demand based on cluster state |
| Caches | Global; invalidated on every segment change | Per segment; better suited to dynamically changing data |
| Analytics engine performance | Well suited to precisely computed, static data | Accuracy of results depends on data placement |
| Full-text search features | Lucene-based language analysis, multiple suggesters, spell checking, rich highlighting support | Lucene-based language analysis, a single suggest API implementation, highlighting recalculation |
| DevOps friendliness | Not yet complete, but coming | Very good APIs |
| Non-flat data handling | Nested documents and parent-child support | Natural support for nested objects, allowing near-unlimited nesting, plus parent-child support |
| Query DSL | JSON (limited), XML (limited), or URL parameters | JSON |
| Index/collection leader control | Leader placement control and leader rebalancing to even out load across nodes | Not possible |
| Machine learning | Built in, on top of streaming aggregations; focused on logistic regression and a learning-to-rank contrib module | Commercial feature, focused on anomalies, outliers, and time-series data |
Beyond the table, let's compare a few more aspects:
- Popular trends in recent years
Looking at Google search trends for the two products, Google Trends suggests Elasticsearch holds greater appeal than Solr, but that does not mean Apache Solr is dead. Whatever some may think, Solr remains one of the most popular search engines, with a strong community and solid open-source backing.
Installation and Configuration
Compared to Solr, Elasticsearch is easy to install and very lightweight. You can install and run Elasticsearch in minutes.
However, this ease of deployment and use can become a problem if Elasticsearch is poorly managed. Its JSON-based configuration is simple, but if you want to annotate every configuration item in the file with comments, it is not for you.
Overall, if your application already uses JSON, Elasticsearch is the better choice. Otherwise, use Solr, because its schema.xml and solrconfig.xml are well documented.
Solr has a larger, more mature community of users, developers, and contributors. ES has a smaller but active user community and a growing community of contributors.
Solr is true open-source community code: anyone can contribute, and new Solr developers (committers) are chosen on merit. Elasticsearch is technically open source but less so in spirit: anyone can read the source, and anyone can change it and submit a contribution, but only employees of Elastic can actually commit changes to Elasticsearch.
Solr's contributors and committers come from many different organizations, while Elasticsearch's committers come from a single company.
Solr is more mature, but ES is growing fast, and in my view it is stable.
On documentation, Solr scores very high: it is a thoroughly documented product with clear examples and API use-case scenarios. Elasticsearch's documentation is well organized, but it lacks good examples and clear configuration instructions.
So, Solr or Elasticsearch?
It is sometimes hard to give a definitive answer. Whether you choose Solr or Elasticsearch, first understand your actual use cases and future needs. To summarize their respective traits:
- Thanks to its ease of use, Elasticsearch is more popular with newer developers. If you are already used to working with Solr, however, stick with it; migrating to Elasticsearch brings no specific advantage.
- If you need to handle analytical queries in addition to text search, Elasticsearch is the better choice.
- If you need distributed indexing, choose Elasticsearch. It is the better fit for cloud and distributed environments that demand good scalability and performance.
- Both have good business support (consulting, production support, integration, etc.)
- Both have good operational tooling, although Elasticsearch, with its easy-to-manage API, appeals more to the devops crowd and so has a livelier tool ecosystem growing around it.
- Elasticsearch dominates the open-source log-management use case: many organizations index their logs in Elasticsearch to make them searchable. Solr can be used for this too, but it simply missed that wave.
- Solr is still more oriented toward text search. Elasticsearch, by contrast, is often used for filtering and grouping, i.e. analytical query workloads, not necessarily text search. Its developers have put considerable effort, at both the Lucene and Elasticsearch levels, into making such queries cheaper (lower memory footprint and CPU usage). Elasticsearch is therefore the better choice for applications that need not just text search but also complex search-time aggregations.
- Elasticsearch is easier to get started with: one download and one command starts everything. Solr has traditionally required more work and knowledge, though it has recently made great strides in removing that barrier and now mostly has to shed its old reputation.
- Performance-wise, the two are roughly equal. I say "roughly" because no one has run a comprehensive, unbiased benchmark. For 95% of use cases either choice performs well; the remaining 5% should test both solutions against their specific data and access patterns.
- Operationally, Elasticsearch is relatively simple to run: it is a single process. Solr, in its fully distributed deployment mode, SolrCloud (the mode comparable to Elasticsearch), depends on Apache ZooKeeper. ZooKeeper is extremely mature and widely used, but it is still one more moving part. Then again, if your organization already runs Hadoop, HBase, Spark, Kafka, or other newer distributed software, you probably have ZooKeeper running somewhere anyway.
- Although Elasticsearch has a built-in ZooKeeper-like component, Zen, ZooKeeper is better at preventing the dreaded split-brain problem that occasionally strikes Elasticsearch clusters. To be fair, Elasticsearch's developers are aware of the problem and are working to improve this part of Elasticsearch.
- If you love monitoring and metrics, Elasticsearch will put you in heaven. It has more metrics than Times Square can squeeze in on New Year's Eve! Solr exposes the key metrics, but far fewer than Elasticsearch.
In short, both are feature-rich search engines that, when designed and implemented well, deliver more or less the same performance.