Elasticsearch Introduction *
Elasticsearch is a real-time, distributed search and analytics engine. It lets you process large-scale data at unprecedented speed.
It can be used for full-text search, structured search, and analytics, and of course you can combine all three.
Elasticsearch is built on the full-text search library Apache Lucene™, arguably the most advanced and efficient full-featured open source search engine library available today.
But Lucene is only a library. To take full advantage of it, you need to work in Java and integrate Lucene into your own application. It takes a lot of study to understand how it works, because Lucene is genuinely complex.
Elasticsearch uses Lucene as its internal engine, but for full-text search you only need to use a single, uniform API, without having to understand how the complex Lucene machinery behind it works.
Of course, Elasticsearch is more than just Lucene. Beyond full-text search, it also provides the following:
- A distributed real-time document store in which every field is indexed and searchable.
- A distributed search engine with real-time analytics.
- The ability to scale to hundreds of servers and petabytes of structured or unstructured data.
All of these features are packaged into a single server that your application can talk to through Elasticsearch's RESTful API, using a client library or any programming language you like.
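As a rough illustration of what talking to the RESTful API can look like, here is a minimal sketch in Python that builds a full-text `match` query as an HTTP request. The node address (`localhost:9200`), the index name `articles`, and the field name `title` are assumptions made for this example, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a local node listening on the default HTTP port.
ES_URL = "http://localhost:9200"


def build_search_request(index, field, text, base_url=ES_URL):
    """Build (but do not send) a full-text `match` query against one index."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "articles" and "title" are hypothetical names used only for illustration.
request = build_search_request("articles", "title", "distributed search")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run the query against a live node, something like this should work:
#   with urllib.request.urlopen(request) as response:
#       hits = json.load(response)["hits"]["hits"]
```

Because the API is plain HTTP plus JSON, the same request could be issued from any language or from a command-line HTTP client.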
Elasticsearch is very easy to get started with. It ships with sensible defaults that shield beginners from the complicated theory behind search.
It works out of the box, and you can become productive with very little learning.
As you learn more, you can take advantage of more of Elasticsearch's advanced features, and the entire engine can be configured flexibly to suit your own needs.
Use cases:
- Wikipedia uses Elasticsearch to provide full-text search with highlighted keywords, as well as search suggestions such as search-as-you-type and did-you-mean.
- The Guardian uses Elasticsearch to process visitor logs so that its editors get real-time feedback on how the public responds to different articles.
- StackOverflow combines full-text search with geolocation queries and uses more-like-this to find related questions.
- GitHub uses Elasticsearch to search more than 130 billion lines of code.
- Goldman Sachs uses it to index 5 TB of data every day, and many investment banks use it to analyze stock market movements.
But Elasticsearch is not just for large companies; it has also helped many startups, such as Datadog and Klout, to expand their capabilities.
Advantages and Disadvantages of Elasticsearch *
Advantages
- Elasticsearch is distributed. No other components are needed, and replication happens in real time, which is called "Push Replication".
- Elasticsearch fully supports Apache Lucene's near real-time search.
- Multi-tenancy requires no special configuration, whereas SOLR needs more advanced settings for it.
- Elasticsearch uses the Gateway concept, which makes full backups simpler.
- The nodes form a peer-to-peer network; when some nodes fail, other nodes are automatically assigned to take over their work.
Disadvantages
- There is only one developer (no longer accurate: the Elasticsearch GitHub organization now has more than that, including fairly active maintainers)
- No automatic warming (no longer applicable now that the new index warmup API exists)
About SOLR *
SOLR (pronounced "solar") is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (such as Word and PDF) handling. SOLR is highly scalable and provides distributed search and index replication. SOLR is the most popular enterprise-class search engine, and SOLR 4 added NoSQL support.
SOLR is a standalone full-text search server written in Java that runs in a servlet container such as Apache Tomcat or Jetty. SOLR uses the Lucene Java search library as its core for full-text indexing and searching, and exposes REST-like HTTP/XML and JSON APIs. SOLR's powerful external configuration lets it adapt to many types of applications without any Java coding. SOLR also has a plug-in architecture to support more advanced customization.
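To make the REST-like HTTP API a little more concrete, here is a minimal sketch in Python that builds a query URL for SOLR's `/select` handler and asks for a JSON response. The server address (`localhost:8983`) and the core name `articles` are assumptions made for this example; a deployment inside Tomcat would typically use a different port and path.

```python
from urllib.parse import urlencode

# Assumptions: a local SOLR server on port 8983 and a core named "articles".
SOLR_URL = "http://localhost:8983/solr"


def build_select_url(core, query, rows=10, base_url=SOLR_URL):
    """Build a `/select` URL that requests a JSON-formatted response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# "title:lucene" searches the hypothetical field `title` for the term "lucene".
print(build_select_url("articles", "title:lucene"))
```

Fetching that URL with any HTTP client returns the matching documents, which is what lets applications use SOLR without writing Java.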
In 2010 the Apache Lucene and Apache SOLR projects merged, and both are now developed by the same Apache Software Foundation team. When referring to the technology or the product, Lucene/SOLR and SOLR/Lucene mean the same thing.
Advantages and disadvantages of SOLR
Advantages
- SOLR has a larger, more mature community of users, developers, and contributors.
- Supports indexing many formats, such as HTML, PDF, Microsoft Office formats, and plain-text formats such as JSON, XML, and CSV.
- SOLR is more mature and stable.
- Searching is faster as long as no indexing is happening at the same time.
Disadvantages
- While an index is being built, search efficiency drops, and searching a real-time index performs poorly.
Comparison of Elasticsearch and SOLR *
SOLR is faster when you simply search existing data.
When indexing in real time, SOLR suffers from I/O blocking and poor query performance, and Elasticsearch has a clear advantage.
As the amount of data grows, SOLR's search efficiency drops, while Elasticsearch's does not change significantly.
In summary, SOLR's architecture is not suitable for real-time search applications.
Actual Production Environment Test *
After the search engine was switched from SOLR to Elasticsearch, the average query speed increased 50-fold.
A comparative summary of Elasticsearch and SOLR
- Both are simple to install;
- SOLR uses ZooKeeper for distributed management, while Elasticsearch has distributed coordination built in;
- SOLR supports more data formats, while Elasticsearch only supports JSON;
- SOLR officially provides more features, while Elasticsearch focuses on core functionality and leaves advanced features to third-party plug-ins;
- SOLR outperforms Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch in real-time search applications.
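To illustrate the JSON-only point in the comparison above: data in another format, such as CSV, has to be converted to JSON documents before Elasticsearch can index it. The sketch below, in Python, turns CSV rows into the newline-delimited JSON body expected by the `_bulk` endpoint. The index name `articles` and the sample columns are assumptions made for this example, and every CSV value is kept as a string.

```python
import csv
import io
import json


def csv_to_bulk_body(csv_text, index):
    """Convert CSV rows into the newline-delimited JSON body of a `_bulk` request."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself, one JSON object per CSV row.
        lines.append(json.dumps(row))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


# Hypothetical sample data; CSV values stay strings unless converted explicitly.
sample = "title,views\nIntro to Lucene,120\nScaling search,87\n"
print(csv_to_bulk_body(sample, "articles"), end="")
```

The resulting text would be sent as the body of a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header.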
SOLR is a powerful solution for traditional search applications, but Elasticsearch is more suitable for emerging real-time search applications.
Other Lucene-based open source search engine solutions *
- Direct use of Lucene
Description: Lucene is a Java search class library. It is not a complete solution in itself and requires additional development work.
Pros: A proven solution with many success stories. It is a top-level Apache project that continues to progress rapidly, with a large and active developer community. Because it is just a class library, there is plenty of room for customization and optimization: simple customization meets most common needs, and with optimization it can support search at the scale of 1 billion+.
Cons: Additional development work is required. Scaling, distribution, reliability, and so on all have to be implemented yourself. It is not real-time: there is a time lag between indexing and being searchable, and the scalability of the current "near real time" search solution still needs improvement.
- Katta
Description: A Lucene-based, distributed, scalable, fault-tolerant, near-real-time search solution.
Pros: Works out of the box and can be combined with Hadoop for distribution. Has scaling and fault-tolerance mechanisms.
Cons: It is only a search solution; index building still has to be implemented separately. Its search features cover only the most basic requirements. There are fewer success stories and the project is less mature. Because it has to support distribution, customizing it for complex query requirements is more difficult.
- Hadoop contrib/index
Description: A distributed indexing scheme in Map/Reduce mode that can be used together with Katta.
Pros: Distributed indexing, scalability.
Cons: It is only an indexing scheme and does not include a search implementation. It works in batch mode, with poor support for real-time search.
- LinkedIn's Open Source solutions
Description: A family of Lucene-based solutions, including the near-real-time search Zoie, the facet search implementation Bobo, the machine learning algorithm library Decomposer, the summary store Krati, the database schema wrapper Sensei, and others.
Pros: Proven solutions that support distribution and scaling and offer a rich set of features.
Cons: Too tightly coupled to LinkedIn, and not very customizable.
- Lucandra
Description: Lucene-based; the index is stored in a Cassandra database.
Pros: See the advantages of Cassandra.
Cons: See the disadvantages of Cassandra. In addition, this is only a demo and has not been extensively validated.
- HBasene
Description: Lucene-based; the index is stored in an HBase database.
Pros: See the advantages of HBase.
Cons: See the disadvantages of HBase. In addition, in this implementation each Lucene term is stored as a row, while the posting list for each term is stored as columns. As the posting list of a single term grows, query speed suffers badly.
Full-text search selection: Elasticsearch and SOLR