Why do I need a search engine
The purpose of search is to let users quickly find what they need without browsing the entire site.
Results should be ordered: the more relevant a result is, the higher it should rank.
Filters should help narrow the results and improve their overall relevance.
Search must not be too slow.
Because a traditional relational database cannot solve this kind of problem well, a dedicated search engine is needed. Elasticsearch can be deployed on top of a relational database to speed up search-related SQL queries. It can also add search capabilities to NoSQL and other data sources; MongoDB, for example, is a document-oriented NoSQL database that can use Elasticsearch for faster search.
Elasticsearch is open source software built on Lucene. It uses Lucene to index and query documents, while making Lucene faster to work with and easier to scale. You interact with it through an HTTP JSON API, so your application is not restricted to the Java language.
Lucene manages documents with an inverted index, in which each word maintains a list of the documents that contain it. It works like the index at the back of a book: you look up a term, get its page numbers, and turn straight to the relevant pages.
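The idea of an inverted index can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene implements it; the sample documents and the `build_inverted_index` helper are made up for this example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical sample documents (id -> text), for illustration only.
docs = {
    1: "quick search with Elasticsearch",
    2: "Lucene powers Elasticsearch",
    3: "quick intro to Lucene",
}

def build_inverted_index(documents):
    """Map each lowercased word to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

index = build_inverted_index(docs)

# A lookup is now a single dictionary access instead of a scan over all documents.
print(index["elasticsearch"])
print(index["quick"])
```

Finding every document that contains a word no longer requires reading each document, which is the property the paragraph above describes.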
Blogs and tags are a good example. In the raw data, each blog has multiple tags. An inverted index makes each tag point to the blogs that carry it. When searching by tag, the raw data can only be scanned blog by blog, whereas the inverted index leads directly to all the corresponding blogs.

Ensure the relevance of the results
Relevance is a very important concept: it indicates whether a matching document is really about the keyword or merely mentions it. The simplest example: the more times the keyword appears in a document, the more that document probably has to do with the keyword.
By default, Elasticsearch (in the 1.x line used in this article) calculates relevance with TF-IDF (term frequency-inverse document frequency). The two parts mean:
Term frequency: how often the term appears in the document. The higher the frequency, the higher the relevance.
Inverse document frequency: how rare the term is across the other documents. The fewer other documents contain the term, the higher the relevance.
That is, a word that appears many times in one document but rarely in other documents contributes a high relevance score to that document.
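A small worked example may make this concrete. The sketch below uses the simplified textbook form, tf × log(N / df), which is an assumption made for illustration; Lucene's actual scoring applies additional normalization and is not reproduced here. The documents and function names are hypothetical.

```python
import math

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["elasticsearch", "search", "elasticsearch"],
    ["search", "engine"],
    ["database", "search"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / df), or 0.0 if the term never occurs."""
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    """Simplified textbook TF-IDF score of `term` for one document."""
    return tf(term, doc) * idf(term, corpus)

# "elasticsearch" is frequent in the first document and absent elsewhere,
# while "search" appears in every document and therefore carries no weight.
print(tf_idf("elasticsearch", docs[0], docs))
print(tf_idf("search", docs[0], docs))
```

In this toy corpus the term that occurs everywhere scores zero, while the term concentrated in a single document scores highest.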
Elasticsearch also provides other ways to influence relevance, such as boosting the weight of a field or even using a script to compute a custom relevance score. This covers almost any need, whether you want blogs with the keyword in the title to rank higher, blogs with more likes to rank higher, or newer blogs to rank higher.

Not just exact matches
Elasticsearch also provides configuration options to handle misspellings and derived word forms (singular and plural, different tenses, and so on), which improves matching accuracy. It can also support keyword suggestions.

Elasticsearch Usage Scenarios
As the primary data source
Search engines are generally built on top of other data sources to provide a better search experience. This is because most earlier search engines could not provide reliable storage or some commonly needed functions, such as statistics.
Today Elasticsearch provides these features, so it can be used directly as a database, though only in certain scenarios.
For example, a blog application, which has no complex relationships and is not sensitive to transactions, is an ideal fit for Elasticsearch.
When a new article is created it is indexed into Elasticsearch, and article content is then retrieved through queries. Whether it is a simple primary-key lookup, a more complex tag or category query, or full search, all are easy to implement. Aggregations and relevance tuning can even provide more complex features such as tag statistics and hot articles.

As a secondary data source
Elasticsearch is not suitable for every scenario. For example, it has no concept of a transaction, and it does not perform well with join-style queries across related data. In most cases it therefore supplements an existing primary data source, providing search and real-time analytics.
When working with multiple data sources, you must keep the data synchronized between them. You can usually do this with an existing plug-in or by writing the synchronization yourself.
Out-of-the-box solutions
Elasticsearch owes much of its popularity to ELK (Elasticsearch, Logstash, Kibana), a widely used log analytics stack. Logstash collects the logs, Elasticsearch stores and indexes them, and Kibana provides a user-friendly web interface for displaying search results. This gives you a powerful log analysis system without writing any code.

Advantages of Elasticsearch
Elasticsearch provides a REST API that allows developers to easily search for documents through JSON-structured query statements, or to modify configuration information.
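As a rough illustration of a JSON-structured query, the sketch below builds a search request using only the Python standard library. The index name `blog`, the field `title`, and the helper `build_search_request` are assumptions made for this example; the request is only constructed, not sent, so it does not need a running node.

```python
import json
import urllib.request

def build_search_request(base_url, index, field, text):
    """Build (but do not send) a search request carrying a JSON `match` query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical index and field names.
req = build_search_request("http://localhost:9200", "blog", "title", "elasticsearch")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Against a running node, passing `req` to `urllib.request.urlopen` would send the query and return the matching documents as JSON.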
On top of Lucene, Elasticsearch also provides a number of more advanced features, such as caching, real-time analysis, aggregation, and statistics. Document management is also more flexible: a single query can search multiple indexes at the same time.
Finally, Elasticsearch scales well. It supports clustering by default (even a single running node forms a cluster), makes it easy to add nodes for capacity expansion and disaster recovery, and lets you remove nodes when needed to save costs.

Install Elasticsearch
Install Java
Elasticsearch is written in Java, so you need to install a JRE first; the details are not covered here.
Elasticsearch looks for Java in two places: the JAVA_HOME environment variable and the system path. Environment variables can be viewed with env (on Unix-like systems) or set (on Windows), and running java -version at the command line shows whether Java is on the system path.

Install Elasticsearch
Installing Elasticsearch is very simple: download the tar.gz package from the official website, unpack it, and run the startup script:

    tar zxf elasticsearch-*.tar.gz
    cd elasticsearch-*
    bin/elasticsearch
View Startup Log
At startup, some log lines are printed to the command line.
The node's version, PID, name, and other information. By default Elasticsearch gives the node a random name (here it is Answer):

    [2016-05-03 17:24:15,032][INFO ][node] [Answer] version[1.7.1], pid[21122], build[b88f43f/2015-07-29T09:54:16Z]
The plug-ins that were loaded:

    [2016-05-03 17:24:15,233][INFO ][plugins] [Answer] loaded [analysis-ik, marvel], sites [marvel]
The port for internal node communication is 9300:

    [2016-05-03 17:24:26,456][INFO ][transport] [Answer] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.222:9300]}
This node is elected as the master node:

    [2016-05-03 17:24:30,247][INFO ][cluster.service] [Answer] new_master [Answer][bzta5mlnqw6-obfcjn4t7w][localhost.localdomain][inet[/192.168.1.222:9300]], reason: zen-disco-join (elected_as_master)
The HTTP communication port is 9200:

    [2016-05-03 17:24:30,371][INFO ][http] [Answer] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.222:9200]}
The node has started:

    [2016-05-03 17:24:30,372][INFO ][node] [Answer] started
Indices recovered from the gateway; on the first start this is always 0:

    [2016-05-03 17:24:30,702][INFO ][gateway] [Answer] recovered [0] indices into cluster_state
Trying to interact
Once the node has started successfully, you can interact with it through the REST API. A request to port 9200 returns the node information in JSON format:
    curl http://localhost:9200

    {
      "status" : 200,
      "name" : "St. John Allerdyce",
      "cluster_name" : "elasticsearch",
      "version" : {
        "number" : "1.7.1",
        "build_hash" : "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
        "build_timestamp" : "2015-07-29T09:54:16Z",
        "build_snapshot" : false,
        "lucene_version" : "4.10.4"
      },
      "tagline" : "You Know, for Search"
    }
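The same request can be made from application code. The following is a minimal sketch using only the Python standard library; `fetch_node_info` and `describe_node` are hypothetical helper names, and the sample data is a trimmed copy of the response shown above so that the example runs without a node.

```python
import json
import urllib.request

def fetch_node_info(base_url="http://localhost:9200"):
    """GET the root endpoint of a running node and decode its JSON reply."""
    with urllib.request.urlopen(base_url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def describe_node(info):
    """Summarize the name and version fields of a node info response."""
    version = info["version"]
    return (
        f'{info["name"]} runs Elasticsearch {version["number"]} '
        f'(Lucene {version["lucene_version"]})'
    )

# Offline check with a trimmed copy of the response shown above.
sample = json.loads(
    '{"name": "St. John Allerdyce",'
    ' "version": {"number": "1.7.1", "lucene_version": "4.10.4"}}'
)
print(describe_node(sample))
```

With a node listening on port 9200, `describe_node(fetch_node_info())` would produce the same kind of summary from the live response.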
Summary
Elasticsearch is an open source search engine based on Apache Lucene.
A typical scenario is indexing large amounts of data and efficiently performing full-text search or real-time statistics on it.
Search is not limited to full-text matching: you can change how relevance is calculated or offer search suggestions.
Running it is very simple: download the package, unpack it, and run the script.
You can use the HTTP REST API with JSON to index and query data and to modify cluster settings.
It can also be used as a document-oriented NoSQL database with real-time search and analysis.
It automatically distributes data evenly across shards, so the cluster is easy to scale horizontally by adding nodes, and shards are replicated to improve fault tolerance.
Reference: http://www.scienjus.com/elasticsearch-in-action-1/