Comparison between Solr full-text search and MySQL Query performance

Source: Internet
Author: User
Tags json odbc mysql query solr solr query

Test data volume: 10407608
Num Docs: 10407608

One of the most common queries in a project is to query data within a certain period of time, and obtain data through SQL query for about 30 s.

SELECT * FROM 'Tf _ hotspotdata_copy_test 'WHERE collectTime BETWEEN '2017-12-06 00:00:00' AND '2017-12-10 21:31:55 ';

After indexes are created for collectTime, the same query speed is 2 s, which is much faster.

Solr index



Solr query, the same condition, 72 ms

"Status": 0,

"QTime": 72,

"Params ":{

"Indent": "true ",

"Q": "CollectTime: [2014-12-06T00: 00: 00.000Z TO 2014-12-10T21: 31: 55.000Z]",

"_": "1434617215202 ",

"Wt": "json"

    }

Well, the query performance is not improved a little bit. Try Solrj code:


SolrQuery params = new SolrQuery (); params. set ("q", timeQueryString); params. set ("fq", queryString); params. set ("start", 0); params. set ("rows", Integer. MAX_VALUE); params. setFields (retKeys); QueryResponse response = server. query (params );



Solrj queries and obtains the result set. The result set size is 220296, and five fields are returned. The time is about 12 s.

Why does it take so long? The above "QTime" is only based on the index query time. To obtain the query result set from the solr server, solr needs to read the stored field (disk IO ), it is also time-consuming to transmit data to a local machine (network I/O) over Http, especially disk I/O.

Time comparison:

Query condition Time
MySQL (no index) 30 s
MySQL (indexed) 2 s
Solrj (select query) 12 s

How to optimize it? Let's take a look at the time required to get the ID only:

Only id is returned for SQL query, no index is created for collectTime, about 10 s

SELECT id FROM 'Tf _ hotspotdata_copy_test 'WHERE collectTime BETWEEN '2017-12-06 00:00:00' AND '2017-12-10 21:31:55 ';

Only id is returned for SQL query. The same query conditions are used to create an index for collectTime, 0.337 s, and very fast.

Solrj queries only return ids, about 7 s, a little faster.

Id Size: 220296

Time: 7340

Time comparison:

Query condition (get ID only) time
MySQL (no index) 10 s
MySQL (with indexes) 0.337 s
Solrj (select query) 7 s

Continue optimization ..

Solrj is slow in obtaining a large number of result sets:

Http://stackoverflow.com/questions/28181821/solr-performance #

Http://grokbase.com/t/lucene/solr-user/11aysnde25/query-time-help

Http://lucene.472066.n3.nabble.com/Solrj-performance-bottleneck-td2682797.html

There is no good solution to this problem. The basic suggestions are paging, but we need to obtain a large amount of data for comparative analysis, and paging is meaningless.

By default, solr queries use "/select" request handler. You can use "/export" request handler to obtain the export result set to see how solr describes it:

It's possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records. this uses a stream sorting techniquethat begins to send records within milliseconds and continues to stream results until the entire result set has been sorted and exported.

Solr has defined this requestHandler:


{! Xport} xsort false query




Use the/export field to create an index using docValues:



To use docValues, you must have a field for Sort and only the following types are supported:

Sort fields must be one of the following types: int, float, long, double, string

Returned fields supported by docValues:

Export fields must either be one of the following types: int, float, long, double, string

Use Solrj to query and obtain data:


SolrQuery params = new SolrQuery (); params. set ("q", timeQueryString); params. set ("fq", queryString); params. set ("start", 0); params. set ("rows", Integer. MAX_VALUE); params. set ("sort", "id asc"); params. setHighlight (false); params. set ("qt", "/export"); params. setFields (retKeys); QueryResponse response = server. query (params );




A Bug:

Org. apache. solr. client. solrj. impl. httpSolrClient $ RemoteSolrException: Error from server at http: // 192.8.125.30: 8985/solr/hotspot: Expected mime type application/octet-stream but got application/json.

Solrj cannot parse the result set correctly. After reading the source code, the reason is that the ContentType returned by Solr server is inconsistent with that returned by Solrj during resolution. The CONTENT_TYPE of Solrj's BinaryResponseParser is fixed to death:

Public class BinaryResponseParser extends ResponseParser {
Public static final String BINARY_CONTENT_TYPE = "application/octet-stream ";

At half past one, I don't know how to solve this Bug. I 'd like to write an Http request and get the result myself. I wrote a simple client request with HttpClient and parsed the json file to get the data. The test speed is as follows:

String url = "http: // 192.8.125.30: 8985/solr/hotspot/export? Q = CollectTime % 3A [2014-12-06T00% 3A00% 3A00. 000Z + TO + 2014-12-10T21% 3A31% 3A55. 000Z] & sort = id + asc & fl = id & wt = json & indent = true "; long s = System. currentTimeMillis (); SolrHttpJsonClient client = new SolrHttpJsonClient (); SolrQueryResult result = client. getQueryResultByGet (url); System. out. println ("Size:" + result. getResponse (). getNumFound (); long e = System. currentTimeMillis (); System. out. println ("Time:" + (e-s ));




If the same query condition is used to obtain 220296 result sets, the query time is about 2 seconds. The efficiency of this query is similar to that of MySQL after the index is created. It is currently acceptable.

Because Export fields only supports int, float, long, double, and string types. If your query results only contain these types of fields, query and obtain data in this way, the speed is much faster.

The following is the speed comparison between Solr's use of "/select" and "/export.

Time comparison:

Query condition Time
MySQL (no index) 30 s
MySQL (indexed) 2 s
Solrj (select query) 12 s
Solrj (export query) 2 s

If paging query is used in the project, select is used. If a large amount of query data is to be obtained at one time, export is used. MySQL is not used to create an index on the query field, because the amount of data is still increasing every day, when the data volume reaches hundreds of millions, the index cannot solve the problem well, and there are other query requirements in the project.

Let's look at another query requirement. Assume that we want to count the distribution of data on each device (deviceID:

To use SQL, 33 s is required:

SELECT deviceID, Count (*) FROM 'Tf _ hotspotdata_copy_test 'group by deviceID;

For the same query, after the index is created for CollectTime, it takes only 14 s.

Look at the Solr Facet query, as long as 540 ms, not a little faster.

SolrQuery query = new SolrQuery (); query. set ("q", "*: *"); query. setFacet (true); query. addFacetField ("DeviceID"); QueryResponse response = server. query (query); FacetField idFacetField = response. getFacetField ("DeviceID"); List idCounts = idFacetField. getValues (); for (Count count: idCounts) {System. out. println (count. getName () + ":" + count. getCount ());}



Time comparison:

Query condition (statistics) time
MySQL (no index) 33 s
MySQL (indexed) 14 s
Solrj (Facet query) 0.54 s

If we want to query the data statistics of a device over a certain period of time by hour, week, month, or year, Solr is also very convenient, for example, you can use the following daily statistics for data on the device number 1013:


String startTime = "00:00:00"; String endTime = "21:31:55"; SolrQuery query = new SolrQuery (); query. set ("q", "DeviceID: 1013"); query. setFacet (true); Date start = DateFormatHelper. toSolrSearchDate (DateFormatHelper. stringToDate (startTime); Date end = DateFormatHelper. toSolrSearchDate (DateFormatHelper. stringToDate (endTime); query. addDateRangeFacet ("CollectTime", start, end, "+ 1DAY"); QueryResponse response = server. query (query); List dateFacetFields = response. getFacetRanges (); for (RangeFacet facetField: dateFacetFields {List dateCounts = facetField. getCounts (); for (org. apache. solr. client. solrj. response. rangeFacet. count count: dateCounts) {System. out. println (count. getValue () + ":" + count. getCount ());}}




2014-12-06T00: 00: 00Z: 58

2014-12-07T00: 00: 00Z: 0

2014-12-08T00: 00: 00Z: 0

2014-12-09T00: 00: 00Z: 0

2014-12-10T00: 00Z: 3707

2014-12-11T00: 00: 00Z: 8384

2014-12-12T00: 00: 00Z: 7803

2014-12-13T00: 00: 00Z: 2469

2014-12-14T00: 00: 00Z: 142

2014-12-15T00: 00: 00Z: 34

2014-12-16T00: 00: 00Z: 0

Time: 662

 

Horizontal split table:

Because the large amount of data collected by the system is closely related to the "time", some business needs also query a lot based on the "time". You can split the table by the "time" field, for example, you can split a table every month. However, you need to do more things for the application-layer code, and more work is required for cross-table queries. After considering the workload of table splitting and using Solr for index query, Solr is used.


Conclusion: based on MySQL, combined with Lucene, Solr, ElasticSearch and other search engines, query performance such as full-text search and classified statistics can be improved.







Comparison between Solr and Sphinx

Solr version: 4.8.1
Coreseek version: 3.2.14 (based on sph00000.9.9 release)

1. Index efficiency
Sphinx: 10-15 MB/second. During actual measurement, up to 0.1 billion GB of text can be indexed. A single index can contain records.
Solr: 10 MB/second, with hundreds of millions of data records and 20 GB index. Sina also uses lucene for retrieval.
2. Search performance
Sphinx: high-performance Search. On 2-4 GB of text data, the average response time for each search is less than 0.1 seconds. Search for 1g of text and 1 million documents, supporting up to 150-4 seconds ~ 250 queries;
Solr: high-performance search, 8 GB index file, the retrieval response time is 150 ms. 500 concurrent requests per second during peak hours.
3. Scalability
Sphinx: High Scalability
Solr: High Scalability
4. Relevance algorithms
Sphbenchmark: a combined Ranking method based on phrase similarity and statistics (BM25). It supports dynamic score calculation.
Solr: supports dynamic score calculation and index-based score calculation.
5. Support for phrase search
Supported
Supported
6. Whether distributed search is supported
Supported
Supported
7. Whether the Mysql engine is supported
Supported
Supported
8. Whether full-text search of multiple fields is supported
Supported
Supported
9. Data interaction
Http request, Xml, Json
Http request, Xml, Json
10. Data source
Sphinx: mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc, python
Solr: mysql, Xml, json, csv, odbc
11. Multi-Index
Supported
Supported
12. Multi-core (different multi-index removal)
Sphenders: not supported
Solr: Supported

13. Comparison of Chinese word segmentation
Currently, sphsf-only supports the mmseg3 and sphsf-for chinese word segmentation. Currently, mmseg3 is widely used. The dictionary of mmseg3 needs to be pre-compiled, which is not conducive to the expansion of the dictionary.
Solr currently supports many word libraries. Currently, it supports Ding, IK, and mmsegj4. By comparing word segmentation effects, IK performs better in word segmentation detection and accuracy detection. IK word segmentation supports deprecated words and extended word libraries. Each word in the dictionary can be expanded in one row.
14. Support for data indexing from the Mysql database
You can configure the data source to index data directly from mysql. When the data volume is large, you can configure the primary index + incremental index to synchronize data. However, the primary index must be based on the primary key Id of the Mysql table to set the index range. For example, if the Id is less than 9000,000 and the Id is greater than 9000,000, incremental indexes are used for synchronization. This only adapts to frequent data updates with IDs less than 9000,000.
If the primary index and the incremental index are distinguished by the last update time of the record, because the primary index and the incremental index use two different index files, this will cause data inconsistency between the primary index and the incremental index, resulting in records that should not have occurred during retrieval. In this case, you need to merge the incremental index into the primary index. If the data of the primary index is large, you are not sure how long the merge takes.
When using solr to index mysql data, you can configure it as the primary index + incremental index. The primary index and incremental index are separated by the updated timestamp. solr records the timestamp of each update. Because solr incremental indexes and incremental indexes use an index file, they are automatically merged into the final index when performing incremental indexes.
15. Support for real-time retrieval
Sphek itself does not support Chinese word segmentation. Coreseek is currently the most popular sphek full-text search. It provides LibMMSeg, a Chinese word segmentation package designed for sphinx. Coreseek's current stable version is 3.2.14 (developed based on sph00000.9.9 release). This version does not support real-time retrieval. Currently, Coresekk4.1 is still a test version. The test version supports real-time retrieval, but it is not stable.
Solr indexes support real-time addition, update, and deletion. You can perform incremental update based on the last update time of the record. When there is not much incremental update data, you can set an incremental update task to be updated once every 10 seconds. To achieve real-time data synchronization.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.