Summary: This article summarizes the author's experience of using Solr as a search engine to build indexes over large data volumes in a project. It covers using DataImportHandler to synchronize data from the database in near real time, tests of Solr's index-creation performance, and tests of Solr's search performance. The reader is assumed to already be familiar with search-engine concepts, how to set up Solr, and how to use MySQL. All operations described here were performed on Linux.

1. Solr

1.1 Speed of reading data from the database and creating the index (using DataImportHandler)
- Full (one-time) index creation
A Java heap exception occurred with the JVM memory configured at 256 MB. After increasing the JVM memory to 512 MB by setting the system environment variable JAVA_OPTS=-Xms256m -Xmx512m, 2,112,890 records could be indexed successfully (taking 2m 46s).
The average index-creation speed was 12,728 documents/s (two string fields, each about 20 characters long).
- Incremental index creation
Note: Near real-time incremental indexing requires the clock of the database (write) server to be synchronized with the clock of the search-engine server (the database server's time should be earlier than the search-engine server's time).
Creating the incremental index with the default DIH configuration is much slower (50/s to 400/s) than the full index (about 10,000/s), because it reads from the database more than once: (1) it first fetches the IDs to be updated, and (2) it then goes back to the database once per ID to fetch all columns.
Therefore, the DIH incremental-import configuration should be changed to read data the way the full import does, i.e. fetch all columns in a single query. The specific configuration file is as follows:
  <?xml version="1.0" encoding="UTF-8"?>
  <dataConfig>
    <dataSource name="MySQLServer"
                type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                batchSize="-1"
                url="jdbc:mysql://192.103.101.110:3306/locationplatform"
                user="lpuser"
                password="jlitpassok"/>
    <document>
      <entity name="locatedentity" pk="id"
              query="select id,time from locationplatform.locatedentity where isdelete=0 and my_date > '${dataimporter.last_index_time}'"
              deletedPkQuery="select id from locationplatform.locatedentity where isdelete=1 and my_date > '${dataimporter.last_index_time}'"
              deltaQuery="select -1 id"
              deltaImportQuery="select id,time from locationplatform.locatedentity where isdelete=0 and my_date > '${dataimporter.last_index_time}'">
        <field column="id" name="id"/>
        <field column="time" name="time"/>
      </entity>
    </document>
  </dataConfig>
With this configuration, the incremental index can reach about 9,000 documents/s (two string fields). The timestamp column is indexed in the database, so the time comparison has little impact on performance.
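For near real-time synchronization the delta-import command still has to be triggered periodically against the dataimport handler. Below is a minimal sketch of such a trigger in Java; the host, port, core name (collection1) and polling interval are assumptions for illustration and not part of the author's setup.

```java
import java.io.InputStream;
import java.net.URL;

/**
 * Minimal sketch: periodically trigger a DIH delta-import over HTTP.
 * Host, port, core name and interval are assumed values.
 */
public class DeltaImportTrigger {
    // Hypothetical Solr core URL; adjust to your deployment.
    private static final String DELTA_IMPORT_URL =
            "http://localhost:8983/solr/collection1/dataimport?command=delta-import&commit=true";

    public static void main(String[] args) throws Exception {
        while (true) {
            try (InputStream in = new URL(DELTA_IMPORT_URL).openStream()) {
                // Drain the response; DIH runs the delta-import asynchronously on the server side.
                while (in.read() != -1) { /* ignore body */ }
            }
            Thread.sleep(10_000); // poll every 10 seconds (assumed interval)
        }
    }
}
```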
Note: the author does not recommend using DataImportHandler; there are other, better and more convenient implementations that can be used instead.
1.2 Solr index-creation efficiency
- ConcurrentUpdateSolrServer works over HTTP; the embedded method is not recommended. ConcurrentUpdateSolrServer does not require an explicit commit; data is added with solrServer.add(doc). It is constructed as SolrServer solrServer = new ConcurrentUpdateSolrServer(solrUrl, queueSize, threadCount), and it needs to be used together with the autoCommit and autoSoftCommit settings. The commonly recommended configuration is as follows:
  <autoCommit>
    <maxTime>100000</maxTime>   <!-- hard commit; 1-10 min is typical -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>     <!-- soft commit every 1 s -->
  </autoSoftCommit>
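As a minimal illustration of this indexing path, the sketch below adds documents through ConcurrentUpdateSolrServer without calling commit() explicitly, relying on the autoCommit/autoSoftCommit settings above. The URL, queue size, thread count and field names are assumptions, not the author's exact test code.

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Assumed URL, queue size and thread count; tune them for your hardware.
        ConcurrentUpdateSolrServer solrServer =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 20);

        for (int i = 0; i < 1_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));
            doc.addField("time", System.currentTimeMillis()); // hypothetical field
            solrServer.add(doc);  // no explicit commit: autoCommit/autoSoftCommit handle visibility
        }

        solrServer.blockUntilFinished(); // wait for the internal queue to drain
        solrServer.shutdown();
    }
}
```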
The test documents have 17 fields of various types (the original plain text is about 200 B per document; the corresponding SolrInputDocument object is about 930 B). The index is configured so that only the id field is stored and every field is indexed.
If you need the specific test code, you can contact me.
- Test environment: 17 fields, quad-core CPU, 16 GB memory, gigabit network
| Data volume (×10,000 docs) | Threads | Queue size | Time (s) | Network (MB/s) | Rate (×10,000 docs/s) |
|---|---|---|---|---|---|
| 200 | 20 | 10000 | 88 | 10.0 | 2.27 |
| 200 | 20 | 20000 | 133 | 9.0 | 1.50 |
| 200 | 40 | 10000 | 163 | 10.0 | 1.22 |
| 200 | 50 | 10000 | 113 | 10.5 | 1.76 |
| 200 | 100 | 10000 | 120 | 10.5 | 1.67 |
- Speed: Solr's index-creation speed is positively correlated with the CPU of the Solr machine. In general CPU utilization can reach nearly 100%, memory usage reaches nearly 100% with the default settings, and network and disk utilization stay low. The efficiency bottleneck for index creation is therefore CPU and memory. When memory usage stays near 100% and the index size approaches the physical memory size, inserting new data easily triggers OOM errors; in that case run ulimit -v unlimited to set the virtual memory limit to unlimited before starting Solr, and the OOM error no longer occurs. On 64-bit systems the official recommendation is to use MMapDirectory.
- NRTCachingDirectory is slow: index additions stall for a period of time, during which the index size first grows and then shrinks; once it has shrunk, adding to the index resumes.
- Size: an index of 100 million documents is about 13-16 GB; an index of 200 million documents is about 30 GB.
1.3 Solr search syntax
- Intersection: {name:Baidu AND address:Haidian}, {text:Haidian AND Baidu}.
- Union: {name:Baidu OR address:Haidian}, {text:Haidian OR Baidu}.
- Exclusion: {text:Haidian -Baidu}.
- Wildcard: {bank:China*Bank}.
- Range: {num:[30 TO 60]}.
- Paging: start, rows.
- Sorting: sort.
- Grouping, boosting, Chinese word segmentation, etc. (see the SolrJ sketch after this list).
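The following is a minimal SolrJ sketch showing how queries like the ones above can be issued programmatically, including paging and sorting; the URL, core name and field names are assumptions for illustration only.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and core name.
        HttpSolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Intersection query on two fields, with paging and sorting.
        SolrQuery query = new SolrQuery("name:Baidu AND address:Haidian");
        query.setStart(0);                            // paging: offset
        query.setRows(10);                            // paging: page size
        query.addSort("time", SolrQuery.ORDER.desc);  // sorting (hypothetical field)

        QueryResponse response = solrServer.query(query);
        System.out.println("Found " + response.getResults().getNumFound() + " documents");

        solrServer.shutdown();
    }
}
```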
1.4 Search speed on 100-million-scale data
- The tests in this section are based on the index created in section 1.2.
- Exact search
| Data volume (×100 million docs) | Number of fields | Field type | Time (ms) |
|---|---|---|---|
| 1 | 1 | long | 1 |
| 1 | 1 | double | 80-1400 |
| 1 | 1 | string | 7-800 |
| 1 | 1 | date | 2-400 |
| 1 | 2 (OR) | long | 2 |
| 1 | 2 (OR) | double | 200-2400 |
| 1 | 2 (OR) | string | 500-1000 |
| 1 | 2 (OR) | date | 5-500 |
- Fuzzy search
| Data volume (×100 million docs) | Number of fields | Field type | Time (ms) |
|---|---|---|---|
| 1 | 1 | long | 2000-10000 |
| 1 | 1 | double | 1000-17000 |
| 1 | 1 | string | 20-16000 |
| 1 | 1 | date | / |
| 1 | 2 (OR) | long | 3000-25000 |
| 1 | 2 (OR) | double | 7000-45000 |
| 1 | 2 (OR) | string | 3000-48000 |
| 1 | 2 (OR) | date | / |
- Range search
| Data volume (×100 million docs) | Number of fields | Field type | Time (ms) |
|---|---|---|---|
| 1 | 1 | long | 6-46000 |
| 1 | 1 | double | 80-11000 |
| 1 | 1 | string | 7-3000 |
| 1 | 1 | date | 1000-2000 |
| 1 | 2 (OR) | long | 100-13000 |
| 1 | 2 (OR) | double | 100-60000 |
| 1 | 2 (OR) | string | 3000-13000 |
| 1 | 2 (OR) | date | 7000-10000 |
- Conclusions:
The larger the query range and the more result documents returned, the longer the search takes.
The first search is slower; subsequent searches take less time.