A Summary of Experience Using Solr

Source: Internet
Author: User
Tags: solr

Summary: In a project we used Solr as the search engine to build indexes over a large volume of data. This article summarizes that experience: using DataImportHandler to synchronize data from the database in near real time, testing how fast Solr builds indexes, and testing Solr's search performance. It assumes the reader is already familiar with search-engine concepts, with setting up Solr, and with using MySQL. All operations in this article were performed on Linux.

1. Solr

1.1 Speed of reading data from the database and building the index (using DataImportHandler)

- One-time (full) index creation

With the JVM memory configured at 256 MB, a Java heap exception occurred. After increasing the JVM memory to 512 MB by setting the environment variable JAVA_OPTS="-Xms256m -Xmx512m", 2,112,890 documents were indexed successfully in 2 min 46 s.

The average index-creation speed is about 12,728 documents/s (two string fields, each roughly 20 characters long).

- Incremental index creation

Note: Near-real-time incremental indexing requires the clock of the database (write) server to be synchronized with the clock of the search-engine server (the database server's time should be no later than the search-engine server's time).

Using the default DIH configuration, building the incremental index is much slower (50/s to 400/s) than the full import (about 10,000/s), because it has to query the database more than once (1. fetch the IDs to be updated; 2. for each ID, fetch all of its columns from the database).

Therefore, you need to change the DIH incremental-import queries so that they read data the same way the full import does, i.e. fetch all the columns in a single query. The configuration file looks like this:

<?xml version= "1.0" encoding= "UTF-8"?>

<dataConfig>

<datasource name= "MySQLServer"

Type= "Jdbcdatasource"

Driver= "Com.mysql.jdbc.Driver"

Batchsize= "-1"

Url= "Jdbc:mysql://192.103.101.110:3306/locationplatform"

User= "Lpuser"

password= "Jlitpassok"/>

<document>

<entity name= "locatedentity" pk= "id"

query= "Select Id,time from Locationplatform.locatedentity where isdelete=0 and My_date > ' ${dataimporter.last_index_ Time} ' "

deletedpkquery= "SELECT ID from locationplatform.locatedentity where isdelete=1 and My_date > ' ${dataimporter.last_ Index_time} ' "

deltaquery= "Select-1 ID"

deltaimportquery= "Select Id,time from Locationplatform.locatedentity where isdelete=0 and My_date > ' ${ Dataimporter.last_index_time} ' ">

<field column= "id" name= "id"/>

<field column= "Time" name= "Time"/>

</entity>

</document>

</dataConfig>

With this configuration, incremental indexing reaches about 9,000 documents/s (two string fields). The time column used in the delta condition is indexed in the database, so the delta queries have little impact on database performance.
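For reference, the delta import defined above is usually triggered through DIH's HTTP command interface. Assuming the handler is registered at the conventional /dataimport path (host, port, and core name below are placeholders), a cron job or external scheduler can periodically issue a request such as:

    http://localhost:8983/solr/<core>/dataimport?command=delta-import

to keep the index synchronized in near real time.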

Note: The author does not recommend using DataImportHandler; there are other better and more convenient implementations available.

1.2 Solr index-creation performance

- ConcurrentUpdateSolrServer works over HTTP; the embedded approach is not recommended. ConcurrentUpdateSolrServer does not require an explicit commit: data is simply added with solrServer.add(doc). It is created with SolrServer solrServer = new ConcurrentUpdateSolrServer(solrUrl, queueSize, threadCount), and it needs to be used together with the autoCommit and autoSoftCommit settings. The commonly recommended configuration is as follows (a short SolrJ sketch follows the configuration):

<autoCommit>
    <maxTime>100000</maxTime>   <!-- commonly recommended: 1-10 min -->
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>1000</maxTime>     <!-- about 1 s -->
</autoSoftCommit>
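Below is a minimal SolrJ 4.x sketch of the usage described above. The Solr URL, queue size, thread count, and field names are illustrative assumptions, not the author's actual test code.

    // Minimal SolrJ 4.x sketch; URL, sizes and field names are assumptions for illustration.
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class BulkIndexSketch {
        public static void main(String[] args) throws IOException, SolrServerException {
            String solrUrl = "http://localhost:8983/solr/collection1";   // assumed core name
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer(solrUrl, 10000, 20);  // queue size, thread count

            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", String.valueOf(i));
                doc.addField("name", "document " + i);
                server.add(doc);   // no explicit commit: autoCommit/autoSoftCommit make the data visible
            }

            server.blockUntilFinished();   // drain the internal queue before shutting down
            server.shutdown();
        }
    }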

The test documents have 17 fields of various types (the raw plain text is about 200 B per document; the corresponding SolrInputDocument object is about 930 B). The index is built so that only the id field is stored and every field is indexed (a schema sketch of this combination follows).
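The stored/indexed combination described above corresponds to schema.xml field definitions roughly like the following; the field names and types here are illustrative assumptions, not the author's actual schema:

    <field name="id"   type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="string" indexed="true" stored="false"/>
    <field name="time" type="date"   indexed="true" stored="false"/>
    <!-- the remaining fields follow the same pattern: indexed="true", stored="false" -->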

If you need the specific test code, you can contact me.

- 17 fields, quad-core CPU, 16 GB memory, gigabit network:

Data volume (×10,000 docs) | Threads | Queue size | Time (s) | Network (MB/s) | Rate (×10,000 docs/s)
200                        | 20      | 10000      | 88       | 10.0           | 2.27
200                        | 20      | 20000      | 133      | 9.0            | 1.50
200                        | 40      | 10000      | 163      | 10.0           | 1.22
200                        | 50      | 10000      | 113      | 10.5           | 1.76
200                        | 100     | 10000      | 120      | 10.5           | 1.67

- Speed: Solr's index-creation speed is positively correlated with the CPU of the Solr machine. In general, CPU utilization can reach nearly 100%, memory usage reaches nearly 100% under the default settings, and network and disk usage stay low. The bottleneck for index creation is therefore CPU and memory. When memory usage stays near 100% and the index size reaches the size of physical memory, inserting new data is prone to OOM errors; in that case run ulimit -v unlimited to set the virtual memory limit to unlimited before starting Solr, and the OOM error no longer occurs. On 64-bit systems, the official recommendation is to use MMapDirectory.
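For reference, the directory implementation is selected in solrconfig.xml through the directoryFactory element; a minimal sketch of switching to MMapDirectory (assuming no other directoryFactory is already configured) looks like this:

    <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>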

- NRTCachingDirectory is slower; at some point index additions stall for a while, during which the index size first grows and then shrinks, and once it has shrunk the additions continue.

- Size: an index of 100 million documents is about 13-16 GB; an index of 200 million documents is about 30 GB.

1.3 Solr search syntax

- Intersection: {name:Baidu AND address:Haidian}, {text:Haidian AND Baidu}.

- Union: {name:Baidu OR address:Haidian}, {text:Haidian OR Baidu}.

- Exclusion: {text:Haidian -Baidu}.

- Wildcard: {bank:China*Bank}.

- Range: {num:[30 TO 60]}.

- Paging: start, rows.

- Sorting: sort.

- Others: grouping, boosting, Chinese word segmentation, and so on (a small SolrJ sketch of several of these query forms follows this list).
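Below is a minimal SolrJ 4.x sketch of a few of these query forms. The Solr URL and the field names (name, address, text, num, time) are illustrative assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QuerySketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery();
            query.setQuery("name:Baidu AND address:Haidian");      // intersection
            // query.setQuery("name:Baidu OR address:Haidian");    // union
            // query.setQuery("text:(Haidian -Baidu)");            // exclusion
            // query.setQuery("num:[30 TO 60]");                   // range
            query.setStart(0);                                     // paging: offset of the first result
            query.setRows(10);                                     // paging: number of results per page
            query.addSort("time", SolrQuery.ORDER.desc);           // sorting

            QueryResponse response = server.query(query);
            System.out.println("numFound = " + response.getResults().getNumFound());
        }
    }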

1.4 Search speed on 100 million records

- The tests in this section are based on the index created in Section 1.2.

- Exact search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 1
1                               | 1              | double     | 80-1400
1                               | 1              | string     | 7-800
1                               | 1              | date       | 2-400
1                               | 2 (OR)         | long       | 2
1                               | 2 (OR)         | double     | 200-2400
1                               | 2 (OR)         | string     | 500-1000
1                               | 2 (OR)         | date       | 5-500

- Fuzzy search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 2000-10000
1                               | 1              | double     | 1000-17000
1                               | 1              | string     | 20-16000
1                               | 1              | date       | /
1                               | 2 (OR)         | long       | 3000-25000
1                               | 2 (OR)         | double     | 7000-45000
1                               | 2 (OR)         | string     | 3000-48000
1                               | 2 (OR)         | date       | /

- Range search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 6-46000
1                               | 1              | double     | 80-11000
1                               | 1              | string     | 7-3000
1                               | 1              | date       | 1000-2000
1                               | 2 (OR)         | long       | 100-13000
1                               | 2 (OR)         | double     | 100-60000
1                               | 2 (OR)         | string     | 3000-13000
1                               | 2 (OR)         | date       | 7000-10000

- Conclusions:

The larger the query range and the larger the result set, the longer the search takes.

The first search is slower; subsequent searches take less time.
