A Summary of Experience Using Solr

Source: Internet
Author: User
Tags: solr

Summary: In a project we used Solr as the search engine to build indexes over a large volume of data. This article summarizes that experience: using DataImportHandler to synchronize data from the database in near real time, testing how fast Solr builds indexes, and testing Solr's search performance. It assumes the reader is already familiar with search-engine concepts, with setting up Solr, and with using MySQL. All operations in this article were performed on Linux.

1. Solr

1.1 Speed of reading data from the database and building the index (using DataImportHandler)

- One-time (full) index creation

With the JVM memory configured at 256 MB, a Java heap exception occurred. After increasing the JVM memory to 512 MB by setting the environment variable JAVA_OPTS="-Xms256m -Xmx512m", 2,112,890 documents were indexed successfully in 2 min 46 s.

The average index-creation speed is about 12,728 documents/s (two string fields, each roughly 20 characters long).

- Incremental index creation

Note: Near-real-time incremental indexing requires the clock of the database (write) server to be synchronized with the clock of the search-engine server (the database server's time should be no later than the search-engine server's time).

Using the default DIH configuration, building the incremental index is much slower (50/s to 400/s) than the full import (about 10,000/s), because it has to query the database more than once (1. fetch the IDs to be updated; 2. for each ID, fetch all of its columns from the database).

Therefore, you need to change the DIH incremental-import queries so that they read data the same way the full import does, i.e. fetch all the columns in a single query. The configuration file looks like this:

<?xml version= "1.0" encoding= "UTF-8"?>

<dataConfig>

<datasource name= "MySQLServer"

Type= "Jdbcdatasource"

Driver= "Com.mysql.jdbc.Driver"

Batchsize= "-1"

Url= "Jdbc:mysql://192.103.101.110:3306/locationplatform"

User= "Lpuser"

password= "Jlitpassok"/>

<document>

<entity name= "locatedentity" pk= "id"

query= "Select Id,time from Locationplatform.locatedentity where isdelete=0 and My_date > ' ${dataimporter.last_index_ Time} ' "

deletedpkquery= "SELECT ID from locationplatform.locatedentity where isdelete=1 and My_date > ' ${dataimporter.last_ Index_time} ' "

deltaquery= "Select-1 ID"

deltaimportquery= "Select Id,time from Locationplatform.locatedentity where isdelete=0 and My_date > ' ${ Dataimporter.last_index_time} ' ">

<field column= "id" name= "id"/>

<field column= "Time" name= "Time"/>

</entity>

</document>

</dataConfig>

With this configuration, incremental indexing reaches about 9,000 documents/s (two string fields). The time column used in the delta condition is indexed in the database, so the delta queries have little impact on database performance.
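For reference, the delta import defined above is usually triggered through DIH's HTTP command interface. Assuming the handler is registered at the conventional /dataimport path (host, port, and core name below are placeholders), a cron job or external scheduler can periodically issue a request such as:

    http://localhost:8983/solr/<core>/dataimport?command=delta-import

to keep the index synchronized in near real time.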

Note: The author does not recommend using DataImportHandler; there are other better and more convenient implementations available.

1.2 Solr index-creation performance

- ConcurrentUpdateSolrServer works over HTTP; the embedded approach is not recommended. ConcurrentUpdateSolrServer does not require an explicit commit: data is simply added with solrServer.add(doc). It is created with SolrServer solrServer = new ConcurrentUpdateSolrServer(solrUrl, queueSize, threadCount), and it needs to be used together with the autoCommit and autoSoftCommit settings. The commonly recommended configuration is as follows (a short SolrJ sketch follows the configuration):

<autoCommit>
    <maxTime>100000</maxTime>   <!-- commonly recommended: 1-10 min -->
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>1000</maxTime>     <!-- about 1 s -->
</autoSoftCommit>
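Below is a minimal SolrJ 4.x sketch of the usage described above. The Solr URL, queue size, thread count, and field names are illustrative assumptions, not the author's actual test code.

    // Minimal SolrJ 4.x sketch; URL, sizes and field names are assumptions for illustration.
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class BulkIndexSketch {
        public static void main(String[] args) throws IOException, SolrServerException {
            String solrUrl = "http://localhost:8983/solr/collection1";   // assumed core name
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer(solrUrl, 10000, 20);  // queue size, thread count

            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", String.valueOf(i));
                doc.addField("name", "document " + i);
                server.add(doc);   // no explicit commit: autoCommit/autoSoftCommit make the data visible
            }

            server.blockUntilFinished();   // drain the internal queue before shutting down
            server.shutdown();
        }
    }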

The test documents have 17 fields of various types (the raw plain text is about 200 B per document; the corresponding SolrInputDocument object is about 930 B). The index is built so that only the id field is stored and every field is indexed (a schema sketch of this combination follows).
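The stored/indexed combination described above corresponds to schema.xml field definitions roughly like the following; the field names and types here are illustrative assumptions, not the author's actual schema:

    <field name="id"   type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="string" indexed="true" stored="false"/>
    <field name="time" type="date"   indexed="true" stored="false"/>
    <!-- the remaining fields follow the same pattern: indexed="true", stored="false" -->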

If you need the specific test code, you can contact me.

- 17 fields, quad-core CPU, 16 GB memory, gigabit network:

Data volume (×10,000 docs) | Threads | Queue size | Time (s) | Network (MB/s) | Rate (×10,000 docs/s)
200                        | 20      | 10000      | 88       | 10.0           | 2.27
200                        | 20      | 20000      | 133      | 9.0            | 1.50
200                        | 40      | 10000      | 163      | 10.0           | 1.22
200                        | 50      | 10000      | 113      | 10.5           | 1.76
200                        | 100     | 10000      | 120      | 10.5           | 1.67

- Speed: Solr's index-creation speed is positively correlated with the CPU of the Solr machine. In general, CPU utilization can reach nearly 100%, memory usage reaches nearly 100% under the default settings, and network and disk usage stay low. The bottleneck for index creation is therefore CPU and memory. When memory usage stays near 100% and the index size reaches the size of physical memory, inserting new data is prone to OOM errors; in that case run ulimit -v unlimited to set the virtual memory limit to unlimited before starting Solr, and the OOM error no longer occurs. On 64-bit systems, the official recommendation is to use MMapDirectory.
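For reference, the directory implementation is selected in solrconfig.xml through the directoryFactory element; a minimal sketch of switching to MMapDirectory (assuming no other directoryFactory is already configured) looks like this:

    <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>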

- NRTCachingDirectory is slower; at some point index additions stall for a while, during which the index size first grows and then shrinks, and once it has shrunk the additions continue.

- Size: an index of 100 million documents is about 13-16 GB; an index of 200 million documents is about 30 GB.

1.3 Solr search syntax

- Intersection: {name:Baidu AND address:Haidian}, {text:Haidian AND Baidu}.

- Union: {name:Baidu OR address:Haidian}, {text:Haidian OR Baidu}.

- Exclusion: {text:Haidian -Baidu}.

- Wildcard: {bank:China*Bank}.

- Range: {num:[30 TO 60]}.

- Paging: start, rows.

- Sorting: sort.

- Others: grouping, boosting, Chinese word segmentation, and so on (a small SolrJ sketch of several of these query forms follows this list).
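Below is a minimal SolrJ 4.x sketch of a few of these query forms. The Solr URL and the field names (name, address, text, num, time) are illustrative assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QuerySketch {
        public static void main(String[] args) throws SolrServerException {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery();
            query.setQuery("name:Baidu AND address:Haidian");      // intersection
            // query.setQuery("name:Baidu OR address:Haidian");    // union
            // query.setQuery("text:(Haidian -Baidu)");            // exclusion
            // query.setQuery("num:[30 TO 60]");                   // range
            query.setStart(0);                                     // paging: offset of the first result
            query.setRows(10);                                     // paging: number of results per page
            query.addSort("time", SolrQuery.ORDER.desc);           // sorting

            QueryResponse response = server.query(query);
            System.out.println("numFound = " + response.getResults().getNumFound());
        }
    }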

1.4 Search speed on 100 million records

- The tests in this section are based on the index created in Section 1.2.

- Exact search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 1
1                               | 1              | double     | 80-1400
1                               | 1              | string     | 7-800
1                               | 1              | date       | 2-400
1                               | 2 (OR)         | long       | 2
1                               | 2 (OR)         | double     | 200-2400
1                               | 2 (OR)         | string     | 500-1000
1                               | 2 (OR)         | date       | 5-500

- Fuzzy search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 2000-10000
1                               | 1              | double     | 1000-17000
1                               | 1              | string     | 20-16000
1                               | 1              | date       | /
1                               | 2 (OR)         | long       | 3000-25000
1                               | 2 (OR)         | double     | 7000-45000
1                               | 2 (OR)         | string     | 3000-48000
1                               | 2 (OR)         | date       | /

- Range search

Data volume (×100 million docs) | Fields queried | Field type | Time (ms)
1                               | 1              | long       | 6-46000
1                               | 1              | double     | 80-11000
1                               | 1              | string     | 7-3000
1                               | 1              | date       | 1000-2000
1                               | 2 (OR)         | long       | 100-13000
1                               | 2 (OR)         | double     | 100-60000
1                               | 2 (OR)         | string     | 3000-13000
1                               | 2 (OR)         | date       | 7000-10000

- Conclusions:

The larger the query range and the larger the result set, the longer the search takes.

The first search is slower; subsequent searches take less time.
