http://guoyunsky.iteye.com/blog/759148
Solr has a handy component called DataImportHandler, which lets us configure various data sources in db-data-config.xml and then conveniently import and index that data. However, I ran into a problem when importing from a database: if the table holds too much data, the import causes a memory overflow. After reading the source code and writing to the Solr mailing list, I finally found a solution, which I share here.
My Solr version here is 1.4.0 and the database is SQL Server 2005. The approach may not apply to other databases (please share if you have gotten it working elsewhere), but they should have solutions of their own.
DataImportHandler pulls data from the database mainly through JDBC. Since my understanding of JDBC was weak, I had always assumed that the data for a query had to be read from the database in one go. I did not realize that the data can actually be obtained from the database as a stream and fetched piece by piece, i.e. the client can pull the next chunk from the stream each time it finishes the previous one (thanks for the tip). Since the database is SQL Server 2005, I wanted to get some hints from its sqljdbc.jar (reading the source there did not get me far), but from the jar I gathered that SQL Server has such a setting, and I found the answer on Microsoft's MSDN site (URL: http://msdn.microsoft.com/zh-cn/library/ms378663(SQL.90).aspx):
Sets the default cursor type that is used for all result sets that are created by using this SQLServerDataSource object.
public void setSelectMethod(java.lang.String selectMethod)
Parameters
selectMethod
A string value that contains the default cursor type.
Remarks
The selectMethod is the default cursor type that is used for a result set. This property is useful when you are dealing with large result sets and do not want to store the whole result set in memory on the client side. By setting the property to "cursor," you can create a server-side cursor that can fetch smaller chunks of data at a time. If the selectMethod property is not set, getSelectMethod returns the default value of "direct".
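To make the MSDN description concrete, here is a minimal, untested JDBC sketch of the same idea outside of Solr. It assumes the sqljdbc driver's SQLServerDataSource class with its setSelectMethod and setResponseBuffering properties; the server address, database, table and credentials are just the placeholder values from the configuration later in this post, and Solr's own JdbcDataSource does the equivalent work for you.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import com.microsoft.sqlserver.jdbc.SQLServerDataSource;

public class CursorReadSketch {
    public static void main(String[] args) throws Exception {
        SQLServerDataSource ds = new SQLServerDataSource();
        ds.setServerName("192.168.1.5");
        ds.setPortNumber(1433);
        ds.setDatabaseName("testDatabase");
        ds.setUser("sa");
        ds.setPassword("12345");
        // The two settings discussed in this post: a server-side cursor plus
        // adaptive buffering, so rows are fetched in chunks instead of the
        // whole result set being held in client memory.
        ds.setSelectMethod("cursor");
        ds.setResponseBuffering("adaptive");

        try (Connection conn = ds.getConnection();
             // Forward-only, read-only result set, the same two flags mentioned below.
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(3000); // comparable to batchSize="3000" in the DIH data source
            try (ResultSet rs = stmt.executeQuery("select ID, title from test")) {
                while (rs.next()) {
                    // Each row is streamed from the server; memory stays bounded
                    // no matter how large the table is.
                    System.out.println(rs.getLong("ID") + " " + rs.getString("title"));
                }
            }
        }
    }
}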
At the same time, two attributes should be set when creating the statement: ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Tracing the source code, however, shows that Solr already applies these settings, so you can complete the whole configuration directly in db-data-config.xml. The specific configuration is as follows:
<dataSource name="dsSqlServer" type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" batchSize="3000" url="jdbc:sqlserver://192.168.1.5:1433;databaseName=testDatabase;responseBuffering=adaptive;selectMethod=cursor" user="sa" password="12345"/>
Specifically, you only need to append responseBuffering=adaptive;selectMethod=cursor to the URL. With this configuration Solr can read and index the data no matter how large the table is, provided of course that no fault such as a network failure occurs. This idea was also confirmed by a Solr developer; here is the reply I received:
That's not really true. DataImportHandler streams the result from the database query and adds documents into the index, so it shouldn't load all database data into memory. Disabling autocommit, warming queries and spellcheckers usually decreases the amount of memory required during the indexing process. Please share your hardware details, JVM options, solrconfig and schema configuration, etc.
However, a hidden danger remains: if the table is very large, Solr will be fetching and indexing data for a long time, and if any fault occurs in the middle, such as a network problem, the index built so far is thrown away, because we do not know how many documents were indexed or where the import stopped. (In fact, when a network exception occurs Solr throws an exception and rolls the index back to its previous state, so everything done up to that point is wasted, and since Lucene's index writing is fairly slow, hitting some unforeseen problem over such a long period is entirely possible.) In addition, Solr only tells the IndexSearcher to reopen its IndexReader after the import has finished writing, so for a long time you cannot search the newly indexed data either. I therefore wanted to index in batches by ID (the <uniqueKey> configured in Solr), which is most convenient when the ID is numeric: set a starting point and a batch size, for example start at 1 and index 10000 rows at a time, so Solr works through 1-10000, then 10001-20000, and so on. I planned to subclass DataImportHandler and handle this myself, but a Solr developer emailed me to say that it can be done entirely through configuration. The mail content is as follows:
You can _batch_ import your data using the full-import command by providing an additional request parameter (see http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters), i.e.
query="select * from my_table order by id limit 1000000 offset ${dataimporter.request.offset}"
and then calling the full-import command several times:
1) /dataimport?clean=true&offset=0
2) /dataimport?clean=false&offset=1000000
3) /dataimport?clean=false&offset=2000000
etc.
So I used the following configuration in db-data-config.xml:
<entity name="testEntity" dataSource="dsSqlServer" pk="ID" query="select ID, title, author, content, URL, addon from test where ID >= ${dataimporter.request.offset} and ID <= ${dataimporter.request.offset} + 10000">
and then issued the following requests (matching the 10000-row step in the query above):
1) /dataimport?clean=true&offset=0
2) /dataimport?clean=false&offset=10000
3) /dataimport?clean=false&offset=20000
These URLs ask Solr to index: Solr reads the offset parameter, builds the query statement, fetches that range of data and indexes it. Still, it is best to write a small class for further control. At the start it can fetch the total number of rows in the table and then drive the import through the loop of requests above. The +10000 can also be turned into a parameter, for example ${dataimporter.request.intervalRows} to indicate how many rows to take per batch, with a matching &intervalRows parameter added to the URL; I have not tried this yet. Also, because indexing is still a long-running process, it is best to log each interval that has been indexed successfully. For example, after the 10001-20000 range has been indexed, write a log record; then if something fails while indexing 20001-30000, the log tells us the last successfully indexed interval, and we can regenerate the URLs from that point and continue indexing.
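As a concrete illustration of that controller idea, here is a rough, untested sketch in Java. It assumes the entity query has been changed to use ${dataimporter.request.intervalRows} instead of the hard-coded +10000, and the handler URL, batch size and maximum ID are hypothetical values, not something from the original setup; since full-import runs asynchronously, the sketch crudely polls the handler's status output before starting the next batch.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class BatchImportDriver {

    // Hypothetical values: adjust the handler URL and batch size to your setup.
    private static final String DIH_URL = "http://localhost:8983/solr/dataimport";
    private static final int INTERVAL_ROWS = 10000;

    public static void main(String[] args) throws Exception {
        long maxId = 1000000; // in practice, read select max(ID) from test first
        for (long offset = 0; offset < maxId; offset += INTERVAL_ROWS) {
            boolean firstBatch = (offset == 0);
            // clean=true only on the first batch, so later batches do not wipe the index.
            String request = DIH_URL + "?command=full-import"
                    + "&clean=" + firstBatch
                    + "&commit=true"
                    + "&offset=" + offset
                    + "&intervalRows=" + INTERVAL_ROWS;
            fetch(request);
            waitUntilIdle();
            // Log the range that just finished; after a crash, read the last line
            // of this log and restart the loop from that offset.
            System.out.println("indexed rows " + (offset + 1) + "-" + (offset + INTERVAL_ROWS));
        }
    }

    // full-import returns immediately and runs in the background, so poll the
    // handler's status output until it no longer reports itself as busy
    // (a crude string check, good enough for a sketch).
    private static void waitUntilIdle() throws Exception {
        Thread.sleep(2000); // give the import a moment to start
        while (fetch(DIH_URL).contains("busy")) {
            Thread.sleep(5000);
        }
    }

    private static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }
}

This only shows the flow of the loop; a real run would also need error handling around each request and around the status check.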
I hope this is helpful; if you have any questions, please leave a comment. I have only dealt with SQL Server 2005 here, so if you get this working on other databases, please share. Also, if you can be sure that a single interval of data, say 10001-20000, will not cause a memory overflow, you can skip the responseBuffering=adaptive;selectMethod=cursor configuration altogether and simply set the offset in the request URL.