The solution that Solr officially gives is:
DataImportHandler is designed to stream rows one by one. It passes a fetch size value (default: 500) to Statement#setFetchSize, which some drivers do not honor. For MySQL, add a batchSize property to the dataSource configuration with the value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables. It should look like:

<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>
Description: DataImportHandler is designed to process rows one by one. By default it passes a fetch size of 500 to Statement#setFetchSize, but some drivers do not honor this setting. For MySQL, set the batchSize property of the dataSource configuration to -1. This passes Integer.MIN_VALUE (-2^31, i.e. -2147483648, 0x80000000) to the driver as the fetch size, which keeps a full import over a large table from running out of memory.
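For context, a minimal data-config.xml sketch with this setting could look like the following. The entity name, query, and field mappings are hypothetical placeholders for your own schema; the part the FAQ actually prescribes is the batchSize="-1" attribute on the dataSource.

<dataConfig>
  <!-- batchSize="-1" makes the MySQL driver stream rows instead of buffering the whole table -->
  <dataSource type="JdbcDataSource" name="ds-2"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:8889/mysqldatabase"
              batchSize="-1" user="root" password="root"/>
  <document>
    <!-- hypothetical entity: adjust the query and field mappings to your schema -->
    <entity name="item" dataSource="ds-2" query="SELECT id, title FROM item">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>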
The official explanation given by MySQL is:
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time. To enable this functionality, create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row by row. After this, any result sets created with the statement will be retrieved row by row. There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown. The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes. If the statement is within the scope of a transaction, then locks are released when the transaction completes (which implies that the statement needs to complete first). As with most other databases, statements are not complete until all the results pending on the statement are read or the active result set for the statement is closed. Therefore, if using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
In short, combining a forward-only, read-only ResultSet with a fetch size of Integer.MIN_VALUE acts as a signal telling the driver to stream the result set row by row. Once this is set, every ResultSet created from that Statement is fetched row by row.
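Putting the pieces together, the following is a minimal Java sketch of streaming a large MySQL table over JDBC. The connection URL, credentials, table name (big_table), and column (id) are hypothetical placeholders; the essential parts are the forward-only/read-only statement, setFetchSize(Integer.MIN_VALUE), and reading the entire result set before issuing any other query on the same connection.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingQueryDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection parameters; replace with your own.
        String url = "jdbc:mysql://localhost:8889/mysqldatabase";
        try (Connection conn = DriverManager.getConnection(url, "root", "root");
             // Forward-only + read-only + Integer.MIN_VALUE fetch size
             // tells the MySQL driver to stream rows one at a time.
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM big_table")) {
                long count = 0;
                while (rs.next()) {
                    // Process each row as quickly as possible; the connection
                    // cannot run other queries until this ResultSet is fully
                    // read or closed.
                    rs.getLong("id");
                    count++;
                }
                System.out.println("Streamed " + count + " rows");
            }
        }
    }
}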
Reference documents:
"1" Https://wiki.apache.org/solr/DataImportHandlerFaq
"2" http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html
The above is the solution to the memory overflow problem when using Solr's DataImportHandler to build a full index over a large MySQL table.