1. Create a collection
SSH into a CDH node that has Solr installed.
Run solrctl instancedir --generate /solr/test/gx_sh_tl_tgryxx_2015 to generate the default configuration for the gx_sh_tl_tgryxx_2015 collection.
Enter the /solr/test/gx_sh_tl_tgryxx_2015/conf directory and first edit the field definitions in schema.xml; there are plenty of examples online.
In solrconfig.xml, next to the other <lib/> entries, add <lib dir="/iflytek/solrjar/" regex=".*\.jar"/>. Place the dependent jars (such as the Oracle JDBC driver) in the /iflytek/solrjar/ directory, then copy /iflytek/solrjar/ to the other Solr nodes.
- Add the following node to solrconfig.xml, after the other <requestHandler/> entries:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
- Create a new file data-config.xml in the /solr/test/gx_sh_tl_tgryxx_2015/conf directory. It mainly configures the database connection information and the query statement for the data to be imported. Here ${} denotes a parameter; these parameters must be passed in on every import, as in the following example:

<dataConfig>
  <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@10.154.5.11:1521:test1"
              user="test" password="test" batchSize="10000"/>
  <document name="default">
    <entity pk="ID" name="gx_sh_tl_tgryxx_2015"
            query="select id,xm,sfzh as zjhm,xb_mc as xb,mz,tgkkdm as kkbh,tgkkmc as kkmc,tgkkssd as kkszdz,tgsj,rksj as rybsj,createdate,etldate
                   from ${dataimporter.request.tableName} t
                   where t.etldate &gt;= '${dataimporter.request.dateStart}'
                     and t.etldate &lt; '${dataimporter.request.dateEnd}'">
    </entity>
  </document>
</dataConfig>
- Run solrctl instancedir --create gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015 to upload the configuration to ZooKeeper. Later, you can run solrctl instancedir --update gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015 to push a modified configuration to ZooKeeper.
Create the collection with solrctl collection --create gx_sh_tl_tgryxx_2015 -s 3. Here gx_sh_tl_tgryxx_2015 is the collection name; when the collection name matches the previously uploaded configuration name, you do not need to specify the configuration name with the -c parameter. The -s 3 parameter sets the number of shards: my cluster has 3 Solr nodes, so the data is sharded across the 3 nodes without duplication. If one of the nodes is down, adding the shards.tolerant=true parameter still lets queries run, returning only about 2/3 of the data.
2. Manual data entry
The steps above create a collection; when you want to create additional collections later, follow the same procedure. Manual import is done from the Solr admin interface, for example:
First clear the Clean check box and select the entity defined in data-config.xml. The Custom Parameters field lets us pass in our own parameters; when creating the collection I defined 3 parameters, the table name and a time range. This makes it possible to import data from multiple tables into a single collection without modifying the configuration.
Why define these 3 parameters? A collection in Solr generally corresponds to multiple tables in Oracle (large data volumes are split by year or month), and a single table can reach hundreds of millions of rows. Importing it in one pass may fail with a "snapshot too old" error, and since the imported rows cannot then be distinguished, the whole import would have to be redone. Defining a time range reduces the amount of data per import, which avoids the snapshot-too-old problem and limits how much data must be re-imported on error.
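The manual import in the admin UI boils down to a request against the collection's /dataimport handler. A minimal sketch of building that request (the host, port, and exact parameter casing are assumptions based on the examples in this article):

```python
# Build the DataImportHandler full-import URL with the three custom
# parameters defined above (table name plus a time range).
# Host, port, and parameter names are assumptions from this article's examples.
from urllib.parse import urlencode

def full_import_url(host, collection, table_name, date_start, date_end):
    params = {
        "command": "full-import",
        "clean": "false",        # the Clean check box is cleared
        "commit": "true",
        "tableName": table_name, # read as ${dataimporter.request.tableName}
        "dateStart": date_start, # ${dataimporter.request.dateStart}
        "dateEnd": date_end,     # ${dataimporter.request.dateEnd}
    }
    return "http://%s/solr/%s/dataimport?%s" % (host, collection, urlencode(params))

url = full_import_url("slave3:8983", "gx_sh_tl_tgryxx_2015",
                      "gx_sh_tl_tgryxx_2015", "20150106103453", "20150116103453")
print(url)
```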
If only a few batches are needed, you can adjust the parameters manually in the admin interface.
3. Automatic data entry
With the time-based batching above, if a time slice is too large the import may fail; if it is too small, adjusting the parameters by hand for every batch is unrealistic.
So the general approach is to write a scheduled Java task that periodically checks the import status and, once a batch completes, automatically adjusts the parameters and imports the next batch.
I am more familiar with Kettle, so I use Kettle to do this work here; if you are not familiar with Kettle, a scheduled Java task will do. After the stock (historical) data is imported, incremental data must continue to be imported, so a scheduled task is necessary in any case.
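The batch loop such a scheduled task runs can be sketched as follows (in Python rather than Java/Kettle; the Solr URL and polling details are assumptions — DataImportHandler does have a "status" command, but verify the response format against your version):

```python
# Sketch of the scheduled batch-import loop: slice the time range,
# trigger one full-import per slice, and wait for each to finish.
import time
import urllib.request
from urllib.parse import urlencode

def slice_range(start, end, step):
    """Split [start, end) into consecutive [lo, hi) batches of width step."""
    batches, lo = [], start
    while lo < end:
        hi = min(lo + step, end)
        batches.append((lo, hi))
        lo = hi
    return batches

def import_batches(solr_url, table, start, end, step):
    """Run one full-import per time slice, polling until each completes."""
    for lo, hi in slice_range(start, end, step):
        params = urlencode({"command": "full-import", "clean": "false",
                            "commit": "true", "tableName": table,
                            "dateStart": str(lo), "dateEnd": str(hi)})
        urllib.request.urlopen("%s/dataimport?%s" % (solr_url, params))
        # Poll the status command until the current batch is done,
        # then move on to the next slice.
        while b"busy" in urllib.request.urlopen(
                solr_url + "/dataimport?command=status").read():
            time.sleep(10)
```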
Each automatic import should be logged, making it easy to find which batches failed so they can be re-imported manually.
Solr's DataImportHandler does have a delta-import mode, but I do not use it; a scheduled task you write yourself can log to the database, which makes monitoring easier.
Annex 2 contains the jobs I use, covering both stock and incremental import. I have removed the database connection information, so they cannot be run directly and are for reference only; if they are unclear, write your own Java version.
The following two pictures show the jobs; for the details, open them in Kettle.
Figure 1: Stock import job example
Figure 2: Incremental import job example
You can query multiple collections at once by filling in collection=collection1,collection2 as a custom parameter in the Raw Query Parameters box.
Delete the data matched by a query: http://slave3:8983/solr/gx_ga_jjzd_kkxx/update/?stream.body=<delete><query>etldate:[20150106103453 TO 20150116103453]</query></delete>&commit=true
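The same delete-by-query can be issued from code; a sketch (host, collection, and field are taken from the URL above — note that the stream.body XML must be URL-encoded when built programmatically):

```python
# Build a URL-encoded delete-by-query request equivalent to the URL above.
from urllib.parse import quote

def delete_by_query_url(host, collection, query):
    body = "<delete><query>%s</query></delete>" % query
    return ("http://%s/solr/%s/update/?stream.body=%s&commit=true"
            % (host, collection, quote(body)))

url = delete_by_query_url("slave3:8983", "gx_ga_jjzd_kkxx",
                          "etldate:[20150106103453 TO 20150116103453]")
print(url)
```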
When a shard in the collection is corrupted, a direct query returns an error; adding the shards.tolerant=true parameter ignores the broken shard and queries only the healthy ones.
Disable the document uniqueness check: by default, Solr checks the primary key for duplicates to prevent different documents from using the same primary key. If you are sure your documents have no duplicate primary keys, add the parameter allowDups=true to the URL to skip the check; for CSV documents, use overwrite=false.
- Set field types according to their purpose: the string type is not tokenized. Since I wanted behavior like a relational database, string would have been the better choice; inexperienced, I used the text_general type and ran into a lot of trouble.
For reference, the output of solrctl --help:

usage: /usr/bin/solrctl [options] command [command-arg] [command [command-arg]] ...

Options:
    --solr solr_uri
    --zk   zk_ensemble
    --help
    --quiet

Commands:
    init        [--force]
    instancedir [--generate path [-schemaless]]
                [--create name path]
                [--update name path]
                [--get name path]
                [--delete name]
                [--list]
    collection  [--create name -s <numShards>
                    [-a Create collection with autoAddReplicas=true]
                    [-c <collection.configName>]
                    [-r <replicationFactor>]
                    [-m <maxShardsPerNode>]
                    [-n <createNodeSet>]]
                [--delete name]
                [--reload name]
                [--stat name]
                [--deletedocs name]
                [--list]
    core        [--create name [-p name=value]...]
                [--reload name]
                [--unload name]
                [--status name]
    cluster     [--get-solrxml file]
                [--put-solrxml file]
Importing data into Solr in a CDH environment