Importing Data into SOLR in a CDH Environment


1. Create a Collection
  1. SSH into a CDH node that has SOLR installed.

  2. Run solrctl instancedir --generate /solr/test/gx_sh_tl_tgryxx_2015 to generate the default configuration for the gx_sh_tl_tgryxx_2015 collection.

  3. Enter the /solr/test/gx_sh_tl_tgryxx_2015/conf directory and first edit schema.xml to configure the field definitions; there is plenty of material on this online.

  4. In solrconfig.xml, add <lib dir="/iflytek/solrjar/" regex=".*\.jar"/> alongside the other <lib/> entries, place the dependent jars (such as the Oracle JDBC driver) in the /iflytek/solrjar/ directory, and then copy /iflytek/solrjar/ to the other SOLR nodes.

  5. Add the following node to solrconfig.xml after the other <requestHandler/> entries:

      <requestHandler name="/dataimport"
                      class="org.apache.solr.handler.dataimport.DataImportHandler">
           <lst name="defaults">
                <str name="config">data-config.xml</str>
           </lst>
      </requestHandler>

  6. Create a new file, data-config.xml, in the /solr/test/gx_sh_tl_tgryxx_2015/conf directory. It mainly configures the database connection information and the query that selects the data to be imported; ${} denotes a parameter that must be passed in on every import (a Java sketch of invoking the handler with these parameters follows this list). For example:

    <dataConfig>
        <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
                    url="jdbc:oracle:thin:@10.154.5.11:1521:test1"
                    user="test" password="test" batchSize="10000"/>
        <document name="default">
              <entity pk="ID" name="gx_sh_tl_tgryxx_2015"
                      query="select id, xm, sfzh as zjhm, xb_mc as xb, mz,
                             tgkkdm as kkbh, tgkkmc as kkmc, tgkkssd as kkszdz,
                             tgsj, rksj as rybsj, createdate, etldate
                             from ${dataimporter.request.tableName} t
                             where t.etldate >= '${dataimporter.request.dateStart}'
                             and t.etldate &lt; '${dataimporter.request.dateEnd}'">
              </entity>
        </document>
    </dataConfig>

  7. Run solrctl instancedir --create gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015 to upload the configuration to ZooKeeper. After modifying the configuration, you can push it to ZooKeeper again with solrctl instancedir --update gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015.
  8. Create the collection with solrctl collection --create gx_sh_tl_tgryxx_2015 -s 3. Here gx_sh_tl_tgryxx_2015 is the collection name; when it matches the name of the configuration uploaded earlier, the -c parameter specifying the configuration name can be omitted. -s 3 sets the number of shards: my cluster has 3 SOLR nodes, so the data is spread across all 3 nodes with no duplication. If one node is down, adding the shards.tolerant=true parameter still lets you query the remaining roughly 2/3 of the data.
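The next section drives the import from the admin UI, but the same request can also be issued directly over HTTP. Below is a minimal plain-Java sketch of triggering a full import with the custom parameters defined in data-config.xml above; the host name slave3 and the parameter values are placeholders to substitute with your own.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Sketch: trigger a DataImportHandler full-import with the custom
    // parameters (tableName, dateStart, dateEnd) from data-config.xml.
    public class TriggerImport {
        public static void main(String[] args) throws Exception {
            String base = "http://slave3:8983/solr/gx_sh_tl_tgryxx_2015/dataimport";
            // clean=false keeps previously imported documents (same as
            // clearing the Clean check box in the admin UI).
            String url = base + "?command=full-import&clean=false&commit=true"
                    + "&entity=gx_sh_tl_tgryxx_2015"
                    + "&tableName=" + URLEncoder.encode("gx_sh_tl_tgryxx_2015", "UTF-8")
                    + "&dateStart=20150101000000"   // placeholder window start
                    + "&dateEnd=20150201000000";    // placeholder window end
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // DIH replies with an XML status message
                }
            }
        }
    }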

2. Manual Data Entry

The steps above create the collection; later, when you want to import data into it, you can use the Dataimport page of the SOLR admin interface, as shown below:

First clear the Clean check box and select the entity defined in data-config.xml. The Custom Parameters box lets us pass in our own parameters; I defined three of them when creating the collection, the table name and a time range. This makes it possible to import data from multiple tables into a single collection in batches without modifying the configuration.

Why define these three parameters? A SOLR collection generally corresponds to several Oracle tables (large data volumes are split by year or month), and a single table can reach hundreds of millions of rows. Importing one in a single pass may fail with an Oracle "snapshot too old" error, and since you cannot tell which rows were imported, the whole import has to be redone. Defining a time range limits the amount of data per import, which avoids the snapshot-too-old problem and means only the failed range needs to be re-imported.

If only a few batches are needed, manually adjusting the parameters in the admin interface for each one is fine.

3. Automatic Data Entry

With the time-based batching above, a batch that is too large fails to import, while batches small enough to be safe are too numerous to adjust by hand.

So the general idea is to write a Java timed task that periodically checks the import status and, once a batch completes, automatically adjusts the parameters and imports the next batch of data, as in the sketch below.
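A minimal sketch of such a timed task in plain Java, assuming the /dataimport handler and custom parameters from section 1. The host name, the batch windows, and the one-minute polling interval are illustrative only; a real job would also write each batch's outcome to a database log, as described below.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll DIH status; when the handler is idle, start the next batch.
    public class ImportScheduler {
        private static final String BASE =
                "http://slave3:8983/solr/gx_sh_tl_tgryxx_2015/dataimport";
        // Hypothetical batch windows: {tableName, dateStart, dateEnd}.
        private static final String[][] BATCHES = {
                {"gx_sh_tl_tgryxx_2015", "20150101000000", "20150108000000"},
                {"gx_sh_tl_tgryxx_2015", "20150108000000", "20150115000000"},
        };
        private static int next = 0;

        public static void main(String[] args) {
            ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
            pool.scheduleWithFixedDelay(ImportScheduler::tick, 0, 1, TimeUnit.MINUTES);
        }

        private static void tick() {
            try {
                // The DIH status response contains "idle" once the previous run ends.
                if (get(BASE + "?command=status").contains("idle") && next < BATCHES.length) {
                    String[] b = BATCHES[next++];
                    String url = BASE + "?command=full-import&clean=false&commit=true"
                            + "&tableName=" + b[0]
                            + "&dateStart=" + b[1] + "&dateEnd=" + b[2];
                    System.out.println("starting batch: " + url); // log to a DB in practice
                    get(url);
                }
            } catch (Exception e) {
                e.printStackTrace(); // record failures so the batch can be re-run by hand
            }
        }

        private static String get(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line);
            }
            return sb.toString();
        }
    }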

I am more familiar with Kettle, so here I used Kettle to do this work; if you are not familiar with Kettle, a Java timed task works just as well. After the stock (historical) data is imported, the incremental data still has to be imported continuously, so a scheduled task is needed in any case.

Each automatic import should be logged, making it easy to find which batch failed so it can be re-imported by hand.

SOLR's DataImportHandler has its own delta-import mechanism, but I did not use it; writing your own timed task lets you record logs to the database for easy monitoring.

Attachment 2 contains the jobs I am using, covering both stock and incremental import. I removed the database connection information, so they cannot be run directly and are for reference only; if they are unclear, just write your own Java.

The following two figures show the jobs; for the details, open them in Kettle.

Figure 1: Stock import job example

Figure 2: Incremental import job example

    • Related documents and attachments: http://pan.baidu.com/s/1kTxptMv

    • Tips on using SOLR

    1. You can query multiple collections at once by entering collection=collection1,collection2 in the Raw Query Parameters box.

    2. Delete the data matching a query: http://slave3:8983/solr/gx_ga_jjzd_kkxx/update/?stream.body=<delete><query>etldate:[20150106103453 TO 20150116103453]</query></delete>&commit=true (when issuing this from code, the stream.body value must be URL-encoded; see the Java sketch after this list).

    3. When a shard in the collection is corrupted, a direct query returns an error; adding the shards.tolerant=true parameter ignores the broken shard and queries only the healthy ones.

    4. Disable the document uniqueness check: by default, SOLR checks the primary key for duplicates so that different documents cannot share the same key. If you are sure your documents have no duplicate primary keys, add allowDups=true to the URL to skip the check; for CSV documents, use overwrite=false.

    5. Set field types according to your purpose. The string type is not tokenized; since I wanted behavior like a relational database, string would have been the better choice. Being inexperienced, I used text_general, which caused a lot of trouble.
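Tip 2's URL contains spaces and angle brackets, so when the same delete is issued from code the stream.body value has to be URL-encoded. A minimal plain-Java sketch, using the host and collection names from the example URL above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Sketch: delete-by-query over HTTP with a properly encoded stream.body.
    public class DeleteByQuery {
        public static void main(String[] args) throws Exception {
            String body = URLEncoder.encode(
                    "<delete><query>etldate:[20150106103453 TO 20150116103453]</query></delete>",
                    "UTF-8");
            String url = "http://slave3:8983/solr/gx_ga_jjzd_kkxx/update/"
                    + "?stream.body=" + body + "&commit=true";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // SOLR responds with an XML status
                }
            }
        }
    }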
    • Appendix: solrctl help

# solrctl --help

usage: /usr/bin/solrctl [options] command [command-arg] [command [command-arg]] ...

Options:

    --solr solr_uri
    --zk zk_ensemble
    --help
    --quiet

Commands:

    init        [--force]

    instancedir [--generate path [-schemaless]]
                [--create name path]
                [--update name path]
                [--get name path]
                [--delete name]
                [--list]

    collection  [--create name -s <numShards>
                     [-a Create collection with autoAddReplicas=true]
                     [-c <collection.configName>]
                     [-r <replicationFactor>]
                     [-m <maxShardsPerNode>]
                     [-n <createNodeSet>]]
                [--delete name]
                [--reload name]
                [--stat name]
                [--deletedocs name]
                [--list]

    core        [--create name [-p name=value]...]
                [--reload name]
                [--unload name]
                [--status name]

    cluster     [--get-solrxml file]
                [--put-solrxml file]
