Importing Data into SOLR in a CDH Environment


1. Create a Collection
  1. SSH into a CDH node that has SOLR installed.

  2. Run solrctl instancedir --generate /solr/test/gx_sh_tl_tgryxx_2015 to generate the default configuration for the gx_sh_tl_tgryxx_2015 collection.

  3. Enter the /solr/test/gx_sh_tl_tgryxx_2015/conf directory and first edit schema.xml to configure the field definitions; there is plenty of material on this online.

  4. In solrconfig.xml, add <lib dir="/iflytek/solrjar/" regex=".*\.jar"/> alongside the other <lib/> entries, place the dependent jars (such as the Oracle JDBC driver) in the /iflytek/solrjar/ directory, and then copy /iflytek/solrjar/ to the other SOLR nodes.

  5. Add the following node to solrconfig.xml after the other <requestHandler/> entries:

      <requestHandler name="/dataimport"
                      class="org.apache.solr.handler.dataimport.DataImportHandler">
           <lst name="defaults">
                <str name="config">data-config.xml</str>
           </lst>
      </requestHandler>

  6. Create a new file, data-config.xml, in the /solr/test/gx_sh_tl_tgryxx_2015/conf directory. It mainly configures the database connection information and the query that selects the data to be imported; ${} denotes a parameter that must be passed in on every import (a Java sketch of invoking the handler with these parameters follows this list). For example:

    <dataConfig>
        <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
                    url="jdbc:oracle:thin:@10.154.5.11:1521:test1"
                    user="test" password="test" batchSize="10000"/>
        <document name="default">
              <entity pk="ID" name="gx_sh_tl_tgryxx_2015"
                      query="select id, xm, sfzh as zjhm, xb_mc as xb, mz,
                             tgkkdm as kkbh, tgkkmc as kkmc, tgkkssd as kkszdz,
                             tgsj, rksj as rybsj, createdate, etldate
                             from ${dataimporter.request.tableName} t
                             where t.etldate >= '${dataimporter.request.dateStart}'
                             and t.etldate &lt; '${dataimporter.request.dateEnd}'">
              </entity>
        </document>
    </dataConfig>

  7. Run solrctl instancedir --create gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015 to upload the configuration to ZooKeeper. After modifying the configuration, you can push it to ZooKeeper again with solrctl instancedir --update gx_sh_tl_tgryxx_2015 /solr/test/gx_sh_tl_tgryxx_2015.
  8. Create the collection with solrctl collection --create gx_sh_tl_tgryxx_2015 -s 3. Here gx_sh_tl_tgryxx_2015 is the collection name; when it matches the name of the configuration uploaded earlier, the -c parameter specifying the configuration name can be omitted. -s 3 sets the number of shards: my cluster has 3 SOLR nodes, so the data is spread across all 3 nodes with no duplication. If one node is down, adding the shards.tolerant=true parameter still lets you query the remaining roughly 2/3 of the data.
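The next section drives the import from the admin UI, but the same request can also be issued directly over HTTP. Below is a minimal plain-Java sketch of triggering a full import with the custom parameters defined in data-config.xml above; the host name slave3 and the parameter values are placeholders to substitute with your own.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Sketch: trigger a DataImportHandler full-import with the custom
    // parameters (tableName, dateStart, dateEnd) from data-config.xml.
    public class TriggerImport {
        public static void main(String[] args) throws Exception {
            String base = "http://slave3:8983/solr/gx_sh_tl_tgryxx_2015/dataimport";
            // clean=false keeps previously imported documents (same as
            // clearing the Clean check box in the admin UI).
            String url = base + "?command=full-import&clean=false&commit=true"
                    + "&entity=gx_sh_tl_tgryxx_2015"
                    + "&tableName=" + URLEncoder.encode("gx_sh_tl_tgryxx_2015", "UTF-8")
                    + "&dateStart=20150101000000"   // placeholder window start
                    + "&dateEnd=20150201000000";    // placeholder window end
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // DIH replies with an XML status message
                }
            }
        }
    }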

2. Manual Data Entry

The steps above create the collection; later, when you want to import data into it, you can use the Dataimport page of the SOLR admin interface, as shown below:

First clear the Clean check box and select the entity defined in data-config.xml. The Custom Parameters box lets us pass in our own parameters; I defined three of them when creating the collection, the table name and a time range. This makes it possible to import data from multiple tables into a single collection in batches without modifying the configuration.

Why define these three parameters? A SOLR collection generally corresponds to several Oracle tables (large data volumes are split by year or month), and a single table can reach hundreds of millions of rows. Importing one in a single pass may fail with an Oracle "snapshot too old" error, and since you cannot tell which rows were imported, the whole import has to be redone. Defining a time range limits the amount of data per import, which avoids the snapshot-too-old problem and means only the failed range needs to be re-imported.

If only a few batches are needed, manually adjusting the parameters in the admin interface for each one is fine.

3. Automatic Data Entry

With the time-based batching above, a batch that is too large fails to import, while batches small enough to be safe are too numerous to adjust by hand.

So the general idea is to write a Java timed task that periodically checks the import status and, once a batch completes, automatically adjusts the parameters and imports the next batch of data, as in the sketch below.
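A minimal sketch of such a timed task in plain Java, assuming the /dataimport handler and custom parameters from section 1. The host name, the batch windows, and the one-minute polling interval are illustrative only; a real job would also write each batch's outcome to a database log, as described below.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll DIH status; when the handler is idle, start the next batch.
    public class ImportScheduler {
        private static final String BASE =
                "http://slave3:8983/solr/gx_sh_tl_tgryxx_2015/dataimport";
        // Hypothetical batch windows: {tableName, dateStart, dateEnd}.
        private static final String[][] BATCHES = {
                {"gx_sh_tl_tgryxx_2015", "20150101000000", "20150108000000"},
                {"gx_sh_tl_tgryxx_2015", "20150108000000", "20150115000000"},
        };
        private static int next = 0;

        public static void main(String[] args) {
            ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
            pool.scheduleWithFixedDelay(ImportScheduler::tick, 0, 1, TimeUnit.MINUTES);
        }

        private static void tick() {
            try {
                // The DIH status response contains "idle" once the previous run ends.
                if (get(BASE + "?command=status").contains("idle") && next < BATCHES.length) {
                    String[] b = BATCHES[next++];
                    String url = BASE + "?command=full-import&clean=false&commit=true"
                            + "&tableName=" + b[0]
                            + "&dateStart=" + b[1] + "&dateEnd=" + b[2];
                    System.out.println("starting batch: " + url); // log to a DB in practice
                    get(url);
                }
            } catch (Exception e) {
                e.printStackTrace(); // record failures so the batch can be re-run by hand
            }
        }

        private static String get(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line);
            }
            return sb.toString();
        }
    }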

I am more familiar with Kettle, so here I used Kettle to do this work; if you are not familiar with Kettle, a Java timed task works just as well. After the stock (historical) data is imported, the incremental data still has to be imported continuously, so a scheduled task is needed in any case.

Each automatic import should be logged, making it easy to find which batch failed so it can be re-imported by hand.

SOLR's DataImportHandler has its own delta-import mechanism, but I did not use it; writing your own timed task lets you record logs to the database for easy monitoring.

Attachment 2 contains the jobs I am using, covering both stock and incremental import. I removed the database connection information, so they cannot be run directly and are for reference only; if they are unclear, just write your own Java.

The following two figures show the jobs; for the details, open them in Kettle.

Figure 1: Stock import job example

Figure 2: Incremental import job example

    • Related documents and attachments: http://pan.baidu.com/s/1kTxptMv

    • Tips on using SOLR

    1. You can query multiple collections at once by entering collection=collection1,collection2 in the Raw Query Parameters box.

    2. Delete the data matching a query: http://slave3:8983/solr/gx_ga_jjzd_kkxx/update/?stream.body=<delete><query>etldate:[20150106103453 TO 20150116103453]</query></delete>&commit=true (when issuing this from code, the stream.body value must be URL-encoded; see the Java sketch after this list).

    3. When a shard in the collection is corrupted, a direct query returns an error; adding the shards.tolerant=true parameter ignores the broken shard and queries only the healthy ones.

    4. Disable the document uniqueness check: by default, SOLR checks the primary key for duplicates so that different documents cannot share the same key. If you are sure your documents have no duplicate primary keys, add allowDups=true to the URL to skip the check; for CSV documents, use overwrite=false.

    5. Set field types according to your purpose. The string type is not tokenized; since I wanted behavior like a relational database, string would have been the better choice. Being inexperienced, I used text_general, which caused a lot of trouble.
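Tip 2's URL contains spaces and angle brackets, so when the same delete is issued from code the stream.body value has to be URL-encoded. A minimal plain-Java sketch, using the host and collection names from the example URL above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Sketch: delete-by-query over HTTP with a properly encoded stream.body.
    public class DeleteByQuery {
        public static void main(String[] args) throws Exception {
            String body = URLEncoder.encode(
                    "<delete><query>etldate:[20150106103453 TO 20150116103453]</query></delete>",
                    "UTF-8");
            String url = "http://slave3:8983/solr/gx_ga_jjzd_kkxx/update/"
                    + "?stream.body=" + body + "&commit=true";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // SOLR responds with an XML status
                }
            }
        }
    }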
    • Appendix: solrctl help

# solrctl --help

usage: /usr/bin/solrctl [options] command [command-arg] [command [command-arg]] ...

Options:

    --solr solr_uri
    --zk zk_ensemble
    --help
    --quiet

Commands:

    init        [--force]

    instancedir [--generate path [-schemaless]]
                [--create name path]
                [--update name path]
                [--get name path]
                [--delete name]
                [--list]

    collection  [--create name -s <numShards>
                     [-a Create collection with autoAddReplicas=true]
                     [-c <collection.configName>]
                     [-r <replicationFactor>]
                     [-m <maxShardsPerNode>]
                     [-n <createNodeSet>]]
                [--delete name]
                [--reload name]
                [--stat name]
                [--deletedocs name]
                [--list]

    core        [--create name [-p name=value]...]
                [--reload name]
                [--unload name]
                [--status name]

    cluster     [--get-solrxml file]
                [--put-solrxml file]
