Building an index with Spark is straightforward: compared with the older MapReduce-on-Hadoop approach, Spark provides the more advanced and abstract RDD (Resilient Distributed Dataset) for building large-scale indexes, and it brings a number of advantages such as a more flexible API, higher performance, and more concise syntax.
First, look at the overall topology diagram:
Then, take a look at the Spark program, written in Scala:
Scala code
package com.easy.build.index

import java.util

import org.apache.solr.client.solrj.beans.Field
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.annotation.meta.field

/**
  * Created by Qindongliang on 2016/1/21.
  */

// Register the model. A time field can be declared as a String here, as long as the field in the
// Solr index is configured as long; the annotation mapping looks like this:
case class Record(
  @(Field@field)("rowkey") rowkey: String,
  @(Field@field)("title") title: String,
  @(Field@field)("content") content: String,
  @(Field@field)("isdel") isdel: String,
  @(Field@field)("t1") t1: String,
  @(Field@field)("t2") t2: String,
  @(Field@field)("t3") t3: String,
  @(Field@field)("dtime") dtime: String
)

/***
  * Spark builds the index ==> Solr
  */
object SparkIndex {

  // Solr client
  val client = new HttpSolrClient("http://192.168.1.188:8984/solr/monitor")
  // number of documents submitted per batch
  val batchCount = 10000

  def main2(args: Array[String]): Unit = {
    val d1 = new Record("row1", "title", "content", "1", "n", "", "+", "3")
    val d2 = new Record("row2", "title", "content", "1", "n", "", "+", "45")
    val d3 = new Record("row3", "title", "content", "1", "n", "$", "$", null)
    client.addBean(d1)
    client.addBean(d2)
    client.addBean(d3)
    client.commit()
    println("Submitted successfully!")
  }

  /***
    * Iterate over one partition's data (an iterator) and process it
    * @param lines the data of one partition
    */
  def indexPartition(lines: scala.Iterator[String]): Unit = {
    // initialize the collection; resources such as database connections can also be
    // initialized here, before the partition iteration begins
    val datas = new util.ArrayList[Record]()
    // iterate over each line and submit the data once the batch condition is met
    lines.foreach(line => indexLineToModel(line, datas))
    // after the partition has been processed, close resources if needed and commit the remaining data
    commitSolr(datas, true)
  }

  /***
    * Submit index data to Solr
    *
    * @param datas index data
    * @param isEnd whether this is the last commit
    */
  def commitSolr(datas: util.ArrayList[Record], isEnd: Boolean): Unit = {
    // commit only on the final call (if any data is left) or when the collection size reaches the batch size
    if ((datas.size() > 0 && isEnd) || datas.size() == batchCount) {
      client.addBeans(datas)
      client.commit()  // commit the data
      datas.clear()    // clear the collection so it can be reused
    }
  }

  /***
    * Take one line of partition data and map it to the model for subsequent index processing
    *
    * @param line  one line of raw data
    * @param datas collection used to batch-submit the index
    */
  def indexLineToModel(line: String, datas: util.ArrayList[Record]): Unit = {
    // clean and convert the array of fields
    val fields = line.split("\1", -1).map(field => etl_field(field))
    // map the cleaned array into a tuple
    val tuple = buildTuble(fields)
    // convert the tuple into a bean
    val recoder = Record.tupled(tuple)
    // add the entity to the collection for batch submission
    datas.add(recoder)
    // submit the index to Solr (only once the batch is full)
    commitSolr(datas, false)
  }

  /***
    * Map an array into a tuple so it can easily be bound to the bean
    * @param array field array
    * @return tuple
    */
  def buildTuble(array: Array[String]): (String, String, String, String, String, String, String, String) = {
    array match {
      case Array(s1, s2, s3, s4, s5, s6, s7, s8) => (s1, s2, s3, s4, s5, s6, s7, s8)
    }
  }

  /***
    * Field cleaning:
    * an empty value is replaced with null so that Solr does not index the field;
    * a normal value is returned as-is
    *
    * @param field the raw field value
    * @return the cleaned value
    */
  def etl_field(field: String): String = {
    field match {
      case "" => null
      case _  => field
    }
  }

  /***
    * Delete a class of indexed data by query
    * @param query the delete condition
    */
  def deleteSolrByQuery(query: String): Unit = {
    client.deleteByQuery(query)
    client.commit()
    println("Delete succeeded!")
  }

  def main(args: Array[String]): Unit = {
    // delete some data by condition
    deleteSolrByQuery("t1:03")
    // remote submit mode: the packaged jar must be submitted
    val jarPath = "target\\spark-build-index-1.0-SNAPSHOT.jar"
    // remote submit mode: impersonate the relevant Hadoop user, otherwise access to HDFS may be denied
    System.setProperty("user.name", "webmaster")
    // initialize SparkConf
    val conf = new SparkConf().setMaster("spark://192.168.1.187:7077").setAppName("Build Index")
    // upload the jars the job depends on at runtime
    val seq = Seq(jarPath) :+ "D:\\tmp\\lib\\noggit-0.6.jar" :+ "D:\\tmp\\lib\\httpclient-4.3.1.jar" :+ "D:\\tmp\\lib\\httpcore-4.3.jar" :+ "D:\\tmp\\lib\\solr-solrj-5.1.0.jar" :+ "D:\\tmp\\lib\\httpmime-4.3.1.jar"
    conf.setJars(seq)
    // initialize the SparkContext
    val sc = new SparkContext(conf)
    // every file under this directory will be indexed; the line format must follow the agreed convention
    val rdd = sc.textFile("hdfs://192.168.1.187:9000/user/monitor/gs/")
    // build the index from the RDD
    indexRdd(rdd)
    // close the index resource
    client.close()
    // stop the SparkContext
    sc.stop()
  }

  /***
    * Process the RDD data and build the index
    * @param rdd
    */
  def indexRdd(rdd: RDD[String]): Unit = {
    // iterate over the partitions and build the index
    rdd.foreachPartition(line => indexPartition(line))
  }
}
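As the comment in main notes, the input files must follow an agreed format: each line is expected to carry the eight Record fields joined by the \1 (SOH) control character, which is what line.split("\1", -1) relies on. A minimal sketch of that assumption, with made-up sample values:

// Minimal sketch of the assumed input line format: eight fields joined by the \1 (SOH) separator.
// The sample values below are made up for illustration only.
object InputFormatSketch {
  def main(args: Array[String]): Unit = {
    val sampleLine = Seq("row1", "title", "content", "1", "n", "", "", "20160121").mkString("\u0001")
    // the -1 limit keeps trailing empty fields, so empty columns still line up with the case class
    val parsed = sampleLine.split("\u0001", -1)
    println(parsed.length) // 8, matching the eight fields of Record
  }
}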
OK, our index-building program is finished. This example uses the remote-submit mode, but it can also run in spark on yarn (cluster or client) mode. Note that in that case you do not need to explicitly set the master with setMaster; instead you specify the run mode through --master when the job is submitted, and the dependent jars must likewise be passed to the cluster with the --jars parameter, otherwise an exception will be thrown at runtime. Finally, the Solr in this example runs in standalone mode, so building the index with Spark does not reach its full value; the real power appears with a search cluster, as in the architecture diagram I drew, where every machine holds a shard. That is the SolrCloud mode, or a shard in an Elasticsearch cluster, and only then can truly efficient, large-scale index building be achieved.