Building an index with Spark is straightforward: compared with the older MapReduce-on-Hadoop approach, Spark provides the more advanced and abstract RDD (Resilient Distributed Dataset) for building large-scale indexes, and it brings a number of advantages such as a more flexible API, higher performance, and more concise syntax.
First, look at the overall topology diagram:
Then, take a look at the Spark program, written in Scala:
Scala code
package com.easy.build.index

import java.util

import org.apache.solr.client.solrj.beans.Field
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.annotation.meta.field

/**
  * Created by Qindongliang on 2016/1/21.
  */

// Register the model. A time field can be declared as a String here, as long as the field in the
// Solr index is configured as long; the annotation mapping looks like this:
case class Record(
  @(Field@field)("rowkey") rowkey: String,
  @(Field@field)("title") title: String,
  @(Field@field)("content") content: String,
  @(Field@field)("isdel") isdel: String,
  @(Field@field)("t1") t1: String,
  @(Field@field)("t2") t2: String,
  @(Field@field)("t3") t3: String,
  @(Field@field)("dtime") dtime: String
)

/***
  * Spark builds the index ==> Solr
  */
object SparkIndex {

  // Solr client
  val client = new HttpSolrClient("http://192.168.1.188:8984/solr/monitor")
  // number of documents submitted per batch
  val batchCount = 10000

  def main2(args: Array[String]): Unit = {
    val d1 = new Record("row1", "title", "content", "1", "n", "", "+", "3")
    val d2 = new Record("row2", "title", "content", "1", "n", "", "+", "45")
    val d3 = new Record("row3", "title", "content", "1", "n", "$", "$", null)
    client.addBean(d1)
    client.addBean(d2)
    client.addBean(d3)
    client.commit()
    println("Submitted successfully!")
  }

  /***
    * Iterate over one partition's data (an iterator) and process it
    * @param lines the data of one partition
    */
  def indexPartition(lines: scala.Iterator[String]): Unit = {
    // initialize the collection; resources such as database connections can also be
    // initialized here, before the partition iteration begins
    val datas = new util.ArrayList[Record]()
    // iterate over each line and submit the data once the batch condition is met
    lines.foreach(line => indexLineToModel(line, datas))
    // after the partition has been processed, close resources if needed and commit the remaining data
    commitSolr(datas, true)
  }

  /***
    * Submit index data to Solr
    *
    * @param datas index data
    * @param isEnd whether this is the last commit
    */
  def commitSolr(datas: util.ArrayList[Record], isEnd: Boolean): Unit = {
    // commit only on the final call (if any data is left) or when the collection size reaches the batch size
    if ((datas.size() > 0 && isEnd) || datas.size() == batchCount) {
      client.addBeans(datas)
      client.commit()  // commit the data
      datas.clear()    // clear the collection so it can be reused
    }
  }

  /***
    * Take one line of partition data and map it to the model for subsequent index processing
    *
    * @param line  one line of raw data
    * @param datas collection used to batch-submit the index
    */
  def indexLineToModel(line: String, datas: util.ArrayList[Record]): Unit = {
    // clean and convert the array of fields
    val fields = line.split("\1", -1).map(field => etl_field(field))
    // map the cleaned array into a tuple
    val tuple = buildTuble(fields)
    // convert the tuple into a bean
    val recoder = Record.tupled(tuple)
    // add the entity to the collection for batch submission
    datas.add(recoder)
    // submit the index to Solr (only once the batch is full)
    commitSolr(datas, false)
  }

  /***
    * Map an array into a tuple so it can easily be bound to the bean
    * @param array field array
    * @return tuple
    */
  def buildTuble(array: Array[String]): (String, String, String, String, String, String, String, String) = {
    array match {
      case Array(s1, s2, s3, s4, s5, s6, s7, s8) => (s1, s2, s3, s4, s5, s6, s7, s8)
    }
  }

  /***
    * Field cleaning:
    * an empty value is replaced with null so that Solr does not index the field;
    * a normal value is returned as-is
    *
    * @param field the raw field value
    * @return the cleaned value
    */
  def etl_field(field: String): String = {
    field match {
      case "" => null
      case _  => field
    }
  }

  /***
    * Delete a class of indexed data by query
    * @param query the delete condition
    */
  def deleteSolrByQuery(query: String): Unit = {
    client.deleteByQuery(query)
    client.commit()
    println("Delete succeeded!")
  }

  def main(args: Array[String]): Unit = {
    // delete some data by condition
    deleteSolrByQuery("t1:03")
    // remote submit mode: the packaged jar must be submitted
    val jarPath = "target\\spark-build-index-1.0-SNAPSHOT.jar"
    // remote submit mode: impersonate the relevant Hadoop user, otherwise access to HDFS may be denied
    System.setProperty("user.name", "webmaster")
    // initialize SparkConf
    val conf = new SparkConf().setMaster("spark://192.168.1.187:7077").setAppName("Build Index")
    // upload the jars the job depends on at runtime
    val seq = Seq(jarPath) :+ "D:\\tmp\\lib\\noggit-0.6.jar" :+ "D:\\tmp\\lib\\httpclient-4.3.1.jar" :+ "D:\\tmp\\lib\\httpcore-4.3.jar" :+ "D:\\tmp\\lib\\solr-solrj-5.1.0.jar" :+ "D:\\tmp\\lib\\httpmime-4.3.1.jar"
    conf.setJars(seq)
    // initialize the SparkContext
    val sc = new SparkContext(conf)
    // every file under this directory will be indexed; the line format must follow the agreed convention
    val rdd = sc.textFile("hdfs://192.168.1.187:9000/user/monitor/gs/")
    // build the index from the RDD
    indexRdd(rdd)
    // close the index resource
    client.close()
    // stop the SparkContext
    sc.stop()
  }

  /***
    * Process the RDD data and build the index
    * @param rdd
    */
  def indexRdd(rdd: RDD[String]): Unit = {
    // iterate over the partitions and build the index
    rdd.foreachPartition(line => indexPartition(line))
  }
}
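As the comment in main notes, the input files must follow an agreed format: each line is expected to carry the eight Record fields joined by the \1 (SOH) control character, which is what line.split("\1", -1) relies on. A minimal sketch of that assumption, with made-up sample values:

// Minimal sketch of the assumed input line format: eight fields joined by the \1 (SOH) separator.
// The sample values below are made up for illustration only.
object InputFormatSketch {
  def main(args: Array[String]): Unit = {
    val sampleLine = Seq("row1", "title", "content", "1", "n", "", "", "20160121").mkString("\u0001")
    // the -1 limit keeps trailing empty fields, so empty columns still line up with the case class
    val parsed = sampleLine.split("\u0001", -1)
    println(parsed.length) // 8, matching the eight fields of Record
  }
}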
OK, our index-building program is finished. This example uses the remote-submit mode, but it can also run in spark on yarn (cluster or client) mode. Note that in that case you do not need to explicitly set the master with setMaster; instead you specify the run mode through --master when the job is submitted, and the dependent jars must likewise be passed to the cluster with the --jars parameter, otherwise an exception will be thrown at runtime. Finally, the Solr in this example runs in standalone mode, so building the index with Spark does not reach its full value; the real power appears with a search cluster, as in the architecture diagram I drew, where every machine holds a shard. That is the SolrCloud mode, or a shard in an Elasticsearch cluster, and only then can truly efficient, large-scale index building be achieved.