How to build indexes massively in parallel using Spark

Source: Internet
Author: User
Tags: solr

Building an index with Spark is simple, because Spark provides a more advanced abstraction than the earlier Hadoop MapReduce approach: the RDD (Resilient Distributed Dataset), which is well suited to building large-scale indexes. Spark also brings a number of other advantages, such as a more flexible API, higher performance, and more concise syntax.

First, look at the overall topology diagram:

[Topology diagram: Spark worker nodes build the index in parallel, each writing into a shard of the search cluster (SolrCloud, or a sharded Elasticsearch cluster).]
Then take a look at the Spark program, written in Scala:

Scala code

package com.easy.build.index

import java.util

import org.apache.solr.client.solrj.beans.Field
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.annotation.meta.field

/**
  * Created by Qindongliang on 2016/1/21.
  */

// Register the model. The time fields can be strings, as long as the backend index
// field is configured as long; the annotation mapping looks like this:
case class Record(
  @(Field@field)("rowkey")  rowkey: String,
  @(Field@field)("title")   title: String,
  @(Field@field)("content") content: String,
  @(Field@field)("isdel")   isdel: String,
  @(Field@field)("t1")      t1: String,
  @(Field@field)("t2")      t2: String,
  @(Field@field)("t3")      t3: String,
  @(Field@field)("dtime")   dtime: String
)

/***
  * Spark builds the index ==> Solr
  */
object SparkIndex {

  // Solr client
  val client = new HttpSolrClient("http://192.168.1.188:8984/solr/monitor")
  // Number of documents per batch commit
  val batchCount = 10000

  def main2(args: Array[String]) {
    val d1 = new Record("row1", "title", "content", "1", "n", "", "+", "3")
    val d2 = new Record("row2", "title", "content", "1", "n", "", "+", "45")
    val d3 = new Record("row3", "title", "content", "1", "n", "$", "$", null)
    client.addBean(d1)
    client.addBean(d2)
    client.addBean(d3)
    client.commit()
    println("Submitted successfully!")
  }

  /***
    * Iterate over one partition's data (an iterator) and process it
    * @param lines the data of this partition
    */
  def indexPartition(lines: scala.Iterator[String]): Unit = {
    // Initialize the collection; resources such as database connections can also be
    // initialized here, before the partition iteration begins
    val datas = new util.ArrayList[Record]()
    // Iterate over each line and submit the data whenever the batch condition is met
    lines.foreach(line => indexLineToModel(line, datas))
    // After the partition has been processed, resources can be closed and the remaining data committed
    commitSolr(datas, true)
  }

  /***
    * Submit index data to Solr
    *
    * @param datas index data
    * @param isEnd whether this is the last commit
    */
  def commitSolr(datas: util.ArrayList[Record], isEnd: Boolean): Unit = {
    // Commit only on the last commit (if there is data) or when the collection size reaches the batch size
    if ((datas.size() > 0 && isEnd) || datas.size() == batchCount) {
      client.addBeans(datas)
      client.commit() // Commit the data
      datas.clear()   // Clear the collection so it can be reused
    }
  }

  /***
    * Take one line of the partition data and map it to the model
    * for subsequent index processing
    *
    * @param line  one line of data
    * @param datas the collection used to batch-submit the index
    */
  def indexLineToModel(line: String, datas: util.ArrayList[Record]): Unit = {
    // Clean and convert the raw fields
    val fields = line.split("\1", -1).map(field => etl_field(field))
    // Map the cleaned array into a tuple
    val tuple = buildTuble(fields)
    // Convert the tuple into the bean type
    val recoder = Record.tupled(tuple)
    // Add the entity to the collection for batch submission
    datas.add(recoder)
    // Submit the index to Solr (only when the batch size is reached)
    commitSolr(datas, false)
  }

  /***
    * Map the array into a tuple so it can be bound to the bean
    * @param array the field array
    * @return the tuple
    */
  def buildTuble(array: Array[String]): (String, String, String, String, String, String, String, String) = {
    array match {
      case Array(s1, s2, s3, s4, s5, s6, s7, s8) => (s1, s2, s3, s4, s5, s6, s7, s8)
    }
  }

  /***
    * Process one field:
    * an empty value is replaced with null so Solr does not index that field;
    * any other value is returned as is
    *
    * @param field the raw field value
    * @return the mapped value
    */
  def etl_field(field: String): String = {
    field match {
      case "" => null
      case _  => field
    }
  }

  /***
    * Delete a class of indexed data by query
    * @param query the delete condition
    */
  def deleteSolrByQuery(query: String): Unit = {
    client.deleteByQuery(query)
    client.commit()
    println("Delete succeeded!")
  }

  def main(args: Array[String]) {
    // Delete some existing data by condition
    deleteSolrByQuery("t1:03")
    // Remote submit: the packaged jar is required
    val jarPath = "target\\spark-build-index-1.0-snapshot.jar"
    // Remote submit: impersonate the relevant Hadoop user, otherwise access to HDFS may be denied
    System.setProperty("user.name", "webmaster")
    // Initialize SparkConf
    val conf = new SparkConf().setMaster("spark://192.168.1.187:7077").setAppName("Build Index")
    // Upload the jars the job depends on at runtime
    val seq = Seq(jarPath) :+ "D:\\tmp\\lib\\noggit-0.6.jar" :+ "D:\\tmp\\lib\\httpclient-4.3.1.jar" :+ "D:\\tmp\\lib\\httpcore-4.3.jar" :+ "D:\\tmp\\lib\\solr-solrj-5.1.0.jar" :+ "D:\\tmp\\lib\\httpmime-4.3.1.jar"
    conf.setJars(seq)
    // Initialize the SparkContext
    val sc = new SparkContext(conf)
    // Every file under this directory will be indexed; the line format must follow the agreed convention
    val rdd = sc.textFile("hdfs://192.168.1.187:9000/user/monitor/gs/")
    // Build the index from the RDD
    indexRDD(rdd)
    // Close the Solr client
    client.close()
    // Stop the SparkContext
    sc.stop()
  }

  /***
    * Process the RDD data and build the index
    * @param rdd
    */
  def indexRDD(rdd: RDD[String]): Unit = {
    // Iterate over the partitions and build the index for each one
    rdd.foreachPartition(line => indexPartition(line))
  }
}
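For reference (the original post does not show a sample record), each input line handed to indexLineToModel is expected to carry the eight Record fields joined by the \1 (ASCII SOH, 0x01) control character, since that is the separator the code splits on. A minimal sketch of such a line, with made-up field values:

// Hypothetical example of one line in the agreed input format:
// the eight Record fields, joined by the \u0001 (SOH) separator that line.split("\1", -1) expects.
val line = Seq("row1", "a title", "some content", "0", "n", "t1-value", "t2-value", "20160121121314")
  .mkString("\u0001")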



OK, at this point our index-building program is finished. This example uses the remote-submit mode, but it can also run in Spark on YARN (cluster or client) mode. Note that in that case you should not set the master explicitly with setMaster; instead, specify it with --master when submitting the job, and ship the jars the job depends on to the cluster with the --jars parameter, otherwise an exception is thrown at runtime. Finally, the Solr in this example is a stand-alone instance, so building the index with Spark does not reach its maximum value here. It is most powerful against a search cluster, as in the architecture diagram above, where each machine is a shard: the SolrCloud mode, or the shards of an Elasticsearch cluster. That is how truly efficient, large-scale batch index construction is achieved.
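As a rough, hypothetical sketch of that submit-time configuration (the object name, jar names, and spark-submit arguments below are illustrative, not taken from the original post), the driver for Spark on YARN mode could be reduced to something like this:

package com.easy.build.index

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of a driver meant for spark-submit / Spark on YARN mode:
// no setMaster and no setJars in the code, because both are supplied at submit time, for example:
//   spark-submit --master yarn --jars solr-solrj-5.1.0.jar,httpclient-4.3.1.jar,httpcore-4.3.jar,httpmime-4.3.1.jar,noggit-0.6.jar \
//     --class com.easy.build.index.SparkIndexOnYarn spark-build-index-1.0-snapshot.jar
object SparkIndexOnYarn {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Build Index")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs://192.168.1.187:9000/user/monitor/gs/")
    SparkIndex.indexRDD(rdd) // reuse the indexing logic from the listing above
    SparkIndex.client.close()
    sc.stop()
  }
}

And to target a SolrCloud cluster rather than a stand-alone Solr, the HttpSolrClient above could be swapped for SolrJ's CloudSolrClient, which connects through ZooKeeper and routes documents to the correct shard (the ZooKeeper addresses and collection name here are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrClient

// Placeholder ZooKeeper ensemble and collection name; adjust to the actual cluster.
val client = new CloudSolrClient("192.168.1.186:2181,192.168.1.187:2181,192.168.1.188:2181")
client.setDefaultCollection("monitor")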
