Java implementation of a simple search engine

Last Update:2017-01-19 Source: Internet

Author: User

Tags array sort static class stringbuffer

I remember my Java teacher once mentioning a Baidu interview question, roughly: "Given 10,000 unordered records, how do you quickly find the ones you want?" That is essentially a simple search engine. While tidying up this year's work recently, I found I had already implemented something like this, so today I abstracted it a little further to share with you.

The implementation code comes first; the ideas and logic behind it are explained after the code.

The bean used for sorting when searching

/**
 * @Description: bean used for sorting search results
 */
package cn.lulei.search.engine.model;

public class SortBean {
  // unique identifier of the matched object
  private String id;
  // number of search-term characters that matched this object
  private int times;

  public String getId() {
    return id;
  }
  public void setId(String id) {
    this.id = id;
  }
  public int getTimes() {
    return times;
  }
  public void setTimes(int times) {
    this.times = times;
  }
}

The search data structure and a simple search algorithm

/**
 * @Description: search data structure and a simple search algorithm
 */
package cn.lulei.search.engine;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

import cn.lulei.search.engine.model.SortBean;

public class SerachBase {
  // details stores the details of each search object; the key is the object's unique identifier
  private HashMap<String, Object> details = new HashMap<String, Object>();
  // The keywords that take part in the search are stored in a sparse array here.
  // A HashMap would also work, declared as follows:
  // private static HashMap<Integer, HashSet<String>> keySearch = new HashMap<Integer, HashSet<String>>();
  // The HashMap key corresponds to the index in the sparse array, the value to the array element at that index.
  // One slot per possible char value (0..Character.MAX_VALUE), hence the + 1
  private final static int maxLength = Character.MAX_VALUE + 1;
  @SuppressWarnings("unchecked")
  private HashSet<String>[] keySearch = new HashSet[maxLength];

  /** @Description: singleton, implemented with the initialization-on-demand holder idiom. @Version: 1.1.0 */
  private static class LazyLoadSerachBase {
    private static final SerachBase serachBase = new SerachBase();
  }

  /** The constructor is private to enforce the singleton */
  private SerachBase() {}

  /** @Description: get the singleton */
  public static SerachBase getSerachBase() {
    return LazyLoadSerachBase.serachBase;
  }

  /** @Description: get the details for an id */
  public Object getObject(String id) {
    return details.get(id);
  }

  /** @Description: get the details for several ids, separated by "," */
  public List<Object> getObjects(String ids) {
    if (ids == null || "".equals(ids)) {
      return null;
    }
    List<Object> objs = new ArrayList<Object>();
    String[] idArray = ids.split(",");
    for (String id : idArray) {
      objs.add(getObject(id));
    }
    return objs;
  }

  /** @Description: find the ids matching a search term; the ids are separated by "," */
  public String getIds(String key) {
    if (key == null || "".equals(key)) {
      return null;
    }
    // idTimes stores, for each id, how many characters of the search term appear in it
    HashMap<String, Integer> idTimes = new HashMap<String, Integer>();
    // ids stores the ids in which some character of the search term appears
    HashSet<String> ids = new HashSet<String>();
    // look up each character of the search term in the index
    for (int i = 0; i < key.length(); i++) {
      int at = key.charAt(i);
      // the index has no entry for this character: move on to the next one
      if (keySearch[at] == null) {
        continue;
      }
      for (Object obj : keySearch[at].toArray()) {
        String id = (String) obj;
        int times = 1;
        if (ids.contains(id)) {
          times += idTimes.get(id);
          idTimes.put(id, times);
        } else {
          ids.add(id);
          idTimes.put(id, times);
        }
      }
    }
    // sort with a list of SortBean
    List<SortBean> sortBeans = new ArrayList<SortBean>();
    for (String id : ids) {
      SortBean sortBean = new SortBean();
      sortBeans.add(sortBean);
      sortBean.setId(id);
      sortBean.setTimes(idTimes.get(id));
    }
    Collections.sort(sortBeans, new Comparator<SortBean>() {
      public int compare(SortBean o1, SortBean o2) {
        return o2.getTimes() - o1.getTimes();
      }
    });
    // build the return string
    StringBuffer sb = new StringBuffer();
    for (SortBean sortBean : sortBeans) {
      sb.append(sortBean.getId());
      sb.append(",");
    }
    // release resources
    idTimes.clear();
    idTimes = null;
    ids.clear();
    ids = null;
    sortBeans.clear();
    sortBeans = null;
    return sb.toString();
  }

  /** @Description: add a search record */
  public void add(String id, String searchKey, Object obj) {
    // if any parameter is null, do not load the record
    if (id == null || searchKey == null || obj == null) {
      return;
    }
    // save the object
    details.put(id, obj);
    // save the search terms
    addSearchKey(id, searchKey);
  }

  /** @Description: add search terms to the search index */
  private void addSearchKey(String id, String searchKey) {
    // This is a private method, so the check could be skipped; it is kept for consistency
    if (id == null || searchKey == null) {
      return;
    }
    // Per-character tokenization; a mature tokenizer could be used here instead
    for (int i = 0; i < searchKey.length(); i++) {
      // at is the array index; the HashSet of ids is the array element
      int at = searchKey.charAt(i);
      if (keySearch[at] == null) {
        HashSet<String> value = new HashSet<String>();
        keySearch[at] = value;
      }
      keySearch[at].add(id);
    }
  }
}

Test Cases:

/**
 * @Description: test for SerachBase
 */
package cn.lulei.search.engine.test;

import java.util.List;

import cn.lulei.search.engine.SerachBase;

public class Test {
  public static void main(String[] args) {
    SerachBase serachBase = SerachBase.getSerachBase();
    // add(id, search key, stored object)
    serachBase.add("1", "Hello!", "How do you do?");
    serachBase.add("2", "Hello! I'm John.", "How do you do? I'm John.");
    serachBase.add("3", "The weather is very good today.", "The weather today is very good.");
    serachBase.add("4", "Who are you?", "who are you?");
    serachBase.add("5", "High numbers is a difficult subject", "High numbers are really difficult.");
    serachBase.add("6", "Test", "The above is just a test");
    String ids = serachBase.getIds("Your high number");
    System.out.println(ids);
    List<Object> objs = serachBase.getObjects(ids);
    if (objs != null) {
      for (Object obj : objs) {
        System.out.println((String) obj);
      }
    }
  }
}

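The test output is shown below. The ranking appears to come from the original Chinese test strings; because matching is per character, running the test with the translated English strings may rank the records differently.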

5,3,2,1,4,
High numbers are really difficult.
The weather today is very good.
How do you do? I'm John.
How do you do?
who are you?

That completes a simple search engine.

Problem one: the tokenization used here is per character. That works fairly well for Chinese, but it is very weak for English.

Improvement: use a mature tokenizer such as IKAnalyzer or StandardAnalyzer. With that change the keySearch data structure has to change as well, for example to private HashMap<String, String>[] keySearch = new HashMap[maxLength]; where the key stores the token and the value stores the unique id.
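As a rough illustration of this improvement, the sketch below indexes whole words instead of single characters. It does not use IKAnalyzer or StandardAnalyzer: a plain regular-expression split stands in for a real tokenizer. The map layout (token to set of ids) is only one possible layout, not the structure suggested above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a word-level inverted index (token -> ids).
class WordIndexSketch {
  private final Map<String, Set<String>> index = new HashMap<>();

  // Stand-in tokenizer (an assumption, not a real analyzer):
  // lower-case the text, then split on runs of characters that are neither letters nor digits.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // Register every word of the search key under the record's id.
  void add(String id, String searchKey) {
    for (String token : tokenize(searchKey)) {
      index.computeIfAbsent(token, k -> new HashSet<>()).add(id);
    }
  }

  // Ids of the records whose search key contains the given word.
  Set<String> idsFor(String word) {
    return index.getOrDefault(word.toLowerCase(Locale.ROOT), new HashSet<>());
  }

  public static void main(String[] args) {
    WordIndexSketch idx = new WordIndexSketch();
    idx.add("1", "Hello!");
    idx.add("2", "Hello! I'm John.");
    System.out.println(idx.idsFor("hello")); // both records contain "hello"
    System.out.println(idx.idsFor("john"));  // only record 2 contains "john"
  }
}
```

With word-level tokens, an English query such as "john" no longer matches every record that happens to share a letter with it, which is the weakness of per-character matching described above.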

Problem two: unlike Lucene, the search engine in this article does not assign weights to tokens; it only checks whether a token appears in the object.

Improvement: none for now. Adding weights would make the data structure more complex, so it is left out for the time being; a future article will implement weighting.

The following is a brief introduction to the ideas behind the implementation.

The SerachBase class has two fields, details and keySearch: details stores the details of each object, and keySearch indexes the search field. details is a HashMap; keySearch is a sparse array (it could also be a HashMap, whose key corresponds to the index in the sparse array and whose value corresponds to the array element at that index).
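The HashMap alternative mentioned in the parentheses could look roughly like the sketch below. It is only an illustration under the assumption that the key is the int value of a character and the value is the set of ids; the class and method names are hypothetical and do not appear in the article's code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of keySearch stored in a HashMap instead of a sparse array.
class MapKeySearchSketch {
  // key: int value of a character (the array index in the sparse-array version)
  // value: ids of the records whose search key contains that character
  private final Map<Integer, Set<String>> keySearch = new HashMap<>();

  void addSearchKey(String id, String searchKey) {
    for (int i = 0; i < searchKey.length(); i++) {
      int at = searchKey.charAt(i);
      keySearch.computeIfAbsent(at, k -> new HashSet<>()).add(id);
    }
  }

  // Ids indexed under one character; empty when the character was never seen.
  Set<String> idsFor(char c) {
    return keySearch.getOrDefault((int) c, Collections.emptySet());
  }

  // Number of distinct characters indexed; only these occupy an entry.
  int size() {
    return keySearch.size();
  }

  public static void main(String[] args) {
    MapKeySearchSketch sketch = new MapKeySearchSketch();
    sketch.addSearchKey("1", "ab");
    sketch.addSearchKey("2", "bc");
    System.out.println(sketch.idsFor('b')); // ids 1 and 2
    System.out.println(sketch.size());      // 3 distinct characters: a, b, c
  }
}
```

The trade-off is the usual one: the array reserves a slot for every possible char value up front and gives direct indexing, while the map only holds entries for characters that actually occur.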

I won't say much about details.

The index into the keySearch array (or the key, if a HashMap is used) is the int value of a token's first character. Because this article tokenizes per character, each character is a token, so the int value of the character is the array index. The array element at that index holds the unique ids of the objects whose search key contains that character.

[Figure: the keySearch data structure]

So to add a new record, you only need to call the add method.
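To make the layout concrete, here is a small sketch that fills a sparse array the same way as described above and prints its non-empty slots. The helper names (`build`, `describe`) and the two sample records are made up for illustration; the array is sized `Character.MAX_VALUE + 1` so that every char value has a slot.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: what the sparse array holds after a few records are added.
class KeySearchLayoutSketch {
  // Each record is {id, searchKey}; every character of the key indexes the id.
  @SuppressWarnings("unchecked")
  static Set<String>[] build(String[][] records) {
    Set<String>[] keySearch = new HashSet[Character.MAX_VALUE + 1];
    for (String[] record : records) {
      String id = record[0];
      String searchKey = record[1];
      for (int i = 0; i < searchKey.length(); i++) {
        int at = searchKey.charAt(i);
        if (keySearch[at] == null) {
          keySearch[at] = new HashSet<>();
        }
        keySearch[at].add(id);
      }
    }
    return keySearch;
  }

  // Render the non-empty slots as "index ('char') -> [ids]" lines, ids sorted for readability.
  static String describe(Set<String>[] keySearch) {
    StringBuilder sb = new StringBuilder();
    for (int at = 0; at < keySearch.length; at++) {
      if (keySearch[at] != null) {
        sb.append(at).append(" ('").append((char) at).append("') -> ")
          .append(new TreeSet<>(keySearch[at])).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Set<String>[] keySearch = build(new String[][] {{"1", "ab"}, {"2", "bc"}});
    System.out.print(describe(keySearch));
    // prints:
    // 97 ('a') -> [1]
    // 98 ('b') -> [1, 2]
    // 99 ('c') -> [2]
  }
}
```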

The search logic mirrors the keySearch structure above. Looking up details by id uses the HashMap get method directly. For a search term, the process is: tokenize first, then query, then sort. The tokenization used when searching must of course match the one used when building the index (if the index was built with per-character tokenization, the search must use per-character tokenization too).
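One way to keep the two tokenizations consistent is to route indexing and searching through the same tokenizer object. The sketch below shows that idea under stated assumptions: the class, its `tokenizer` field, and the map-based index are hypothetical, and only the tokenize and query steps are shown (sorting is left out).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: indexing and searching share one tokenizer.
class SharedTokenizerSketch {
  private final Function<String, List<String>> tokenizer;
  private final Map<String, Set<String>> index = new HashMap<>();

  SharedTokenizerSketch(Function<String, List<String>> tokenizer) {
    this.tokenizer = tokenizer;
  }

  // Per-character tokenizer, as in the article: every char is one token.
  static List<String> perCharacter(String text) {
    List<String> tokens = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      tokens.add(String.valueOf(text.charAt(i)));
    }
    return tokens;
  }

  // Indexing tokenizes the search key with the shared tokenizer.
  void add(String id, String searchKey) {
    for (String token : tokenizer.apply(searchKey)) {
      index.computeIfAbsent(token, k -> new HashSet<>()).add(id);
    }
  }

  // Searching tokenizes the query with the same tokenizer, then collects matching ids.
  Set<String> search(String key) {
    Set<String> ids = new HashSet<>();
    for (String token : tokenizer.apply(key)) {
      Set<String> hits = index.get(token);
      if (hits != null) {
        ids.addAll(hits);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    SharedTokenizerSketch engine = new SharedTokenizerSketch(SharedTokenizerSketch::perCharacter);
    engine.add("1", "ab");
    engine.add("2", "cd");
    System.out.println(engine.search("bx")); // only record 1 shares a character with "bx"
  }
}
```

Because the tokenizer is passed in once, swapping in a different tokenizer changes indexing and searching together.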

In the getIds method, HashMap<String, Integer> idTimes = new HashMap<String, Integer>(); stores how many tokens of the search term were found in keySearch for each record: the key is the unique id and the value is the number of matching tokens. HashSet<String> ids = new HashSet<String>(); stores the ids of the records in which at least one token occurs. The search performs n index lookups, where n is the number of tokens in the search term, and each lookup iterates over the ids stored for that token. Once the ids containing tokens have been collected, a list of SortBean is built and sorted in descending order of the number of matching tokens. Finally the ids are returned as a string, separated by ",". To get the details, call the getObjects method with that string.
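The counting and sorting steps can be sketched more compactly as follows. This is not the article's code: it takes the per-token id sets as input, counts with `Map.merge`, and breaks ties by id so the result is deterministic. The tie-breaking rule is an assumption added here; the article's version leaves ties in whatever order the HashSet yields. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the count-then-sort step of the search.
class RankingSketch {
  // postings: for each token of the search term, the ids containing it
  // (null when the token is not in the index).
  static String rank(List<Set<String>> postings) {
    // idTimes: id -> number of search-term tokens that matched it
    Map<String, Integer> idTimes = new HashMap<>();
    for (Set<String> ids : postings) {
      if (ids == null) {
        continue; // token not indexed, move on to the next one
      }
      for (String id : ids) {
        idTimes.merge(id, 1, Integer::sum);
      }
    }
    // Sort by match count, descending; ties broken by id (an assumption of this sketch).
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(idTimes.entrySet());
    entries.sort((a, b) -> {
      int byTimes = Integer.compare(b.getValue(), a.getValue());
      return byTimes != 0 ? byTimes : a.getKey().compareTo(b.getKey());
    });
    // Join with ",", keeping the trailing separator like the article's getIds.
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Integer> entry : entries) {
      sb.append(entry.getKey()).append(',');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<Set<String>> postings = new ArrayList<>();
    postings.add(new HashSet<>(Arrays.asList("1", "2", "4"))); // first token
    postings.add(Collections.singleton("3"));                  // second token
    postings.add(null);                                        // token not in the index
    postings.add(Collections.singleton("5"));                  // fourth token
    postings.add(Collections.singleton("5"));                  // fifth token
    System.out.println(rank(postings)); // prints 5,1,2,3,4,
  }
}
```

Record 5 comes first because two tokens matched it; the remaining records matched one token each and are ordered by id.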

The above is only a simple search engine without much algorithmic design; I hope it gives you some ideas for your own learning.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
