A Simple Java-Based Search Engine

Source: Internet
Author: User


When reprinting, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/37956749


At school, my Java teacher once mentioned a Baidu interview question that went roughly: "Given a set of unordered records, how do you quickly find the ones you want?" This amounts to a simple search engine. I have implemented something like this at work over the past year, and today I want to abstract it further and share it with you.

The implementation code comes first; the ideas and logic behind it are explained after the code.


Bean used for sorting during search

/**
 * @Description: bean used to sort search results
 */
package cn.lulei.search.engine.model;

public class SortBean {
    private String id;
    private int times;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public int getTimes() {
        return times;
    }

    public void setTimes(int times) {
        this.times = times;
    }
}


Search data structure and a simple search algorithm

/**
 * @Description: search data structure and a simple search algorithm
 */
package cn.lulei.search.engine;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

import cn.lulei.search.engine.model.SortBean;

public class SerachBase {
    // details stores the full information of each searchable object; the key is the object's unique id
    private HashMap<String, Object> details = new HashMap<String, Object>();
    // The search index is stored in a sparse array here; a HashMap would also work, declared as:
    // private static HashMap<Integer, HashSet<String>> keySearch = new HashMap<Integer, HashSet<String>>();
    // The HashMap key corresponds to the array subscript, and the value to the array element at that position
    // One slot per possible char value, 0 to Character.MAX_VALUE inclusive
    // (the original used Character.MAX_VALUE, which leaves no slot for '\uFFFF')
    private final static int maxLength = Character.MAX_VALUE + 1;
    @SuppressWarnings("unchecked")
    private HashSet<String>[] keySearch = new HashSet[maxLength];

    /**
     * @Description: singleton implemented with the Initialization on Demand Holder idiom
     * @Author: lulei
     * @Date: 2014-7-19
     * @Version: 1.1.0
     */
    private static class lazyLoadSerachBase {
        private static final SerachBase serachBase = new SerachBase();
    }

    /**
     * The constructor is private because this class is a singleton
     */
    private SerachBase() {
    }

    /**
     * @return
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: obtain the singleton
     */
    public static SerachBase getSerachBase() {
        return lazyLoadSerachBase.serachBase;
    }

    /**
     * @param id
     * @return
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: obtain the details by id
     */
    public Object getObject(String id) {
        return details.get(id);
    }

    /**
     * @param ids
     * @return
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: obtain the details for several ids, separated by ","
     */
    public List<Object> getObjects(String ids) {
        if (ids == null || "".equals(ids)) {
            return null;
        }
        List<Object> objs = new ArrayList<Object>();
        String[] idArray = ids.split(",");
        for (String id : idArray) {
            objs.add(getObject(id));
        }
        return objs;
    }

    /**
     * @param key
     * @return
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: find the ids matching the search term, separated by ","
     */
    public String getIds(String key) {
        if (key == null || "".equals(key)) {
            return null;
        }
        // idTimes stores, for each id, how many characters of the search term appear in it
        HashMap<String, Integer> idTimes = new HashMap<String, Integer>();
        // ids stores the ids in which any character of the search term appears
        HashSet<String> ids = new HashSet<String>();
        // Look up each character of the search term
        for (int i = 0; i < key.length(); i++) {
            int at = key.charAt(i);
            // If the character is not in the index, move on to the next character
            if (keySearch[at] == null) {
                continue;
            }
            for (Object obj : keySearch[at].toArray()) {
                String id = (String) obj;
                int times = 1;
                if (ids.contains(id)) {
                    times += idTimes.get(id);
                    idTimes.put(id, times);
                } else {
                    ids.add(id);
                    idTimes.put(id, times);
                }
            }
        }
        // Sort by number of matches, descending
        List<SortBean> sortBeans = new ArrayList<SortBean>();
        for (String id : ids) {
            SortBean sortBean = new SortBean();
            sortBeans.add(sortBean);
            sortBean.setId(id);
            sortBean.setTimes(idTimes.get(id));
        }
        Collections.sort(sortBeans, new Comparator<SortBean>() {
            public int compare(SortBean o1, SortBean o2) {
                return o2.getTimes() - o1.getTimes();
            }
        });
        // Build the return string (each id is followed by ",")
        StringBuffer sb = new StringBuffer();
        for (SortBean sortBean : sortBeans) {
            sb.append(sortBean.getId());
            sb.append(",");
        }
        // Release resources
        idTimes.clear();
        idTimes = null;
        ids.clear();
        ids = null;
        sortBeans.clear();
        sortBeans = null;
        // Return
        return sb.toString();
    }

    /**
     * @param id
     * @param searchKey
     * @param obj
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: add a search record
     */
    public void add(String id, String searchKey, Object obj) {
        // Do not load the record if any parameter is null
        if (id == null || searchKey == null || obj == null) {
            return;
        }
        // Save the object
        details.put(id, obj);
        // Save the search term
        addSearchKey(id, searchKey);
    }

    /**
     * @param id
     * @param searchKey
     * @Date: 2014-7-19
     * @Author: lulei
     * @Description: add the search term to the search index
     */
    private void addSearchKey(String id, String searchKey) {
        // Do not load if any parameter is null.
        // This is a private method, so the check could be omitted, but it is kept for the sake of good practice
        if (id == null || searchKey == null) {
            return;
        }
        // Character-level segmentation is used below; a mature word segmentation tool could be used instead
        for (int i = 0; i < searchKey.length(); i++) {
            // at is the array subscript; the HashSet of ids is the array element
            int at = searchKey.charAt(i);
            if (keySearch[at] == null) {
                HashSet<String> value = new HashSet<String>();
                keySearch[at] = value;
            }
            keySearch[at].add(id);
        }
    }
}

Test cases:

/**
 * @Description: test for the simple search engine
 */
package cn.lulei.search.engine.test;

import java.util.List;

import cn.lulei.search.engine.SerachBase;

public class Test {
    public static void main(String[] args) {
        SerachBase serachBase = SerachBase.getSerachBase();
        // The sample strings below are English renderings of the original Chinese test data
        serachBase.add("1", "Hello!", "Hello!");
        serachBase.add("2", "Hello! I am Zhang San.", "Hello! I am Zhang San.");
        serachBase.add("3", "Today's weather is good.", "Today's weather is good.");
        serachBase.add("4", "who are you?", "Who are you?");
        serachBase.add("5", "High is difficult", "High is really difficult.");
        serachBase.add("6", "test", "test above");
        String ids = serachBase.getIds("your high");
        System.out.println(ids);
        List<Object> objs = serachBase.getObjects(ids);
        if (objs != null) {
            for (Object obj : objs) {
                System.out.println((String) obj);
            }
        }
    }
}

The test output is as follows:

5,3,2,1,4,
High is really difficult.
Today's weather is good.
Hello! I am Zhang San.
Hello!
Who are you?


With that, a simple search engine is complete.

Problem 1: Word segmentation here is done character by character, which works quite well for Chinese but is very weak for English.

Improvement: use a mature tokenizer such as IKAnalyzer or StandardAnalyzer. The keySearch data structure then needs to change to private HashMap<String, String>[] keySearch = new HashMap[maxLength]; where the key stores the token and the value stores the unique id.
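
As a rough sketch of word-level indexing, the class below indexes whole words instead of single characters. It is an illustration under assumptions, not the author's implementation: the class name WordIndexSketch and its methods are hypothetical, a plain regular-expression split stands in for a real analyzer such as IKAnalyzer (so it only suits text with delimiters between words), and for simplicity the index is a single map keyed directly by the token rather than the array of maps described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a word-level inverted index (not the article's SerachBase).
class WordIndexSketch {
    // token -> ids of the records whose search text contains that token
    private final Map<String, Set<String>> index = new HashMap<>();

    // Stand-in tokenizer (assumption): lower-case the text and split on
    // runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Registers every token of searchKey under the given record id.
    void add(String id, String searchKey) {
        for (String token : tokenize(searchKey)) {
            index.computeIfAbsent(token, k -> new HashSet<>()).add(id);
        }
    }

    // Returns the ids whose text contains the given word (empty set if none).
    Set<String> idsFor(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new HashSet<>());
    }

    public static void main(String[] args) {
        WordIndexSketch idx = new WordIndexSketch();
        idx.add("1", "Hello!");
        idx.add("2", "Hello! I am Zhang San.");
        idx.add("3", "Today's weather is good.");
        System.out.println(idx.idsFor("hello"));   // ids 1 and 2 (set order is unspecified)
        System.out.println(idx.idsFor("weather")); // id 3
    }
}
```

The search side would have to run the same tokenize step on the search term, so that indexing and querying agree on what a token is.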


Problem 2: Unlike Lucene, the search engine in this article does not assign weights to tokens; it only checks whether a token appears in the object.

Improvement: none for now. Adding weights would make the data structure more complex, so it is left out for the time being and will be implemented in a future article.
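
One possible direction, sketched here purely as an assumption and not as the author's planned design, is to store how often each character occurs in a record and use that count as a crude weight. The class name WeightedIndexSketch, its methods, and the scoring rule (sum of occurrence counts over the characters of the search term) are all hypothetical. It does show why the structure grows more complex: each index entry becomes a map from id to count instead of a plain set of ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: character index that keeps occurrence counts as weights.
class WeightedIndexSketch {
    // character -> (record id -> number of occurrences of that character in the record)
    private final Map<Character, Map<String, Integer>> index = new HashMap<>();

    void add(String id, String searchKey) {
        for (char c : searchKey.toCharArray()) {
            index.computeIfAbsent(c, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
    }

    // Returns ids sorted by descending score; ties are broken by id so the order is deterministic.
    List<String> search(String key) {
        Map<String, Integer> scores = new HashMap<>();
        for (char c : key.toCharArray()) {
            Map<String, Integer> postings = index.get(c);
            if (postings == null) {
                continue; // character not indexed, try the next one
            }
            for (Map.Entry<String, Integer> e : postings.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }

    public static void main(String[] args) {
        WeightedIndexSketch idx = new WeightedIndexSketch();
        idx.add("1", "aab");
        idx.add("2", "abc");
        // Record 1 scores 3 (a twice, b once); record 2 scores 2 (a once, b once).
        System.out.println(idx.search("ab")); // [1, 2]
    }
}
```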


The following describes how the search engine in this post is implemented. (The idea was inspired by my earlier post on implementing a blocked-word filter: http://blog.csdn.net/xiaojimanman/article/details/16852791)

The SerachBase class has two fields, details and keySearch. details stores the detailed information of each Object, and keySearch is the index over the searchable text. details is a HashMap. keySearch is a sparse array (a HashMap would also work, with the HashMap key corresponding to the array subscript and the value to the array element at that position).

I will not say much more about details.

The array subscript in keySearch (or the key, if a HashMap is used) is the int value of the token's first character. Because this article segments text character by character, that is simply the int value of the character itself. The array element at that subscript is the set of unique ids of the Objects whose search text contains that character.
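
The HashMap alternative might look like the following minimal sketch. The class name MapKeySearchSketch and its methods are hypothetical and chosen only for illustration; the field type follows the commented-out declaration in the source code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: keySearch held in a HashMap instead of a large sparse array.
class MapKeySearchSketch {
    // key = int value of a character (the array subscript in the array version),
    // value = ids of the records containing that character
    private final HashMap<Integer, HashSet<String>> keySearch = new HashMap<>();

    void addSearchKey(String id, String searchKey) {
        for (int i = 0; i < searchKey.length(); i++) {
            int at = searchKey.charAt(i);
            keySearch.computeIfAbsent(at, k -> new HashSet<>()).add(id);
        }
    }

    // Returns a read-only view of the ids containing the character (empty if none).
    Set<String> idsContaining(char c) {
        HashSet<String> ids = keySearch.get((int) c);
        return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
    }

    // Number of distinct characters that have an entry.
    int slotsInUse() {
        return keySearch.size();
    }

    public static void main(String[] args) {
        MapKeySearchSketch index = new MapKeySearchSketch();
        index.addSearchKey("1", "ab");
        index.addSearchKey("2", "bc");
        System.out.println(index.idsContaining('b')); // ids 1 and 2 (set order is unspecified)
        System.out.println(index.slotsInUse());       // 3
    }
}
```

The map only allocates entries for characters that actually occur, at the cost of boxing and hashing on every lookup; the array version trades memory for a direct subscript access.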

To add a new record, you only need to call the add method.


The search logic mirrors the construction of keySearch above. A lookup by id simply uses the get method of HashMap. A lookup by search term proceeds in three steps: tokenize, query, then sort. The tokenization used for searching must of course match the one used for indexing (if character-level segmentation was used to build the index, it must also be used to search).

In the getIds method, HashMap<String, Integer> idTimes records, for each id, how many tokens of the search term appear in that object: the key is the unique id and the value is the number of matching tokens. HashSet<String> ids holds the ids in which any token appears. A search therefore needs n index lookups, where n is the number of tokens in the search term; each lookup then walks the ids stored for that token.

Once the ids containing the tokens have been collected, a list of SortBean objects is built and sorted by the number of matching tokens in descending order. The method returns the ids as a string, with each id followed by a comma. To obtain the detailed objects, pass that string to the getObjects method.
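
The count-then-sort step can be condensed considerably on Java 8 or later. The following is a sketch under assumptions, not a drop-in replacement for getIds: the class RankSketch and its helper buildIndex are hypothetical, the index is a Map<Character, Set<String>> rather than the array used in SerachBase, the result is a List instead of a comma-separated string, and ties are broken by id (the original leaves the order of ties to HashSet iteration order).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the tokenize, query, sort flow with character-level tokens.
class RankSketch {
    // Builds character -> ids from a map of id -> search text.
    static Map<Character, Set<String>> buildIndex(Map<String, String> records) {
        Map<Character, Set<String>> keySearch = new HashMap<>();
        for (Map.Entry<String, String> record : records.entrySet()) {
            for (char c : record.getValue().toCharArray()) {
                keySearch.computeIfAbsent(c, k -> new HashSet<>()).add(record.getKey());
            }
        }
        return keySearch;
    }

    // Counts, per id, how many characters of key hit that id, then sorts descending.
    static List<String> rank(Map<Character, Set<String>> keySearch, String key) {
        Map<String, Integer> idTimes = new HashMap<>();
        for (char c : key.toCharArray()) {
            Set<String> ids = keySearch.get(c);
            if (ids == null) {
                continue; // character not in the index
            }
            for (String id : ids) {
                idTimes.merge(id, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>(idTimes.keySet());
        result.sort((a, b) -> {
            int byTimes = Integer.compare(idTimes.get(b), idTimes.get(a));
            return byTimes != 0 ? byTimes : a.compareTo(b); // tie-break by id (assumption)
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> records = new HashMap<>();
        records.put("1", "abc");
        records.put("2", "bcd");
        records.put("3", "xyz");
        Map<Character, Set<String>> keySearch = buildIndex(records);
        // Record 2 matches b, c and d; record 1 matches b and c; record 3 matches nothing.
        System.out.println(rank(keySearch, "bcd")); // [2, 1]
    }
}
```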


This is only a simple search engine and does not involve any sophisticated algorithms. Criticism and suggestions are welcome.


How can a simple search be implemented in Java?

You can use a full-text search engine written in Java. Lucene is worth studying; it is essentially the standard among open-source search engines.
