Solutions for synonyms and associated words in full-text search

Source: Internet
Author: User
I have been looking for a good synonym solution for full-text search. Baidu and Google turn up only a few passing remarks on the problem, and nothing that explains it clearly. I could not find anything on JavaEye either, so I posted a question there; it collected page views but no replies. I don't know why, so I had to work it out myself.

Since I found no other solutions to refer to, what follows is only my personal opinion.

 

As I see it, retrieval of synonyms and associated words takes roughly the following three forms:

1. Like Google Suggest: the system automatically prompts the user with suggestions as keywords are typed.

2. If "Beijing" is associated with "Olympic Games", a search for "Olympic Games" returns the results for "Olympic Games" together with the results for "Beijing".

3. The user searches for "Olympic Games" and only the results for "Olympic Games" are displayed. Its related words are shown as "related searches" at the bottom of the result set.

 

 

Of the three methods, the 2nd and the 3rd look similar, but they are in fact quite different. I personally think the 3rd method is the most common in real applications.

 

Let's talk about the 1st method first

Complete solutions in the style of Google Suggest can be found on the Internet; what follows is the approach I had previously thought of.

Create a table for storing keywords

For example:

SQL Code
    CREATE TABLE keys (
        item_id    VARCHAR2(50) NOT NULL,
        search_key VARCHAR2(100)
    );
 

The search_key column stores search keywords, such as "Olympics", "Olympic Games", and "Beijing Olympics".

As the user types in the text box, use Ajax to send the current content of the text box to the action, which runs a LIKE query.

Java code
    String key = request.getParameter("inputvalue");
    // Bind the pattern "%" + key + "%" as a query parameter rather than embedding it in the SQL text
    String sql = "SELECT * FROM keys k WHERE k.search_key LIKE ?";
 

This approach runs a fuzzy query against the database with whatever is currently in the text box. For example, if you type "Ao" (奥, the first character of the Chinese word for "Olympics"), every keyword containing "Ao" is found, packaged up, and sent back to the page. JavaScript then draws a drop-down box and fills it with those keywords.

Google matches on the right side only, that is, LIKE 'key%'. If you type "Ao", only the keywords that start with "Ao" are displayed.
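The difference between the two LIKE patterns can be sketched in memory, without a database. This is only an illustration: the class name SuggestMatch, the method names, and the English sample keywords are assumptions made for the example, not part of the original system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SuggestMatch {
    // Equivalent of: search_key LIKE '%key%' (the key may appear anywhere)
    static List<String> containsMatch(List<String> keywords, String key) {
        List<String> out = new ArrayList<>();
        for (String k : keywords) {
            if (k.contains(key)) {
                out.add(k);
            }
        }
        return out;
    }

    // Equivalent of: search_key LIKE 'key%' (the key must be a prefix)
    static List<String> prefixMatch(List<String> keywords, String key) {
        List<String> out = new ArrayList<>();
        for (String k : keywords) {
            if (k.startsWith(key)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("Olympics", "Olympic Games", "Beijing Olympics");
        System.out.println(containsMatch(keywords, "Olym")); // [Olympics, Olympic Games, Beijing Olympics]
        System.out.println(prefixMatch(keywords, "Olym"));   // [Olympics, Olympic Games]
    }
}
```

The prefix form usually suits a suggestion box better, because the user is still typing the beginning of the keyword.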

Note that each Google Suggest prompt ends with "about XXXXX records". The only solution I can think of for now is to run a search for each matching keyword in the action and take the number of records from it. In practice this does not consume many resources.
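A minimal sketch of that idea follows, assuming a hypothetical countHits function that stands in for the real search call; here it simply counts in-memory strings that contain the keyword. The class name, method names, and sample documents are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SuggestWithCounts {
    // Hypothetical stand-in for a real search: counts documents containing the keyword.
    static int countHits(List<String> documents, String keyword) {
        int n = 0;
        for (String doc : documents) {
            if (doc.contains(keyword)) {
                n++;
            }
        }
        return n;
    }

    // Builds the labels for the drop-down, one search per suggested keyword.
    static List<String> label(List<String> suggestions, List<String> documents) {
        List<String> out = new ArrayList<>();
        for (String s : suggestions) {
            out.add(s + " (about " + countHits(documents, s) + " records)");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Beijing Olympics opening ceremony",
            "Olympics medal table",
            "Weather in Beijing");
        System.out.println(label(Arrays.asList("Olympics", "Beijing Olympics"), docs));
    }
}
```

Since the number of suggestions shown is small, this costs only a handful of extra searches per keystroke.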

 

2nd method

Include the results of words associated with the keyword in the search results

I personally think this feature should be used only sparingly in real applications, because for the sake of user experience we must protect the quality of the search results. A keyword's synonyms can instead be suggested to the user at the bottom of the result page, which is the 3rd method mentioned above.

By sparingly I mean about 5% of cases: implement this function only for the few words that really must be converted.

For example, a search for "China" should certainly bring the results for the synonymous name of the country into the current result set, and a search for "08" should also bring in the results for "2008". These are specific cases. Other words do not need this treatment; a prompt is enough.

Chinese search is currently based on word segmentation in most cases. To implement the 2nd method, you have to work with the word segmentation dictionary and the word segmentation algorithm.

The general approach is to write synonyms on a single line in the dictionary, for example "China" and its synonym on the same line, and then modify the tokenization algorithm. (You can look into this yourself. The analyzer package I am using is a trial package provided by my company's commercial partner. It already meets a large part of my needs, but I do not yet know whether it can be shared.)

The principle is that the original word and its synonyms are indexed together at indexing time.

Here is an example.

Take the sentence "China is the most populous country in the world".

Without synonyms, it might be indexed like this:

[China] [world] [population] [most] [country]

With synonyms, it is indexed like this:

[China] [synonym of China] [world] [population] [most] [country]

That is, the word "China" is stored during indexing as two terms, itself and its synonym, so the sentence can be found whether the user searches for "China" or for the synonym.
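The principle of indexing a word together with its synonyms can be sketched with a tiny in-memory inverted index. This is not the commercial analyzer mentioned above and it does not use Lucene; the SynonymIndex class and its methods are hypothetical, and the tokens are assumed to be segmented already. The "08"/"2008" pair is the one from the earlier example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SynonymIndex {
    // word -> all words on the same dictionary line (including itself)
    private final Map<String, Set<String>> synonyms = new HashMap<>();
    // term -> ids of the documents that contain it or one of its synonyms
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Registers one dictionary line: every word on it is a synonym of the others.
    void addSynonymLine(String line) {
        Set<String> group = new LinkedHashSet<>(Arrays.asList(line.trim().split("\\s+")));
        for (String word : group) {
            synonyms.put(word, group);
        }
    }

    // Indexes already-segmented tokens, storing each token together with its synonyms.
    void index(int docId, List<String> tokens) {
        for (String token : tokens) {
            Set<String> group = synonyms.getOrDefault(token, Collections.singleton(token));
            for (String term : group) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of the documents indexed under the term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        SynonymIndex idx = new SynonymIndex();
        idx.addSynonymLine("08 2008");
        idx.index(1, Arrays.asList("Beijing", "2008", "Olympics"));
        System.out.println(idx.search("08"));   // [1]
        System.out.println(idx.search("2008")); // [1]
    }
}
```

Because the expansion happens at indexing time, the query side needs no changes; the cost is a larger index and a full reindex whenever the synonym dictionary changes.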

In practice, much of this functionality depends on third-party JAR packages, which is not very interesting. In most cases what we need is not this feature but a more user-friendly query prompt, that is, the 3rd method.

 

3rd method

I added this method as an improvement to an existing system.

The original system is Compass + Paoding + Lucene.

Because I am not familiar with searching through Compass, I still search through Lucene directly. I assume you are already familiar with full-text search on this kind of stack. The Paoding dictionary can be configured by yourself.

So how can the keyword prompt function be implemented on top of that?

This is what I did. Following the format of the Paoding dictionary, I created a new file, mydictionary.dic, and put it on the classpath. Its content is roughly as follows.

Row 3: Olympic Games Beijing 1st 2008 Olympic Games

Row 3: China People's Republic of China Telecom People's Bank of China

........

 

Then, when the server starts, read the mydictionary.dic file line by line and index each line as a separate document.

Java code
    // Lucene 2.x API: index each line of mydictionary.dic as one document
    InputStream fi = this.getClass().getClassLoader().getResourceAsStream("mydictionary.dic");
    File indexDir = new File("D:\\Tong");
    Analyzer luceneAnalyzer = new StandardAnalyzer();
    // true = create a new index, overwriting any existing one
    IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
    BufferedReader reader = new BufferedReader(new InputStreamReader(fi, "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null)
    {
        Document document = new Document();
        // Store the whole line and tokenize it so any word on it can be matched
        Field fieldName = new Field("line", line, Field.Store.YES, Field.Index.TOKENIZED);
        document.add(fieldName);
        indexWriter.addDocument(document);
    }
    indexWriter.optimize();
    indexWriter.close();
    reader.close();

 

After the normal search for the user's keyword, run a second search for the same keyword against the synonym index. The matching line is a list of words separated by spaces; split it to get all the synonyms, then send them to the page.

Java code
    // Looks up the synonyms of a search word in the synonym index (Lucene 2.x API)
    public List getTongyi(String searchWord) throws Exception
    {
        List list = new ArrayList();
        Hits hits = null;
        String queryString = searchWord;
        Query query = null;
        String result = "";
        IndexSearcher searcher = new IndexSearcher("D:\\Tong");
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser qp = new QueryParser("line", analyzer);
        query = qp.parse(queryString);
        if (searcher != null)
        {
            hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++)
            {
                Document doc = hits.doc(i);
                System.out.println((i + 1) + "." + doc.get("line"));
                // Note: only the last matching line is kept
                result = doc.get("line");
            }
            searcher.close();
        }
        if (result != null && !result.equals(""))
        {
            // Each dictionary line holds synonyms separated by spaces
            String[] manyResult = result.split(" ");
            for (int i = 0; i < manyResult.length; i++)
            {
                if (manyResult[i] != null && !manyResult[i].trim().equals(""))
                {
                    list.add(manyResult[i]);
                }
            }
        }
        return list;
    }

I think StandardAnalyzer is the best choice both for building the synonym index and for searching it.
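To make the "related searches" lookup easier to follow without a Lucene setup, here is a minimal in-memory sketch. It replaces the Lucene query with a plain scan over the dictionary lines, so it only matches whole words, whereas the tokenized index may match more loosely. The class name RelatedSearches, the method related, and the sample lines are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RelatedSearches {
    // Returns the other words on the first dictionary line that contains the query word.
    static List<String> related(List<String> dictionaryLines, String word) {
        List<String> out = new ArrayList<>();
        for (String line : dictionaryLines) {
            List<String> words = Arrays.asList(line.trim().split("\\s+"));
            if (words.contains(word)) {
                for (String w : words) {
                    // Skip the query word itself and any repeated word
                    if (!w.equals(word) && !out.contains(w)) {
                        out.add(w);
                    }
                }
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Olympics Beijing 2008", "football basketball");
        System.out.println(related(lines, "Beijing")); // [Olympics, 2008]
    }
}
```

The returned words are what would be shown as "related searches" at the bottom of the result page.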

 

That is all I have for now. If you have better ideas, I would be glad to work through them together.
