Using IBM OmniFind Enterprise Edition with a taxonomy dictionary file to classify search results

Last Update:2017-02-27 Source: Internet

Author: User

Tags expression regular expression

This article describes how to use IBM OmniFind Enterprise Edition together with IBM's open source Unstructured Information Management Architecture (UIMA) to extend the OmniFind Enterprise Edition search engine with semantic search and result classification. A concrete example demonstrates the semantic search capabilities of IBM OmniFind Enterprise Edition.

Background information

A search engine actively collects data from computers, analyzes it, and indexes it automatically. The index content is stored in a large database for querying. When a user submits a query, the search engine tells the user where the content can be found and provides the relevant links.

In the current era of information explosion, finding the data you need efficiently is very important. Large amounts of data must be indexed automatically and made available for searching, so search engines play an increasingly important role. How to find the required data and how to ensure the quality of the search results have become the first problems that search engine developers must solve.

Most current search engines, such as Google, AltaVista, Excite, and Baidu, work primarily by keyword search. They build their own databases by extracting information from websites on the Internet and provide users with keyword query services. When a user searches by keyword, the search engine queries its database and, if it finds content that matches the request, returns links to the results. Keyword search is the main technique used by current search engines, but it has a serious weakness: the content being searched for must contain the keyword that the user entered. This strictly limits the results. For example, when we search with the keyword "natural disaster", every result the search engine returns must contain the words "natural disaster". Content that is related to natural disasters but uses other terms, such as earthquakes, volcanoes, tsunamis, tornadoes, or debris flows, is not found.

The IBM OmniFind Enterprise Edition enterprise-class search engine solves this problem by combining with the IBM Unstructured Information Management Architecture (UIMA) to provide semantic search and classification of search results. The search engine administrator only needs to configure the engine and, where necessary, write some code to give the search engine a degree of "intelligence". The engine then returns not only the results of the keyword search but also content that is related to the keyword.

Implementation principle

How UIMA implements a semantic analysis engine

First, you need to know what UIMA is. The Unstructured Information Management Architecture (UIMA) is an open source technology from IBM for searching for specific text, and even concepts, in word processing documents, e-mail, video, and other unstructured information. UIMA is a bridge that transforms unstructured data into structured data, and a standard tool for analyzing and reprocessing information content.

The process by which UIMA parses a file and builds a semantic index includes the following steps:

Semantic analysis of a file requires a standard method for parsing complex strings, and in UIMA regular expressions are commonly used for this. The first step is therefore to define the semantic rules and create the corresponding regular expressions.

UIMA then matches the contents of the file against the regular expressions you created. For each string that matches a rule, UIMA creates an annotation object that contains three key properties: the start position of the string, the end position of the string, and the semantic index keyword for the string. The object is then added to the UIMA semantic index. When many strings in the file match a particular regular expression, the UIMA semantic index therefore contains a corresponding number of annotation objects.

Here is a simple example using animals. Suppose our .txt document contains the words "Animal", "Pet", "Dog", and "tiger", and we want UIMA to analyze the entire document semantically and classify every animal-related word as "Animal". The corresponding regular expression we create is:

Listing 1. A regular expression for animal terms

private Pattern animal = Pattern.compile("Animal|pet|sheep|tiger|lion|cat|dog|duck");

This expression matches every occurrence of these words in the file contents. UIMA then creates an annotation object for each matched string, recording the position of the string and assigning it the common semantic keyword "animal". The semantic keyword thus serves as an entry point through which you can find the exact location of every word in the file that meets the criteria.

Figure 1. The process of establishing semantic indexing
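
The annotation step described above can be sketched in plain Java. This is a minimal illustration, not the UIMA API: the `Annotation` class, the `annotate` method, and the `AnimalAnnotator` name are hypothetical, and the `CASE_INSENSITIVE` flag is an assumption added so that "Dog" and "dog" are treated alike. A real UIMA annotator would create annotations through the UIMA framework and add them to its indexes instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnimalAnnotator {

    // Hypothetical stand-in for a UIMA annotation object.
    static final class Annotation {
        final int begin;      // offset of the first matched character
        final int end;        // offset just past the last matched character
        final String keyword; // shared semantic index keyword

        Annotation(int begin, int end, String keyword) {
            this.begin = begin;
            this.end = end;
            this.keyword = keyword;
        }
    }

    // Same alternatives as Listing 1; case-insensitive matching is an assumption.
    private static final Pattern ANIMAL = Pattern.compile(
            "animal|pet|sheep|tiger|lion|cat|dog|duck", Pattern.CASE_INSENSITIVE);

    // Creates one annotation per match, all sharing the keyword "animal".
    static List<Annotation> annotate(String text) {
        List<Annotation> semanticIndex = new ArrayList<>();
        Matcher m = ANIMAL.matcher(text);
        while (m.find()) {
            semanticIndex.add(new Annotation(m.start(), m.end(), "animal"));
        }
        return semanticIndex;
    }

    public static void main(String[] args) {
        String text = "My Dog chased a tiger.";
        for (Annotation a : annotate(text)) {
            System.out.println(a.keyword + " [" + a.begin + ", " + a.end + ") "
                    + text.substring(a.begin, a.end));
        }
    }
}
```

Note that a bare alternation such as this also matches inside longer words (for example "cat" inside "category"). Wrapping the alternation in word boundaries (`\b(...)\b`) is one way to avoid that if it matters for your data.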

How semantic search and classification work with IBM OmniFind Enterprise Edition combined with UIMA

Semantic search and classification are implemented by combining the IBM OmniFind Enterprise Edition keyword index with semantic analysis performed by the Unstructured Information Management Architecture (UIMA). After IBM OmniFind Enterprise Edition indexes the contents of a file by keyword, it stores the results of the UIMA-based semantic analysis in the semantic query index (semantic search). At search time, the program queries the semantic query index in addition to the keyword index, and returns the results of the keyword lookup and the semantic lookup to the user together.

UIMA performs the semantic analysis and adds the results to the index during the parse phase of IBM OmniFind Enterprise Edition. IBM OmniFind Enterprise Edition parses the collected files and builds the semantic index according to the rules defined in UIMA. Finally, the keyword index and the semantic index are added together to the index file, as shown in the figure:

Figure 2. The process of implementing semantic analysis and adding results to the index
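
The two-index idea described above can be illustrated with a small, self-contained Java sketch. It is a toy model under stated assumptions, not how OmniFind stores its indexes: the `CombinedIndex` class, its `addRule`, `addDocument`, and `search` methods, and the in-memory maps are all hypothetical. The parse step fills a keyword index (one entry per token) and a semantic index (one entry per rule that matches), and a search consults both.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class CombinedIndex {
    // Hypothetical rule set: semantic keyword -> regular expression for its member terms.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();
    // keyword -> ids of documents that literally contain the keyword
    private final Map<String, Set<String>> keywordIndex = new HashMap<>();
    // semantic keyword -> ids of documents in which a rule for that keyword matched
    private final Map<String, Set<String>> semanticIndex = new HashMap<>();

    void addRule(String semanticKeyword, String regex) {
        rules.put(semanticKeyword.toLowerCase(Locale.ROOT),
                Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    // Parse phase: build both indexes for one document.
    void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                keywordIndex.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(text).find()) {
                semanticIndex.computeIfAbsent(rule.getKey(), k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search phase: merge keyword hits and semantic hits for a single-word query.
    Set<String> search(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<>();
        hits.addAll(keywordIndex.getOrDefault(q, Collections.emptySet()));
        hits.addAll(semanticIndex.getOrDefault(q, Collections.emptySet()));
        return hits;
    }

    public static void main(String[] args) {
        CombinedIndex index = new CombinedIndex();
        index.addRule("animal", "animal|pet|sheep|tiger|lion|cat|dog|duck");
        index.addDocument("doc1", "Animal welfare report");
        index.addDocument("doc2", "The tiger escaped");
        index.addDocument("doc3", "Quarterly sales figures");
        System.out.println(index.search("animal"));
    }
}
```

In this sketch, a search for "animal" returns doc2 even though that document never contains the word "animal", because the rule matched "tiger" at parse time. That is the behavior the keyword-only approach cannot provide.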

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
