Oracle9i Full-Text search technology

Source: Internet
Author: User
Tags filter file system include insert new set time interval alphanumeric characters oracle database
oracle| Full-Text search technology is one of the key technologies of intelligent information Management, Oracle text as a component of Oracle9i provides a powerful Full-text search function, using oracle9i to do background database, you can fully use its Full-text search technology, Build a complex large document management system. This paper mainly introduces the architecture and use of Oracle text.

Keyword Oracle text full-text search

Oracle has been committed to the full text search technology, when the Oracle9i RLEASE2 release, the Oracle database Full-text search technology has been very perfect, Oracle text enables oracle9i has a strong text retrieval capabilities and intelligent text management capabilities. Oracle text is the new name used by Oracle9i, which is called Oracle intermedia text in oracle8/8i, and its name is Oracle context cartridge before Oracle8. Using Oracle9i and Oracle text makes it easy and efficient to leverage standard SQL tools to build new text-based development tools or extend existing applications. Application developers can make full use of Oracle text searches in any Oracle database application that uses text, which can be searchable annotation fields in existing applications, but also large document management systems that involve multiple document formats and complex search criteria. Oracle text supports basic Full-text search capabilities in most languages supported by Oracle databases. This article will introduce how to use oracle9i Full-text search technology to provide a good solution for your application.
 
1 Oracle-Text architecture

The following figure is the Oracle text architecture.


Figure 1 Architecture for Oracle text

Based on the architecture diagram above, the main logical steps used by the Oracle Text Index document are as follows:

(1) The data stores all the rows of the logical search table and reads the data in the column. Typically, this is just column data, but some data stores pointers to document data using column data. For example, Url_datastore uses column data as a URL.

(2) The filter extracts the document data and converts it to a textual representation. This is necessary when storing binary documents, such as Word or Acrobat files. The output of a filter does not have to be in plain text format-it can be a text format such as XML or HTML.

(3) The segmented device extracts the output information of the filter and converts it to plain text. Different text formats, including XML and HTML, have different segments. Converting to plain text involves detecting important document segment tags, removing invisible information, and reformatting text.

(4) The lexical analyzer extracts the plain text in the segmented device and splits it into discontinuous markers. There are lexical parsers used in whitespace-delimited languages, as well as specialized lexical parsers used in segmented and complex Asian languages.

(5) The index engine extracts all the tags in the lexical analyzer, the offset of the document segment in the segment, and the Low information content Word list called the non-indexed word, and constructs the reverse index. Inverted indexes store tags and documents that contain these tags.

2 simple examples

Here is a simple example of the use of Oracle text to achieve Full-text search methods and procedures, in the following specific instructions. Orcale9i provides Oracle text manager that simplifies a lot of work, and all the work done in Oracle Text Manager can be accomplished through pl/sql. To use Oracle Text, you must have a ctxapp role or a ctxsys user. Oracle text provides system administrators with Ctxsys users and Ctxapp roles for application developers.

Ctxsys users can perform the following tasks: Start the Oracle Text server and perform all tasks for the Ctxapp role.
Users with the Ctxapp role can perform the following tasks: Creating indexes, Managing Oracle Text data dictionaries, including creating and deleting preferences, performing Oracle text queries, and using Oracle text PL/SQL packages.

Steps to use Oracle text:

(1) Create a table to hold some documents. The example uses a primary key column to identify each document and saves each document with a small VARCHAR2 column.

CREATE TABLE docs (ID number PRIMARY KEY, Text VACHAR2 (80));

(2) Placing two sample documents into the table:

INSERT into Docs VALUES (1, ' The ' the ' ");
INSERT into Docs VALUES (2, ' the second Doc ');
COMMIT;

(3) Use Oracle Text Manager to create and modify preferences, and preferences are associated with indexes.

(4) Use Oracle Text Manager to create a text index. Alternatively, you can enter the following SQL statement that uses the default preferences:

CREATE INDEX Doc_index on docs (text) Indextype is Ctxsys. context;

(5) Use the CONTAINS function to issue content-based document queries. For example:

SELECT ID from Docs WHERE CONTAINS (text, ' a ') > 0;

This will find all rows in the docs of the text column that contains the word, document 1. The >0 part of the statement is required for valid Oracle SQL, and Oracle SQL does not support Boolean return values for functions.

The above is just a simple example, designed to give you the complete steps for using Oracle text to establish a Full-text index, which is summed up as follows:

(1) Build the table and load the text (containing the text fields that need to be retrieved)

(2) configuration index

(3) Index creation

(4) Issuing enquiries

(5) Index maintenance: Synchronization and optimization (will be introduced later)

3 Text Load

To implement Full-text retrieval of text you must first load the correct text into a database table, the default indexing behavior requires that the document be loaded in a text column, although the document can be stored in other ways (including file system and URL) (set in the data store option). By default, the system should load the document in a text column. The text column can be VARCHAR2, CLOB, BLOB, char, or bfile. Note that storing text in the opposite column type, long and long, is supported only if the ORACLE7 system is ported to Oracle8. The column type NCLOB, date, and number cannot be indexed.

About document formatting, because the system can index most document formats, including HTML, PDFs, Microsoft Word, and plain text, you can mount any of these document types into a text column (set in the Filter option). For more information about the supported document formats, see the Appendix "Supported Filter Formats" in Oracle Text User ' s Guide and Reference.

There are several main methods of loading:

(1) SQL INSERT statement

(2) Ctxload executable file

(3) Sql*loader

(4) The Dbms_lob of the LOB is loaded from the BFILE. LoadFromFile () Pl/sql process

(5) Oracle call Interface

4 Indexing the text

After the text is loaded into a text column, you can create an Oracle text index. Documents are stored in many different scenarios, formats, and languages. Therefore, each Oracle Text index has many options that need to be set up to configure the index for specific situations. When you create an index, Oracle text uses several default values, but in most cases the user is required to configure the index by specifying a preference.

Many of the options for each index are composed of functional groups, called "classes," where each class embodies one aspect of the configuration, which can be considered to be a problem related to the document database. For example: data storage, filter, lexical analyzer, related thesaurus, storage and so on.

Each class has a number of predefined behaviors, called objects. Each object is an answer that a class problem might have, and most objects contain attributes. Attributes are used to customize the object so that the configuration of the index is more variable to accommodate different applications.

(1) Storage (Storage) class

The storage class specifies the table space parameters and the creation parameters of the database tables and indexes that make up the Oracle text index. It has only one basic object: Basic_storage, and its properties include: I_index_clause, I_table_clause, K_table_clause, N_table_clause, P_table_clause, R_ Table_clause.

(2) data storage (Datastore) class

Data store: A location and other information about the text stored in a column. By default, text is stored directly in the column, and each row in the table represents a separate, complete document. Other data storage locations include Web pages that are stored in separate files or identified by their URLs. Seven basic objects include: Default_datastore, Detail_datastore, Direct_datastore, File_datastore, Multi_column_datastore, URL_ Datastore, User_datastore,.

(3) Document segment Group (section Group) class

A document segment group is an object that specifies a set of document segments. Before you can use the index to query within a document segment by using the WITHIN operator, you must define a document segment. The document segment is defined as part of the document segment group. Contains seven basic objects: Auto_section_group, Basic_section_group, Html_section_group, News_section_group, Null_section_group, XML_ Section_group, Path_section_group.

(4) Related Thesaurus (Wordlist) class

Related Thesaurus identifies the language of the stem and fuzzy matching query options for the index, with only one basic object basic_wordlist, with attributes: Fuzzy_match, Fuzzy_numresults, Fuzzy_score, Stemmer, Substring_index, Wildcard_maxterms, Prefix_index, Prefix_max_length, Prefix_min_length.

(5) Indexed set (index set)

An index set is a collection of one or more Oracle indexes (not Oracle text indexes) that create an Oracle text index of type Ctxcat, with only one basic object Basic_index_set.

(6) Lexical Analyzer (Lexer) class

The lexical Analyzer class identifies the language used by the text, and also determines how the markup is identified in the text. The default lexical Analyzer is English or other Western European languages that identify tags with spaces, standard punctuation, and non-alphanumeric characters, while disabling case. Contains 8 basic objects: Basic_lexer, Chinese_lexer, Chinese_vgram_lexer, Japanese_lexer, Japanese_vgram_lexer, KOREAN_LEXER, KOREAN __morph_ LEXER, Multi_lexer.

(7) Filters (Filter) class

The filter determines how text is filtered to establish an index. You can use filters to index documents processed by a word processor, formatted documents, plain text, and HTML documents, including 5 basic objects: Charset_filter, Inso_filter INSO, Null_filter, Procedure_filter, User_filter.

(8) Non-indexed character table (Stoplist) class

A non-indexed character table class is used to specify a set of words that are not indexed (called non-indexed words). There are two basic objects: Basic_stoplist (all non-indexed words in one language), multi_stoplist (multiple-language non-indexed word tables that contain non-indexed words in multiple languages).

5 Query

When an index is established, a text query can be issued using the CONTAINS operator in the SELECT statement. There are two kinds of queries you can use CONTAINS: Word query and about query.

5. 1 Word Query examples

A word query is a query for an exact word or phrase that is entered between single quotes in the CONTAINS operator. In the following example, we will look for all documents in the text column that contain the word Oracle. The score for each row is selected by the SCORE operator using the label 1:

SELECT SCORE (1) title from News WHERE CONTAINS (text, ' Oracle ', 1) > 0;

In query expressions, you can use text operators such as and and or to get different results. You can also add a structural predicate to the WHERE clause. You can use COUNT (*), ctx_query. Count_hits or Ctx_query. EXPLAIN to calculate the number of hits (matches) for the query.

5. 2 about Query sample

In all languages, the about query increases the number of related documents returned by a query. In English, the about query can use the Indexed keyword component, which is created by default. In this way, the operator returns the document based on the concept of the query, not just the exact word or phrase specified. For example, the following query looks for all documents in a text column about the subject politics, rather than a document that contains only the word politics:

SELECT SCORE (1) title from News WHERE CONTAINS (text, ' About (politics) ', 1) > 0;

6 displaying documents that meet query criteria

Typically, the user can view the document returned by the query by using the Oracle text Query application. The user chooses a document in the list, and the application displays the document in some form. With Oracle Text, documents can be reproduced in different ways. For example, you can display a document by highlighting the query word. The highlighted query Word may be the word in the query of relevant words, or it may be the subject word in English about query.

The following are information about the output effect and the procedure for each output effect:

Highlighted document, plain Text format version (Ctx_doc. MARKUP)

Highlighted document, HTML version (Ctx_doc. MARKUP)

Highlight the offset information for the plain Text format version (Ctx_doc. Highlight)

Highlight the offset information for the HTML version (Ctx_doc. Highlight)

Plain text Format version, no highlighting (Ctx_doc. FILTER)

HTML version document, no highlighting (Ctx_doc. FILTER)

7 Index maintenance

When the index is built, what if the data in the table changes, such as adding or modifying the record? Because indexes are not automatically modified for any DML statements that occur on a table, you must synchronize (sync) and optimize (optimize) indexes to correctly reflect the changes in the data.
After the index is built, you can find out from the user that Oracle automatically produces the following tables: (assuming the index is named Myindex):
Dr$myindex$i,dr$myindex$k,dr$myindex$r,dr$myindex$n

The most important of which is the I table, you can query the table:

Select Token_text, token_count from dr$ myindex $I where rownum<=20;

The query results are omitted here. As you can see, this table is actually saved by Oracle analysis of your documents, the resulting term records here, including the location of the term, the number of times, hash value, and so on. When the content of the document changes, you can imagine that the contents of the I table should also be changed to ensure that Oracle in the Full-text search is correctly retrieved content (because the so-called full-text search, in fact, the core is to query this table). Then how to maintain the contents of the table, not every time the data changes are indexed, this will use sync and optimize.

Sync: Saves the new term to the I table;

Optimization (optimize): Clears the garbage of I table, mainly deletes the term that has been deleted from the I table.

Oracle provides a so-called CTX server to do this synchronization and optimization work, only to run the process in the background, it will monitor the changes in the data, timely synchronization. Alternatively, you can do this by using the following job (the job is built under the same user as the table):

Create or Replace procedure Sync
Is
Begin
Execute immediate
' ALTER index MYINDEX rebuild Online ' | |
' Parameters (' sync ') ';
Execute immediate
' ALTER index MYINDEX rebuild Online ' | |
' Parameters (' optimize full maxtime unlimited ') ';
End sync;
/

Set Serveroutput on
Declare
V_job number;
Begin
Dbms_job.submit
(
Job => V_job,
what => ' sync; ',
Next_date => sysdate, * default * *
Interval => ' sysdate + 1/720 ' * = 1 day/(hrs * min) = 2 mins * *
);
Dbms_job.run (V_job);
Dbms_output.put_line (' submitted as Job # ' | | | to_char (v_job) ');
End
/
The Sysdate + (1/720) of the job is synchronized every 2 minutes. The specific time interval, may according to own application the need to decide.

8 Summary

Text for all sizes of companies, organizations, are the most effective carrier containing a wealth of information, the introduction of Oracle text, marking Oracle provides a new set of technologies, can be convenient and safe for the management of enterprise text information. Oracle text enables application developers to transparently add Full-text search capabilities to sql-based applications, and Oracle text is a core component of other Oracle products, such as Oracle9iAS portal,oracle ebusiness Suite,oracle Ultra Search and Oracle Internet File system. Using the Full-text search technology provided by Oracle text flexibly, it can make its application have powerful full-text searching ability.





Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.