Oracle has been committed to the research of full-text retrieval technology. When Oracle9i rlease2 was released, the full-text retrieval technology of Oracle databases was already perfect, oracle Text enables Oracle9i to provide powerful text retrieval capabilities and Intelligent Text management capabilities. Oracle Text is the new name used by Oracle9i. In oracle8/8i, it is called Oracle intermedia text. Before oracle8, its name was Oracle context cartridge. With Oracle9i and Oracle Text, you can easily and effectively use standard SQL tools to build new text-based development tools or expand existing applications. Application developers can make full use of Oracle Text Search in any Oracle database application that uses text. the application scope can be the comment fields that can be searched in existing applications, but it also implements a large document management system that involves a variety of document formats and complex search standards. Oracle Text supports basic full-text search in most languages supported by Oracle databases. This article describes how to use the full-text retrieval technology of Oracle9i to provide an excellent solution for your applications.
1 architecture of Oracle Text
Is the architecture of Oracle Text.
Figure 1 architecture of Oracle Text
Based on the architecture diagram above, the main logical steps used in Oracle Text Index documents are as follows:
(1) store data to search all rows in a table and read the data in the column. Generally, this is only column data, but some data storage uses column data as a pointer to document data. For example, url_datastore uses column data as a URL.
(2) The filter extracts document data and converts it to text representation. This is required when storing binary files (such as Word or Acrobat files. The output of the filter does not need to be in plain text format-it can be in text format such as XML or HTML.
(3) The Field splitter extracts the output information of the filter and converts it to plain text. Different text formats, including XML and HTML, have different slots. Converting to plain text involves detecting important document field tags, removing invisible information, and re-formatting the text.
(4) the lexical analyzer extracts plain text from the field splitter and splits it into discontinuous tags. There are both lexical analyzers used by blank character separation languages and specialized lexical analyzers used by complicated segments in Asian languages.
(5) The index engine extracts all the tokens in the lexical analyzer, the offset of the document segment in the field splitter, and the list of low-information-content words called non-indexed words, and builds reverse indexes. Inverted indexes store tags and documents containing these tags.
2. Simple Example
Here is a simple example to illustrate how to use Oracle Text to implement full-text search. Orcale9i provides Oracle Text manager to simplify a lot of work. All the work done in Oracle Text manager can be done through PL/SQL. To use Oracle Text, you must have the ctxapp role or ctxsys user. Oracle Text provides the system administrator with the ctxsys user and the ctxapp role for application developers.
The ctxsys user can execute the following tasks: Start the Oracle Text server and execute all tasks of the ctxapp role.
Users with the ctxapp role can perform the following tasks: Create an index, manage the Oracle Text Data Dictionary, including creating and deleting preferences, query Oracle Text, and use the Oracle Text PL/SQL package.
Steps for using Oracle Text:
(1) create a table to save some documents. This example uses a primary keyword column to identify each document and a small varchar2 column to save each document.
Create Table Docs (ID number primary key, text vachar2 (80 ));
(2) place two sample documents in the table:
Insert into docs values (1, 'the first doc ');
Insert into docs values (2, 'The second Doc ');
Commit;
(3) Use Oracle Text manager to create and modify preferences. preferences are associated with indexes.
(4) use Oracle Text manager to create text indexes. In addition, you can enter the following SQL statement that uses the default preference:
Create index doc_index on Docs (text) indextype is ctxsys. context;
(5) use the contains function to issue Content-based document queries. For example:
Select ID from docs where contains (text, 'First ')> 0;
This will find all rows in the docs that contain the word first (that is, document 1) in the text column. More than 0 of the statements are required for valid Oracle SQL statements. Oracle SQL does not support Boolean return values of functions.
The above is just a simple example. It aims to provide the complete steps to create a full-text index using Oracle Text, which are summarized as follows:
(1) create a table and load text (including text fields to be retrieved)
(2) Configure Indexes
(3) create an index
(4) issue a query
(5) index maintenance: synchronization and optimization (will be introduced later)
3. Text Loading
To achieve full-text search, you must first load the correct text into the database table. By default, the index creation requires that the document be loaded in the text column, although you can store documents in other ways (including file systems and URLs) (set in the "Data Storage" option ). By default, the system should load the document in the text column. The text column can be varchar2, clob, blob, Char, or bfile. Note that text can be stored only when the oracle7 system is migrated to oracle8 using the long and long raw columns. You cannot create an index for column nclob, date, or number.
For document formats, the system can index most document formats, including HTML, PDF, Microsoft Word, and plain text, any document type can be loaded into the text column (set in the "filter" option ). For more information about supported document formats, see Appendix "supported filter Formats" in the Oracle Text User's Guide and reference ".
The load methods include the following:
(1) SQL insert statement
(2) ctxload Executable File
(3) SQL * Loader
(4) Load Lob's dbms_lob.loadfromfile () PL/SQL process from bfile
(5) Oracle call interface
4. Create an index for text
After the text is loaded into the text column, you can create an Oracle Text Index. Documents are stored in many different solutions, formats, and languages. Therefore, each Oracle Text Index has many options that need to be set to configure indexes for specific situations. When creating an index, Oracle Text can use several default values, but in most cases, you are required to configure the index by specifying the preference.
Many options of each index constitute a functional group called "class". Each class embodies one aspect of the Configuration. You can think of these classes as problems related to the document database. For example, data storage, filters, lexical analyzers, related word lists, and storage.
Each class has many predefined behaviors, called objects. Each object is a possible answer to a class question, and most objects contain attributes. You can customize objects through attributes to make the index configuration more variable to adapt to different applications.
(1) Storage Class
The storage class specifies the tablespace parameters and creation parameters for the database tables and indexes that constitute the Oracle Text Index. It has only one basic object: basic_storage. Its attributes include: I _index_clause, I _table_clause, k_table_clause, n_table_clause, p_table_clause, and r_table_clause.
(2) Data Storage (datastore)
Data storage: Location of text stored in columns and other information. By default, text is stored directly in the column, and each row in the table represents a separate complete document. Other data storage locations include those stored in individual files or web pages identified by their URLs. Seven basic objects include default_datastore, detail_datastore, direct_datastore, file_datastore, multi_column_datastore, url_datastore, and user_datastore ,.
(3) Section group class
A document field group is an object used to specify a group of documents. You must first define the document segment before using the index to query within the document segment through the within operator. The document segment is defined as part of the document segment group. Contains seven basic objects: auto_section_group, basic_section_group, html_section_group, news_section_group, null_section_group, xml_section_group, and path_section_group.
(4) wordlist class
The related Word Table identifies the language of the stem and fuzzy match query options used for indexing. There is only one basic object basic_wordlist, and its attributes include: fuzzy_match, fuzzy_numresults, fuzzy_score, Stemmer, substring_index, wildcard_maxterms, prefix_index, prefix_max_length, and prefix_min_length.
(5) index set)
An index is a collection of one or more Oracle indexes (not an Oracle Text Index). It is used to create an Oracle Text Index of the ctxcat type and has only one basic object, basic_index_set.
(6) lexer class
The Lexical analyzer class identifies the language used by the text and also determines how to mark the text. The default lexical analyzer is used in English or other Western European languages. It is marked with spaces, standard punctuation marks, and non-alphanumeric characters, and Case sensitivity is disabled. Contains eight basic objects: basic_lexer, chinese_lexer, chinese_vgram_lexer, japanese_lexer, japanese_vgram_lexer, korean_lexer, korean1_morph _ lexer, multi_lexer.
(7) filter class
The filter determines how to filter text to create an index. You can use filters to index documents processed by the text processor, formatted documents, plain text, and HTML documents, including five basic objects: charset_filter, inso_filter Inso, null_filter, procedure_filter, and user_filter.
(8) Non-index Word Table (Stoplist) Class
A non-indexed Word Table class is used to specify a group of words that are not indexed (referred to as non-indexed words ). There are two basic objects: basic_stoplist (all non-indexed words in one language) and multi_stoplist (multi-language non-indexed Word tables that contain non-indexed words in multiple languages ).
5. Query
After an index is created, you can use the contains operator in the SELECT statement to issue a text query. Contains can be used for two types of queries: Word query and about query.
5.1 word query example
Word query is a query of exact words or phrases between single quotes in the input to the contains operator. In the following example, all documents with the Oracle word in the text column are searched. The scores of each row are selected by the score operator of tag 1:
Select score (1) title from news where contains (text, 'oracle ', 1)> 0;
In a query expression, you can use text operators such as and or to obtain different results. You can also add structural predicates to the WHERE clause. You can use count (*), ctx_query.count_hits, or ctx_query.explain to calculate the number of hit (matching) queries.
5.2 about query example
In all languages, about queries increase the number of related documents returned by a query. In English, about queries can use the indexed key word component, which is created by default. In this way, the operator returns a document based on the query concept, rather than just the specified exact word or phrase. For example, the following query searches for all documents about the topic politics in the text column, instead of documents containing only the word "Politics:
Select score (1) title from news where contains (text, 'about (politics) ', 1)> 0;
6. Show the documents meeting the query Conditions
Generally, you can query the returned documents by using Oracle Text. The user selects a document from the hit list and the application displays the document in some form. Oracle Text allows you to reproduce documents in different ways. For example, you can highlight the query word to display the document. The highlighted query word can be a word in the related word query or a key word in the English about query.
The following is information about the output effect and the process used for each output effect:
Highlighted document, in plain text format (ctx_doc.markup)
Highlighted document, HTML Version (ctx_doc.markup)
Highlight the offset information of the plain text version (ctx_doc.highlight)
Highlight the offset information of the HTML Version (ctx_doc.highlight)
Plain text version, not highlighted (ctx_doc.filter)
Html Version document, not highlighted (ctx_doc.filter)
7. Index Maintenance
After the index is created, what if the table data changes, such as adding or modifying records? Because no DML statement occurs on the table, the index is not automatically modified. Therefore, you must regularly synchronize (Sync) and optimize (optimize) indexes to correctly reflect data changes.
After the index is created, you can check that Oracle automatically generates the following tables under this user: (assuming the index name is myindex ):
Dr $ myindex $ I, Dr $ myindex $ K, Dr $ myindex $ R, Dr $ myindex $ n
The I table is the most important. You can query this table:
Select token_text, token_count from Dr $ myindex $ I where rownum <= 20;
The query result is omitted here. As you can see, the table stores the term records generated after Oracle analyzes your documents, including the location, number of times, and hash value of the term. When the content of the document changes, you can imagine that the content of this I table should also change accordingly to ensure that the content is correctly retrieved by Oracle during full-text retrieval (because of the so-called full-text retrieval, in fact, the core is to query this table ). So how to maintain the content of the table, you cannot re-create the index for every data change, which requires sync and optimize.
Sync: Save the new term to the I table;
Optimize: clears the garbage from the I table, mainly to delete the deleted term from the I table.
Oracle provides a so-called CTX server for synchronization and optimization. You only need to run the process in the background. It monitors data changes and synchronizes data in a timely manner. In addition, you can also use the following job (the job should be created under the same user as the table ):
Create or replace procedure sync Is Begin Execute immediate 'Alter index myindex rebuild online' | 'Parameters (''sync '')'; Execute immediate 'Alter index myindex rebuild online' | 'Parameters (''optimize full maxtime unlimited '')'; End sync; /Set serveroutput on Declare V_job number; Begin Dbms_job.submit ( Job => v_job, What => 'sync ;', Next_date => sysdate,/* default */ Interval => 'sysdate + 100'/* = 1 day/(24 hrs * 30 min) = 2 mins */ ); Dbms_job.run (v_job ); Dbms_output.put_line ('submitted as job # '| to_char (v_job )); End; / |
Job sysdate + (1/720) means to synchronize every 2 minutes. The specific time interval can be determined based on your application needs.
8. Summary
Text is the most effective carrier for companies and organizations of all sizes that contain a wealth of information. The launch of Oracle Text marks a new set of technologies provided by Oracle, you can conveniently and securely manage enterprise text information. Oracle Text allows application developers to transparently add full-text retrieval capabilities to SQL-based applications. Oracle Text is also the core component of other Oracle products, such as Oracle9iAS portal, Oracle eBusiness suite, oracle Ultra Search and Oracle Internet file system. By using the full-text retrieval technology provided by Oracle Text, your applications can have powerful full-text retrieval capabilities.