Due to work needs, I set Oracle Intermedia on HP UX And sortium to implement full-text search. It is currently in use. There are many problems and experiences in the setup process. This article is based on Oracle 8.1.6 and 8.1.7, and cannot be applied to other versions.
Currently, full-text retrieval is supported by almost all mainstream databases. Previously, I implemented it on SQL server 2000, which is very simple and convenient. However, it usually takes more than a dozen hours to create a full-text search index. The creation and maintenance of full-text search in Oracle is much faster. It takes only 20 minutes to create an index for a table with 0.65 million records, and only 1 minute for synchronization. But the setting is much more complicated.
I. setup process
1. First, check whether intermedia is installed in your database:
This can be done by checking whether there are ctxsys users and ctxapp roles (role). If you do not have this user and role, it means that the intermedia function is not installed when your database is created. You must modify the database to install this function.
Modification Process:
Run $ ORACLE_HOME/bin/dbassist, select 'modify database', and select both j server and intermedia when selecting the database function (jserver must be installed at the same time when intermedia is installed ). we strongly recommend that you back up the entire database before making this change.
2. Set extproc:
Oracle implements intermedia through the so-called 'external call function' (external procedure). Therefore, setting extproc correctly is a key step.
First, you must configure listener to enable it to listen to requests called by intermedia. You can run $ ORACLE_HOME/bin/netassit to configure the listener. You can also manually modify the configuration file $ ORACLE_HOME/network/admin/listener. ora and restart listener. The following example shows how to manually modify the configuration file.
Open the listener. ora file. Before modification, there are usually the following content (assuming the default listener is used ):
LISTENER =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = MYDATABASE)(PORT = 1521)))SID_LIST_LISTENER =(SID_DESC =(GLOBAL_DBNAME = mydatabase.world)(ORACLE_HOME = /u01/app/oracle/product/8.1.6)(SID_NAME = mydatabase))
|
Extproc has not been configured for this listener. Therefore, you need to add a listener to extproc by adding description and sid_desc respectively. The modified listner. ora is as follows:
CENTER>
LISTENER =(DESCRIPTION_LIST =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = MYDATABASE)(PORT = 1521)))(DESCRIPTION =(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))))SID_LIST_LISTENER =(SID_LIST =(SID_DESC =(GLOBAL_DBNAME = mydatabase.world)(ORACLE_HOME = /u01/app/oracle/product/8.1.6)(SID_NAME = mydatabase))(SID_DESC =(PROGRAM = extproc)(SID_NAME = PLSExtProc)(ORACLE_HOME = /u01/app/oracle/product/8.1.6)))
|
Note that the above host, global_dbname, sid_name, oracle_home should fill in the actual value of your database, but the actual value of program must be extproc.
Configure the tnsnames. ora file on the server. The file is located under $ ORACLE_HOME/network/admin. You can also configure it by running netasst.
Add the following to the tnsnames. ora file:
EXTPROC_CONNECTION_DATA,EXTPROC_CONNECTION_DATA.WORLD =(DESCRIPTION =(ADDRESS_LIST =(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC)))(CONNECT_DATA =(SID = PLSExtProc)))
|
Note that the KEY and SID must be the same as the key and sid_name in listener. ora.
After the modification, restart listener (stop WITH lsnrctl and then lsnrctl start). Then, use tnsping to check whether the configuration is correct:
Tnsping extproc_connection_data or tnsping extproc_connection_data.world:
Attempting to contact (ADDRESS = (PROTOCOL = IPC) (KEY = EXTPROC) OK (140 ms)
Otherwise, check your two files and note that you must restart listener after modification, but you do not need to restart the database.
3. Set the lexical analyzer (lexer ):
The full-text retrieval mechanism of Oracle is actually very simple. The Oracle patented lexical analyzer (lexer) is used to find all ideographic units (Oracle called term) in the article and record them in a group of tables starting with dr $, at the same time, write down the location, number of times, and hash value of the term. During retrieval, Oracle searches for the corresponding term from this table and calculates the frequency of occurrence. Based on an algorithm, it calculates the score (score) of each document, which is called the 'matching rate '. Lexer is the core of this mechanism, which determines the efficiency of full-text retrieval. Oracle provides different lexer for different languages, and we can usually use three of them:
Basic_lexer: for English. It can separate English words from sentences based on Spaces and punctuations, and automatically treat words that have lost retrieval meaning frequently as 'spam ', such as if, is and so on, with high processing efficiency. However, the lexer has many problems when used in Chinese. Because it only recognizes space and punctuation, and generally does not contain spaces in a Chinese sentence, it regards the entire sentence as a term, in fact, the retrieval capability is lost. Taking the phrase 'Chinese people stood up' as an example, the result of the basic_lexer analysis is only one term, that is, 'Chinese people stood up '. If 'China' is retrieved, NO content is retrieved.
Chinese_vgram_lexer: A specialized Chinese analyzer that supports all Chinese character sets. The analyzer analyzes Chinese sentences in units of words. The Chinese people stood up. This sentence will be analyzed into the following terms: medium, Chinese, Chinese, people, and people ', 'stand up ', get up', 'Come '. It can be seen that this analysis method is easy to implement and can achieve 'all-in-One nets', but the efficiency is unsatisfactory.
Chinese_lexer: this is a new Chinese analyzer that only supports the utf8 character set. As we can see above, the analyzer of chinese vgram lexer does not know commonly used chinese words, so the analysis unit is very mechanical, like the above 'people station ', the term "Start Up" does not appear separately in Chinese. Therefore, this term is meaningless and affects efficiency. The biggest improvement of chinese_lexer is that the analyzer can recognize most of the commonly used Chinese vocabulary, so it can analyze sentences more efficiently. The above two stupid units will not appear again, greatly improving the efficiency. However, it only supports utf8. If your database is in the zhs16gbk character set, you can only use the stupid Chinese vgram lexer.
If no settings are made, Oracle uses the basic_lexer analyzer by default. To specify which lexer to use, perform the following operations:
1. Create a preference under the ctxsys User: begin ctx_ddl.create_preference ('My _ lexer ', 'Chinese _ vgram_lexer ').
2. Specify the lexer used when creating an intermedia index:
create index myindex on mytable(mycolumn) indextype is ctxsys.context parameters('lexer my_lexer');
|
In this way, chinese_vgram_lexer is used as the analyzer.
4. Use job timing synchronization and optimization:
After the intermedia index is created, what if the data in the table changes, such as adding or modifying records? Because no dml statement occurs on the table, the index is not automatically modified. Therefore, you must regularly synchronize (sync) and optimize (optimize) indexes to correctly reflect data changes.
After the index is created, we can check that Oracle automatically generates the following tables under this user: (assuming the index name is myindex): DR $ myindex $ I, DR $ myindex $ K, DR $ myindex $ R, DR $ myindex $ N, among which I table is the most important. You can query this table to see what it contains:
Select token_text, token_count from DR $ I _RSK1 $ I where rownum <= 20;
The query is not listed here. As you can see, the table stores the term records generated after Oracle analyzes your documents, including the location, number of times, and hash value of the term. When the content of the document changes, you can imagine that the content of this I table should also change accordingly to ensure that the content is correctly retrieved by Oracle during full-text retrieval (because of the so-called full-text retrieval, in fact, the core is to query this table ). So how to maintain the table content? You cannot re-create an index for every data change! This requires sync and optimize.
1. sync: Save the new term to the I table;
2. optimize: clears the garbage of the I table, mainly to delete the deleted term from the I table.
Oracle provides a so-called ctx server for synchronization and optimization. You only need to run the process in the background. It monitors data changes and synchronizes data in a timely manner. However, I encountered many problems when using ctxserver. Oracle Beijing support is also recommended not to use, but to use the following two jobs (the job should be created under the same user as the table ):
-- sync:VARIABLE jobno number;BEGINDBMS_JOB.SUBMIT(:jobno,'ctx_ddl.sync_index(''myindex'');',SYSDATE, 'SYSDATE + (1/24/4)');commit;END;-- optimizerVARIABLE jobno number;BEGINDBMS_JOB.SUBMIT(:jobno,'ctx_ddl.optimize_index(''myindex'',''FULL'');',SYSDATE, 'SYSDATE + 1');commit;END;
|
The SYSDATE + (1/24/4) of the first job is synchronized every 15 minutes. The SYSDATE + 1 of the second job is fully optimized every one day. The specific time interval can be determined based on your application needs. So far, your full-text search function has been set up.
| [Content navigation] |
| Page 1: setup process |
Page 2nd: Common Errors |