Scope of the survey
The purpose of this survey is to find a knowledge base system with strong full-text search functionality and performance. The survey gives priority to how full-text search is implemented, and also weighs the functionality, performance, and effort of building a complete system. There are several workable approaches to building a knowledge base system with full-text search, mainly the following three:
Solution A: Use a database that supports full-text retrieval to build a database for managing large text, and design the knowledge base system on top of that database;
Solution B: Use a full-text search engine to implement full-text search, and design the knowledge base system around it;
Solution C: Use a wiki to build the knowledge base system.
The following sections describe each solution.
1 Full-text Retrieval Database Solution
1.1 Oracle Text
Introduction
Databases that support full-text retrieval perform efficient searches over large bodies of text stored in database tables by creating and maintaining indexes. In a knowledge base system, the objects of full-text search are mainly articles that express knowledge. A typical search finds all articles whose title or content contains a search term. If the articles are stored in a database table with text columns such as title and content, full-text search in the knowledge base system is equivalent to full-text search in the database. Oracle's full-text retrieval component is called Oracle Text in Oracle9i. The architecture of Oracle Text is shown in Figure 1. Within the scope of this survey, the core of this architecture is indexing the title, content, and other fields of the articles stored in the database.
Figure 1 Oracle Text Architecture
1.2 Demo
Full-text retrieval of Chinese text with an Oracle database
Objective: To confirm that Oracle Text supports Chinese full-text search, summarize the procedure, and explore ways to improve performance. (The demo is based on Oracle 9.02 Personal Edition. The Enterprise Edition and Standard Edition should give the same result.) The main process is described here:
Step 1 Check the database configuration
(1) Check whether the Oracle Text component is installed in the database (Oracle9i installs it by default) and whether the database has the ctxsys user and the ctxapp role. If not, run $ORACLE_HOME/bin/dbassist, select 'modify database', and select both JServer and interMedia when choosing the database features.
(2) Check whether the listener is serving the PLSExtProc service by running lsnrctl status in a Windows command prompt. Generally, output showing that PLSExtProc has 1 service handler(s) indicates that external procedure calls are enabled. Otherwise, modify $ORACLE_HOME/network/admin/listener.ora to enable this function.
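The check in (1) can be done from SQL*Plus. The following is a minimal sketch, assuming the session is connected as a user that can read the DBA data dictionary views (for example SYSTEM). Each query should return one row if the object exists.

```sql
-- Assumption: run as a user with access to the DBA_* dictionary views.

-- Returns one row if the Oracle Text schema owner exists.
SELECT username
FROM   dba_users
WHERE  username = 'CTXSYS';

-- Returns one row if the Oracle Text application role exists.
SELECT role
FROM   dba_roles
WHERE  role = 'CTXAPP';
```

If either query returns no rows, the Oracle Text component is probably not installed and the dbassist step above applies.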
Step 2 Create the knowledge database and the ctxtest user
(1) Create the database knowledge.
(2) Create the user ctxtest in the knowledge database:
    create user ctxtest identified by ctxtest;
(3) Grant the connect, ctxapp, and resource roles to the ctxtest user:
    grant connect, ctxapp, resource to ctxtest;
Step 3 Design the database and create the tables
(1) Use PowerDesigner to create a conceptual model for the knowledge base system, export the physical model, and generate an SQL script that creates the tables in the database instance knowledge. The demo uses the article and version tables, which represent an article and each version of that article. It is assumed that the content, author, and attachments can change from version to version, but that the title of an article cannot be modified.
Figure 2 conceptual model of Knowledge Base System
(2) Insert some records into the created table, including:
    insert into version (id, content) values (1, 'Good morning');
    insert into version (id, content) values (2, '<title>Good morning</title>');
    insert into version (id, content) values (3, 'I am a Chinese, I love my motherland and people deeply. I am the son of the people');
Step 4 Create an index
Connect to the knowledge database as the ctxtest user.
(1) Create the lexer preference:
    begin
      ctx_ddl.create_preference('my_lexer', 'chinese_vgram_lexer');
    end;
    /
(2) Create the index:
    create index myindex on version (content)
      indextype is ctxsys.context
      parameters ('lexer my_lexer');
The above command indexes the content column of the version table using chinese_vgram_lexer, a Chinese lexer provided by Oracle. The index data is stored in the default temporary tablespace of the ctxsys user, in four tables whose names start with DR$: DR$MYINDEX$I, DR$MYINDEX$K, DR$MYINDEX$R, and DR$MYINDEX$N. Of these, DR$MYINDEX$I is the most important. The query
    select token_text, token_count from dr$myindex$i;
shows that this table stores all the index tokens obtained by analyzing the content column of the version table.
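If some rows cannot be indexed, Oracle Text records the failures in a dictionary view rather than failing the whole statement. The following is a minimal sketch for inspecting them, assuming the session is connected as the index owner (ctxtest in this demo). An empty result means no indexing errors were logged.

```sql
-- Assumption: connected as the owner of myindex (ctxtest in the demo).
-- Lists per-row indexing errors, most recent first.
SELECT err_index_name,
       err_timestamp,
       err_textkey,   -- rowid of the row that failed to index
       err_text       -- error message
FROM   ctx_user_index_errors
ORDER  BY err_timestamp DESC;
```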
Step 5 Execute the query
    select * from version where contains(content, 'China') > 0;
This query returns the third record inserted into the table.
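A knowledge base search usually also needs to rank its results. The following is a minimal sketch of a ranked variant of the query above, using the SCORE operator that Oracle Text pairs with CONTAINS. The label 1 is an arbitrary number that ties SCORE to the matching CONTAINS call.

```sql
-- The third argument of CONTAINS (1) is a label; SCORE(1) returns the
-- relevance score computed for the CONTAINS call with the same label.
SELECT id,
       SCORE(1) AS relevance
FROM   version
WHERE  CONTAINS(content, 'China', 1) > 0
ORDER  BY SCORE(1) DESC;
```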
Step 6 Create jobs to maintain the index
The index was built when the create index statement above was executed. However, later insert, delete, and update operations on the table do not update the index automatically, so a job is needed to synchronize the index periodically. In addition, after many synchronizations the index becomes fragmented and needs to be optimized, so a second job is needed to optimize the index structure regularly. The two jobs can be created as follows.
Synchronization (sync):
    variable jobno number;
    begin
      dbms_job.submit(:jobno, 'ctx_ddl.sync_index(''myindex'');', sysdate, 'sysdate + (1/24/4)');
      commit;
    end;
    /
Optimization (optimize):
    variable jobno number;
    begin
      dbms_job.submit(:jobno, 'ctx_ddl.optimize_index(''myindex'', ''FULL'');', sysdate, 'sysdate + 1');
      commit;
    end;
    /
The interval sysdate + (1/24/4) makes the first job synchronize the index every 15 minutes. The interval sysdate + 1 makes the second job run a full optimization once a day. The actual intervals should be chosen according to application requirements. At this point, the full-text search function is set up.
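To confirm that the maintenance jobs are registered and that synchronization is keeping up, the job queue and the pending-row view can be inspected. The following is a minimal sketch, assuming the session is connected as the user that owns the index and submitted the jobs (ctxtest in this demo).

```sql
-- Jobs submitted by the current user, with their next run time and interval.
SELECT job, what, next_date, interval
FROM   user_jobs;

-- Rows changed since the last synchronization and not yet indexed.
-- An empty result means the index is up to date.
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
FROM   ctx_user_pending;

-- Synchronize immediately instead of waiting for the next scheduled run.
BEGIN
  ctx_ddl.sync_index('myindex');
END;
/
```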
1.3 Two Problems
1.3.1 Oracle Text support for indexing HTML
A knowledge base system generally uses an online editor to edit articles, and the WYSIWYG text it produces is usually HTML. The many markup symbols in the HTML source should not take part in matching the query string. A tag such as <title> should not be found by a search for "title", and if these parts are not filtered out when the index is created, the efficiency of a large text query system suffers considerably. As the Oracle Text architecture diagram shows, Oracle Text takes HTML, XML, and other marked-up document formats into account when it tokenizes text for indexing, and markup that carries no meaning can be filtered out during index creation. This is done by specifying a section group when creating the index. SQL statements:
    begin
      ctx_ddl.create_section_group('my_section_group', 'basic_section_group');
      ctx_ddl.add_field_section(
        group_name   => 'my_section_group',
        section_name => 'title',
        tag          => 'title',
        visible      => false);
    end;
    /
    drop index my_html_idx;
    create index my_html_idx on version (content)
      indextype is ctxsys.context
      parameters ('section group my_section_group');
A query can then be restricted to the title section:
    select id from version where contains(content, 'Good morning within title') > 0;
1.3.2 Improving the Chinese lexer setting
When the lexer preference was created for the index, the lexer was set to chinese_vgram_lexer, which is provided by Oracle. Oracle also provides chinese_lexer, whose word segmentation algorithm is more intelligent and more efficient. In the demo, when an index is created on the content field of the third record in the table, the number of index tokens drops from 53 with chinese_vgram_lexer to 31 with chinese_lexer. The official documentation states that chinese_lexer supports only databases that use the UTF-8 character set, but during the demo it also worked with ZHS16GBK.
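Switching the demo index to chinese_lexer could look like the following. This is a sketch under assumptions, not the exact statements used in the demo: the preference name my_chinese_lexer is hypothetical, and the existing index is dropped first so that the column can be re-indexed with the new lexer.

```sql
-- Assumption: connected as ctxtest; my_chinese_lexer is an illustrative name.
BEGIN
  ctx_ddl.create_preference('my_chinese_lexer', 'chinese_lexer');
END;
/

-- Rebuild the index from scratch with the new lexer.
DROP INDEX myindex;

CREATE INDEX myindex ON version (content)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer my_chinese_lexer');

-- Compare the number of index tokens with the count obtained under
-- chinese_vgram_lexer.
SELECT COUNT(*) AS token_rows
FROM   dr$myindex$i;
```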
(Unfinished, to be continued)