Oracle Text is a powerful search technology built into all versions of the Oracle database, including the free quick edition (XE. Its development APIs enable software developers to easily implement Content Search applications with complete functions.
Oracle Text can be used to search structured and unstructured documents. It is a supplement to matching SQL wildcards. Oracle Text supports combining multiple search entries using basic boolean operators (AND, OR, NOT, NEAR, AND so on). In addition, it also provides more advanced functions, such as soundex, fuzzy search, and result sorting. This technology supports hundreds of file types, including Microsoft Office and PDF. Oracle Text is suitable for a variety of search-related usage and storage structures. Text applications include e-commerce, document and record management, and problem tracking. Searchable text can reside in a database in a structured manner, or in a local file system or on the Web.
Oracle Text provides a complete SQL-based search API that includes custom query operators, DDL syntax extensions, a set of PL/SQL processes, and database views. Through the Text API, application developers can have full control over indexes, queries, security, demonstrations, and sometimes required software configurations, this is especially true when developing out-of-the-box and out-of-use non-custom software. Through out-of-the-box software products, you want to make the software configuration as simple as possible, even if this means that you need to do more preliminary work in product development. Reducing the complexity of applications usually results in the later stages of the product lifecycle, especially in the support, maintenance, and future product development stages.
Oracle Text also supports document-level authorization, and document-level authorization is usually difficult to maintain high performance while being unified. With the help of Text, hybrid query of link data and non-structured data is also well supported. For authorization, this means that you can combine full-text search and authorization into a query. The number of filtering phases required for independent result sets and final results can be minimized, simplifying application development. Oracle Text frees application developers from tedious development and allows them to focus on performance optimization.
Oracle Text is also an unknown programming language, and it can also be used for PHP and Java applications.
Some time ago, I needed to improve the search function of the Enterprise Content Management (ECM) system. I first evaluated the use of Oracle Text. The evaluation proves that Oracle Text is a very feasible technology for building application search: it has advanced search function, supports a large number of different file types, can be highly customized, and highly scalable at the same time. One disadvantage of the original search technology is that you need to search the file content outside the database, then run the database metadata search, authorize the results, and finally merge independent result sets. With Oracle Text, all these operations can be performed in the database. The ECM system uses Oracle databases to store metadata. Because this technology has been released, the customer will naturally choose to use it, and it will not add any cost to the customer.
A Simple Method for performing a free text search query in a database is similar:
SELECT * FROM issues
Where lower (author) LIKE % word1 % and lower (author) LIKE % word2 %...
In this way, each column needs to be matched with each keyword separately. In each column, you can match the keyword in any order. However, the design of relational databases does not allow it to execute queries as efficiently as above, and using this method produces extremely unscalable applications. Of course, you can design your own indexing and search solutions, but you may not optimize the use of your resources, this is especially true if you have already paid for the search technology as part of the database.
This article discusses how to use Oracle Text in virtual problem tracking applications. In this application, you can create issues that contain metadata and optional additional files. This application uses Oracle Text to implement full-Text search of metadata and optional additional file content.
The examples provided here have been tested on Oracle XE in Linux. These examples should also run well on other Oracle platforms.
Indexing Process and search
After Oracle Text is indexed as a searchable data item, you can search for the content. Indexing is a common method to ensure search performance. The indexing process of Oracle Text is modeled based on the pipeline. In this pipeline, keywords of data items retrieved from the data storage are added to the index after a series of transformations. The indexing process is divided into multiple stages. Each stage is processed by a separate entity and can be configured by application developers.
Oracle Text has different index types suitable for different purposes. For full-text search of large documents, the CONTEXT index type is suitable. The indexing process includes the following phases:
◆ Data retrieval: Only data is retrieved from data storage (such as Web pages, large database objects, or local file systems) and then transmitted as data streams to the next stage.
◆ Filter: the filter converts data in various file formats to plain text formats. Other components in the indexing pipeline can only process plain text data and cannot recognize file formats such as Microsoft Word or Excel.
◆ Segmentation: adds metadata about the structure of the original data item to the field splitter.
◆ Lexical analysis: the NLP stream is divided into several words based on the language of the data item.
◆ Index: add keywords to the actual index in the last phase.
After the index is built, the application can run the search entered by the end user through common SQL queries.
Install Oracle Text
By default, Oracle Text is installed with Oracle database XE. To use other database versions, you must install the Oracle Text function on your own. After this function is installed, you only need to create a common database user and assign the user the CTXAPP role. In this way, you can execute a specific index management process:
Create user ot1 identified by ot1; GRANT connect, resource, ctxapp TO ot1;
File index
Here, you need to create a text table to index the content of additional files stored in the problem tracking system. The attached files are stored in the file system. In addition to the columns required for the data model of the application, the text base table also contains an absolute file path and a format column.
Create table files (id number primary key, issue_id NUMBER, path VARCHAR (255) UNIQUE, ot_format VARCHAR (6); insert into files VALUES (1, 1, /tmp/oracletext/found1.txt, NULL); insert into files VALUES (2, 2,/tmp/oracletext/found2.doc, NULL); insert into files VALUES (3, 2, /tmp/oracletext/notfound.txt, IGNORE );
Here, the ot_format value is parsed by Oracle Text during the indexing process. NULL indicates that the system automatically selects a filter for the file. If the value is IGNORE, the system skips the entire file.
You can use the following statement to create a text index:
Create index file_index ON files (path) indextype is ctxsys. context PARAMETERS (datastore ctxsys. file_datastore format column ot_format );
This statement starts the indexing process. The indexing process searches for files in the file system by the path stored in the base table, filters the content, and creates an index. In this way, a CONTEXT index is created, which is case sensitive and has exact matching semantics. The indexing process can be customized in multiple ways, such as prefix and suffix matching.
Although you do not need to specify the file format for each file during most of the time filtering phase, adding this column to the base table can further control the indexing process. For example, if you use a format column, you can skip some file types and do not create an index for it. This column is useful when you only want to officially support partial file formats supported by Oracle Text in the application.