Get more value from unstructured information. See how a simple text mining application uses the UIMA SDK to build a text analysis engine that looks for names in documents. Another UIMA component then writes the results to a table in a DB2® database. DB2 Intelligent Miner then uses this data to find strong associations between people who are often mentioned together.
Brief introduction
There is a growing desire to use information technology to derive greater value from an organization's unstructured information. IBM recently introduced the Unstructured Information Management Architecture (UIMA) framework (see Resources), which simplifies the development and deployment of systems that analyze unstructured media objects, such as documents. Such systems can provide functionality such as semantic search and text mining. Text mining is a data mining technique used to extract information from text. This article describes a very simple text mining application in detail.
Overview
The text mining application described in this article is called Preston. It analyzes documents, looks for the names mentioned in them, and uses text mining to find people who are often mentioned together. Although this is only one of many useful text mining techniques, it demonstrates the main features of such applications and provides a concrete example of how UIMA is used. It also demonstrates how to combine structured databases and text mining. This article is intended for people who want to learn how to use the new UIMA technology to connect unstructured and structured information.
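To make "people who are often mentioned together" concrete, here is a minimal sketch of co-occurrence counting. It is only an illustration under stated assumptions: the function name `pair_support` and the sample data are hypothetical, and the application described in this article performs the mining step with DB2 Intelligent Miner, not with this code.

```python
from collections import Counter
from itertools import combinations


def pair_support(docs_to_names):
    """Return {(name_a, name_b): fraction of documents that mention both}."""
    counts = Counter()
    for names in docs_to_names.values():
        # Deduplicate and sort so each unordered pair is counted once per document.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    total = len(docs_to_names)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical extracted names, keyed by document id (not real data).
mentions = {
    "doc1": ["Alice Smith", "Bob Jones"],
    "doc2": ["Alice Smith", "Bob Jones", "Carol White"],
    "doc3": ["Carol White"],
    "doc4": ["Alice Smith", "Bob Jones"],
}

support = pair_support(mentions)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Pairs with a high share of common documents are candidates for a "strong association"; a real mining tool would typically also weigh how often each person appears alone.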
Figure 1 gives an overview of Preston. The program analyzes documents that are stored as text fields in DB2 database tables. Components in the UIMA framework read the documents from the database, analyze them to find names that appear in certain formats, and then write the results to another database, the extracted information database (EIDB). These components are developed and deployed using the tools in the UIMA SDK, which is available from developerWorks (see Resources). The information in the EIDB is then analyzed and processed to prepare it for text mining, which is done using DB2 Intelligent Miner. The entire application runs easily on a laptop computer.
Figure 1. Overview of the Preston text mining application described in this article
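The data flow in Figure 1 (read documents from a table, find names, write the results to a second table) can be sketched in a few lines. This is not UIMA or DB2 code: it assumes an in-memory SQLite database as a stand-in for DB2, a naive regular expression as a stand-in for the text analysis engine, and hypothetical table and column names (`documents`, `eidb_mentions`, `doc_id`, `body`, `name`).

```python
import re
import sqlite3

# Naive stand-in for a name annotator: two or more consecutive capitalized words.
# A real analysis engine would use far more robust rules.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def find_names(text):
    """Return every substring of text that looks like a personal name."""
    return NAME_PATTERN.findall(text)


def run_pipeline(conn):
    """Read each document, extract names, and store them in the results table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eidb_mentions (doc_id INTEGER, name TEXT)"
    )
    rows = conn.execute("SELECT doc_id, body FROM documents").fetchall()
    for doc_id, body in rows:
        for name in find_names(body):
            conn.execute(
                "INSERT INTO eidb_mentions (doc_id, name) VALUES (?, ?)",
                (doc_id, name),
            )
    conn.commit()


# Hypothetical source table with one made-up biography.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "INSERT INTO documents VALUES (?, ?)",
    (1, "She worked with John Smith and later married Mary Ann Brown."),
)

run_pipeline(conn)
extracted = conn.execute(
    "SELECT doc_id, name FROM eidb_mentions ORDER BY name"
).fetchall()
print(extracted)
```

Keeping the extracted names in an ordinary table is what lets a structured mining tool work on them afterward.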
The documents used as examples in this article are biographies of actors and other people from the Internet Movie Database (IMDB; see Resources). For illustration purposes, I built a structured DB2 database from a subset of the IMDB content and stored the biographies in it as text fields.