Issues in Data Mining under Several Different Storage Formats

Source: Internet
Author: User

In principle, data mining can be applied to discover knowledge in data stored in any form. However, the challenges and techniques of data mining vary with the storage type of the source data. In particular, recent studies show that data mining involves more and more types of data storage. In addition to research on general models and architectures, research has also been carried out on mining techniques and algorithms for complex or new forms of data storage. This section describes some data mining issues for the main data storage types.

A transaction database collects transaction data. In 1993, when Agrawal began to discuss data mining issues, market basket analysis served as the commercial application background. In that setting, the database to be mined records the products that customers put into their shopping baskets, and the goal is to discover associations among the products a customer purchases in order to guide business decisions. For this reason, some people understand a transaction database narrowly as a database of sales transactions. This understanding has its limitations. In fact, mining transaction databases has not only been applied directly to commercial activities such as procurement, sales, and market research, but has also become a general framework for problem solving. For example, we can organize users' accesses to a database or website into a transaction database. The transaction database discussed here therefore refers to this broader category. Discovering knowledge from transaction databases was studied early on but remains a very active topic in data mining. By applying specific techniques to transaction databases, one can obtain knowledge patterns such as association rules, classification, clustering, and prediction of dynamic behaviors.
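As a minimal sketch of what mining a transaction database for association rules involves, the following Python snippet computes the support and confidence of a single rule. The transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration; they are not taken from the cited papers.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "mouse"},
    {"printer", "paper"},
    {"computer", "printer", "mouse"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Rule under test: computer => printer
rule_support = support({"computer", "printer"}, transactions)
rule_confidence = confidence({"computer"}, {"printer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

The same two functions work unchanged if each "transaction" is instead the set of pages a user visited in one session, which is the broader reading of a transaction database described above.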

I. Data Mining in relational databases
A relational database is composed of a set of data tables. The technology is quite mature: it has mature semantic models (such as the entity-relationship model), mature DBMSs (such as Oracle), and mature query languages (such as SQL), and a number of visual tools are available for use or reference. With the spread and deepening of relational database applications, people have begun to consider how to use these databases at a higher level, that is, data mining in relational databases. From a relational database, we can obtain the desired type of knowledge or pattern according to the mining target, such as generalized knowledge, association knowledge, class knowledge, predictive knowledge, and special knowledge.
Data mining in relational databases has accumulated many methods and results. In fact, the transaction database mentioned above can be regarded as a special case of a relational database, and the results of its study can be reused through transformation. Current research tends to integrate multiple techniques to solve practical application problems based on the characteristics of relational databases.
(1) Multidimensional knowledge mining
The knowledge mined from a traditional transaction database is generally single-dimensional. For example, the knowledge "people who buy computers also buy printers" describes an association between products with the "purchase" behavior as its single focal point (dimension). However, this kind of knowledge may not be enough in relational databases. For example, people may want to know "what kind of people who buy computers are more likely to buy printers?", so knowledge such as "high-income people buy printers when they buy computers" is more useful. Since a relational database can store both basic customer information, including income, and customer purchase records, such knowledge can be obtained. This knowledge is multidimensional because it has two focal points: purchase and income. In addition, the concept of multiple dimensions is naturally associated with multidimensional databases. Indeed, the multidimensional databases studied in data warehousing, OLAP, and related research can serve as a more ideal carrier for multidimensional data mining.
(2) Multi-table mining and quantitative data mining
We believe these are two important issues that distinguish relational database mining from traditional transaction database mining. Logically, a relational database is a collection of tables. Therefore, in relational database mining, associations between attributes in different tables must be considered in addition to associations between attributes within one table. The traditional techniques and algorithms used in transaction database mining are generally based on a single table, so multi-table mining techniques must be considered in relational database mining. In addition, relational databases may contain quantitative attributes (such as salary), so the mining of quantitative data must also be addressed.
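The following sketch illustrates one simple way these two issues can be handled together: two tables are joined on a customer key, and the quantitative salary attribute is discretized into a categorical item so that a multidimensional rule can be evaluated. The tables, the `80_000` threshold, and the function names are assumptions made for this example, not a method prescribed by the text.

```python
# Hypothetical tables (illustrative data only).
customers = {  # customer_id -> annual salary
    1: 120_000, 2: 95_000, 3: 30_000, 4: 28_000, 5: 110_000,
}
purchases = [  # (customer_id, item)
    (1, "computer"), (1, "printer"),
    (2, "computer"), (2, "printer"),
    (3, "computer"),
    (4, "computer"), (4, "printer"),
    (5, "computer"), (5, "printer"),
]


def income_band(salary, threshold=80_000):
    """Discretize the quantitative attribute; the threshold is an assumption."""
    return "income=high" if salary >= threshold else "income=low"


# Join the two tables: one basket per customer, holding the income band
# (from the customers table) and the purchased items (from purchases).
baskets = {}
for cid, item in purchases:
    baskets.setdefault(cid, {income_band(customers[cid])}).add("buys=" + item)


def confidence(antecedent, consequent):
    """Share of baskets matching the antecedent that also match the consequent."""
    matching = [b for b in baskets.values() if antecedent <= b]
    return sum(consequent <= b for b in matching) / len(matching)


single_dim = confidence({"buys=computer"}, {"buys=printer"})
multi_dim = confidence({"income=high", "buys=computer"}, {"buys=printer"})
print(f"single-dimensional: {single_dim:.2f}, multidimensional: {multi_dim:.2f}")
```

On this toy data the rule restricted to high-income customers has higher confidence than the single-dimensional rule, which is the kind of difference multidimensional mining is meant to expose.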
(3) Multi-layer knowledge mining
Data and their associations can always be understood at several different conceptual levels. Recall the multi-level generalized knowledge mining problem described above: given certain background knowledge, relevant knowledge can be mined from a relational database at multiple conceptual levels. In 1995, Srikant and Agrawal developed the idea of multi-layer knowledge mining within a generalized knowledge mining framework and proposed concepts such as R-interest. Another representative work is Han's research on multi-layer knowledge mining in large databases.
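To make the idea of conceptual levels concrete, the sketch below counts the support of an itemset at a low level and again after every item has been replaced by its parent concept. The taxonomy and transactions are hypothetical, and this is only a simplified illustration, not the algorithm of Srikant and Agrawal or of Han.

```python
# Hypothetical concept hierarchy (background knowledge): item -> parent concept.
taxonomy = {
    "laptop": "computer",
    "desktop": "computer",
    "inkjet printer": "printer",
    "laser printer": "printer",
}

# Hypothetical transactions recorded at the lowest conceptual level.
transactions = [
    {"laptop", "inkjet printer"},
    {"desktop", "laser printer"},
    {"laptop", "laser printer"},
    {"desktop"},
]


def generalize(transaction):
    """Replace each item by its parent concept; items without a parent are kept."""
    return {taxonomy.get(item, item) for item in transaction}


def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in txns) / len(txns)


low_level = support({"laptop", "inkjet printer"}, transactions)
high_level = support({"computer", "printer"}, [generalize(t) for t in transactions])
print(f"low level: {low_level:.2f}, high level: {high_level:.2f}")
```

A pattern that is too rare to be reported at the low level can be frequent once items are generalized, which is why the choice of level affects what knowledge is found.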
(4) Knowledge evaluation problems
In 1996, Chen and Han identified problems in mining strong association rules under Agrawal's rule discovery framework. Their example was a shopping basket database in which the Apriori algorithm found the association rule buy(x, 'computer games') => buy(x, 'videos') [support = 40%, confidence = 66%]. In fact, however, computer games and videos are negatively correlated: buying one of them actually lowers the likelihood of buying the other. Therefore, evaluating the knowledge produced by the traditional data mining framework is also a problem that must be solved when data mining is applied in practice to relational databases. In recent years, there has been much research on methods for evaluating and improving the knowledge mined from relational databases.
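A short calculation shows how a rule with high support and confidence can still describe a negative correlation. Only the 40% support and roughly 66% confidence come from the text; the transaction counts below, in particular the number of transactions containing videos, are assumptions chosen to be consistent with those two figures. Lift is used here as one common correlation measure, not necessarily the one used in the cited work.

```python
# Assumed counts, chosen to match the stated support (40%) and confidence (~66%).
n_total = 10_000   # all transactions (assumption)
n_games = 6_000    # transactions containing computer games (assumption)
n_videos = 7_500   # transactions containing videos (assumption)
n_both = 4_000     # transactions containing both (assumption)

support = n_both / n_total        # P(games and videos)
confidence = n_both / n_games     # P(videos | games)
p_videos = n_videos / n_total     # P(videos), regardless of games

# Lift below 1 means buying games makes buying videos less likely than average.
lift = confidence / p_videos

print(f"support={support:.2f}, confidence={confidence:.2f}")
print(f"P(videos)={p_videos:.2f}, lift={lift:.2f}")
```

Under these assumed counts, customers overall are more likely to buy videos than customers who buy computer games, so the rule is misleading even though it passes the support and confidence thresholds.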
(5) Constraints on data mining
A data mining system that runs under the user's guidance can improve the efficiency and accuracy of mining. Research on such constraints is a broad topic. In visual and interactive data mining, the input and use of user constraints is a prerequisite. For relational databases, the complexity of the attributes (such as the existence of a large number of attributes), the implicit storage of attribute associations, and the multi-table and multi-layer concepts make constrained data mining even more important.

Data mining in relational databases is a field of high application value, and many of its topics require further research. Moreover, this research is not isolated. It relies on the theoretical framework that is gradually taking shape, and it already interacts with and complements work on other data storage types, such as transaction databases and data warehouses.

II. Data Mining in Data Warehouses
Data in a data warehouse is organized by subject, and the stored data can provide information from a historical point of view. Because data from multiple sources has been cleaned and transformed, a data warehouse can provide an ideal environment for data mining to discover knowledge. If a data warehouse is built on a multidimensional data model or a multidimensional data cube model, operators based on the data cube can achieve efficient computation and fast access. Although some current data warehouse tools can help with data analysis, new techniques are still required to discover the knowledge patterns hidden in the data and to complete high-level work based on knowledge engineering methods. It is therefore necessary to study data mining techniques for data warehouses.
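As a rough illustration of a cube-style operator, the sketch below aggregates a small fact table along chosen dimensions (a roll-up). The fact records, the dimension names, and the `roll_up` helper are hypothetical; real data warehouses typically precompute and index such aggregates rather than scanning the facts each time.

```python
from collections import defaultdict

# Hypothetical fact records: (year, region, product, sales amount).
facts = [
    (2000, "north", "computer", 120),
    (2000, "north", "printer", 40),
    (2000, "south", "computer", 80),
    (2001, "north", "computer", 150),
    (2001, "south", "printer", 30),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(facts, keep):
    """Sum the sales measure over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)


by_year = roll_up(facts, ["year"])
by_region_product = roll_up(facts, ["region", "product"])
print(by_year)
print(by_region_product)
```

Each call corresponds to viewing the same data at a different level of aggregation, which is the kind of access pattern a multidimensional model is designed to make fast.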
Data mining not only emerged along with the data warehouse, but has also raised many new issues as applications have deepened. If we regard data mining as an advanced data analysis tool, then it developed along with data warehouse technology. The emergence of data warehouse technology also gave rise to Online Analytical Processing (OLAP) applications. Although OLAP differs from data mining in many respects, the two overlap to a large degree in their application goals: neither is satisfied with the simple online queries that traditional databases support, and both pursue advanced analysis over large datasets. Objectively speaking, data mining focuses more on the knowledge representation patterns formed after data analysis, while OLAP focuses more on data aggregation using multidimensional and other advanced data models. In a sense, we can regard data mining as an advanced form of OLAP, and a closer term may be OLAM (Online Analytical Mining). Since data warehouse, OLAP, and data mining technologies were all proposed for advanced data analysis applications, they were often studied together in the early days. As research has deepened, each has developed its own focus in both research and application.

III. Data Mining in new databases developed based on Relational Models
New kinds of databases, such as object-oriented databases, object-relational databases, and deductive databases, have become new research objects for data mining. These database systems were created and developed, as database technology advanced, to meet new application requirements. Data mining on these new database systems has become an unavoidable challenge.

IV. Application-oriented data mining in new data sources
Some databases for new applications, such as spatial databases, temporal databases, engineering databases, and multimedia databases, are now well developed. These applications need to process and analyze spatial data, temporal data, engineering design data, and multimedia data, and they require efficient data structures and practical methods for handling complex structures, variable-length records, and semi-structured or unstructured data. For example, a satellite image may represent data in raster form, while city map data may be in vector form. Such raster or vector data also contains a wealth of knowledge, and the techniques for mining it have their own characteristics. From satellite images used for climate analysis, we may want to know the correlation between altitude and climate. From a city map, we may want to know the relationship between high-income families and where they live. Temporal databases always contain temporal attributes, which are sensitive to changes over time. For example, stock data records a time-varying data sequence, from which we can mine trends in the data to help develop a sound investment strategy. Knowledge discovery on these datasets and databases provides fertile ground for data mining research and development.
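As one very simple way to look for a trend in a time-varying sequence such as stock prices, the sketch below smooths the series with a moving average and checks whether the smoothed values keep rising. The price values and the window size are hypothetical, and a moving average is only one of many possible techniques; it is not presented here as the method used in temporal data mining research.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily closing prices (illustrative data only).
prices = [10.0, 10.4, 10.1, 10.8, 11.2, 11.0, 11.6]

smoothed = moving_average(prices, window=3)

# The raw series dips on some days; the smoothed series shows the overall direction.
rising = all(later >= earlier for earlier, later in zip(smoothed, smoothed[1:]))
print([round(v, 2) for v in smoothed], "rising trend:", rising)
```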

V. Data Mining in Web Data sources
Web-oriented data mining is much more complex than data mining for databases and data warehouses, because Web data is complex. Some of it is unstructured (such as web pages), typically using long sentences or phrases to express document information. Some is semi-structured (such as email or HTML pages). Some, of course, is well structured (such as spreadsheets). Discovering general descriptive features of these composite objects is a task that data mining cannot avoid.
Web mining must face the following key issues.
(1) Heterogeneous data source environment
The information on Web sites forms a larger and more complex body of data. If we regard the information on each Web site as a data source, these data sources are heterogeneous, because the information and organization of each site are different. To use such massive data for data mining, we must first study the integration of heterogeneous data across sites. Only by integrating the data from these sites into a unified view can we obtain what we need. Second, we need to solve the problem of querying data on the Web, because if the required data cannot be obtained effectively, it cannot be analyzed, integrated, or processed.
(2) Semi-structured data structure
Data on the Web differs from data in traditional databases in that it is largely semi-structured. Web-oriented data mining must therefore be based on a semi-structured data model and on techniques for extracting such a model. Finding a suitable semi-structured data model is the key to solving the problem, and besides defining the model, a technique for extracting it from existing data is also required. Each site's data is designed independently, and the data itself is self-describing and changes dynamically, so Web-oriented data mining is a complex technology. XML (Extensible Markup Language) is a meta-markup language designed by the W3C to provide a format for describing structured data. The extensibility and flexibility of XML allow it to describe data from different kinds of application software, and thus to describe the data records collected from web pages. Because XML-based data is self-describing, it can be exchanged and processed without a separate internal description. XML can therefore easily combine data from different sources, which makes searching heterogeneous data possible and offers hope for overcoming the difficulties of Web data mining.
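The sketch below suggests how self-describing XML from two independently designed sites might be mapped into one unified view. The two XML documents, their element and attribute names, and the record layout are all hypothetical; only the standard-library `xml.etree.ElementTree` calls are real. It assumes each site's structure is already known, whereas model extraction as discussed above would have to discover it.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents from two sites that describe similar data differently.
site_a = """<catalog>
  <item><name>Laptop</name><price>900</price></item>
  <item><name>Printer</name><price>150</price></item>
</catalog>"""

site_b = """<products>
  <product title="Laptop" cost="870"/>
  <product title="Scanner" cost="120"/>
</products>"""


def records_from_a(xml_text):
    """Site A stores values in child elements."""
    root = ET.fromstring(xml_text)
    return [
        {"name": item.findtext("name"), "price": float(item.findtext("price"))}
        for item in root.iter("item")
    ]


def records_from_b(xml_text):
    """Site B stores values in attributes."""
    root = ET.fromstring(xml_text)
    return [
        {"name": p.get("title"), "price": float(p.get("cost"))}
        for p in root.iter("product")
    ]


# Unified view: every record has the same fields regardless of its source.
unified = records_from_a(site_a) + records_from_b(site_b)
laptop_prices = sorted(r["price"] for r in unified if r["name"] == "Laptop")
print(laptop_prices)
```

Once the records share one layout, the same query or mining step can run over data from both sites.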
(3) Dynamically changing application environment
First, information on the Web changes frequently; news, stock quotes, and similar information are updated in real time. This high rate of change is also reflected in dynamic links and random access to pages. Second, users on the Web are unpredictable: they have different knowledge backgrounds, interests, and purposes. Finally, the data environment on the Web is highly noisy. Research shows that only about 1% of the data on a Web site may be relevant to a specific mining topic. These factors must also be addressed in Web data mining.

References:

Han J et al. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
Agrawal R et al. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conf. on Management of Data. 1993: 207 ~ 216.
Http://www.acm.org/sigmod/sigmod02/eproceedings.
Agrawal R et al. Fast algorithms for mining association rules in large databases. In Proc. 20th Int. Conf. Very Large Data Bases. 1994: 487 ~ 499.
Srikant R and Agrawal R. Mining generalized association rules. In Proc. 21st Int. Conf. Very Large Data Bases. 1995: 407 ~ 419.
Han J et al. Discovery of multiple-level association rules from large databases. In Proc. 21st Int. Conf. Very Large Data Bases. Zurich, Switzerland. Sept. 1995: 420 ~ 431.
Brin S et al. Beyond market baskets: generalizing association rules to correlations. In Proc. 1997 ACM SIGMOD Int. Conf. Management of Data. Tucson, USA. 1997: 265 ~ 276.
Ahmed N et al. A note on "Beyond market baskets: generalizing association rules to correlations". SIGKDD Explorations. 2000, vol. 1: 48 ~ 48.
Pei J et al. Can we push more constraints into frequent pattern mining? In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining. Boston, USA. Aug. 2000.
Grahne G et al. Efficient mining of constrained correlated sets. In Proc. 2000 Int. Conf. Data Engineering. San Diego, USA. Feb. 2000: 512 ~ 521.
Http://www.dmgroup.org.cn/zs.htm.
Http://www.dmgroup.org.cn/ppt/XML%20Index&Join.ppt.
