Theory and methods of spatial data mining technology
Ge Ji-ke
(Information College, Southwest Agricultural University, Chongqing 400716)
Abstract: This paper briefly discusses the theory and characteristics of spatial database technology and spatial data mining technology, analyzes the levels and methods of spatial data mining, focuses on mining methods such as classification, clustering, and association rules, and points out the open problems, development trends, and directions of current spatial data mining technology.
Keywords: spatial data mining, classification, clustering, association rules
0 Introduction
A geographic information system (GIS) is a comprehensive technology that draws on computer science, geography, surveying, cartography, and related disciplines [1]. The basic technologies of GIS are spatial databases, map visualization, and spatial analysis, of which the spatial database is the key. As the most active branch of current database technology and a means of knowledge acquisition, spatial data mining (SDM) is being applied in GIS to push it toward greater intelligence and integration.
1 Characteristics of spatial databases and spatial data mining technology
With the development of database technology and the wide application of database management systems, the amount of data stored in databases is increasing rapidly, and a great deal of information relevant to decision making lies hidden behind these massive data. However, most database applications today still stop at the query and retrieval stage, so the rich knowledge hidden in the database is far from being fully exploited. The sharp growth of data stands in strong contrast to the difficulty of processing and understanding it, which leads to the phenomenon that "people are flooded with data but hungry for knowledge".
Beyond its explicit information, the spatial data in a spatial database (or data warehouse) contains abundant hidden information. A digital elevation model (DEM or TIN), for example, carries elevation information but also implies lithological and tectonic information. Likewise, the species of a plant is explicit information, but it also implies information about the horizontal and vertical zonality of the climate. Such implicit information can only be revealed through data mining. Spatial data mining (SDM), also called knowledge discovery from spatial databases, is a new branch of data mining that arose to cope with the massive volume of spatial data. It refers to the process of extracting implicit spatial or non-spatial patterns and general characteristics that are of interest to the user [2]. Because the objects of SDM are mainly spatial databases, which store not only the geometric and attribute data of spatial objects but also the spatial relationships between them, its processing methods differ from those of general data mining. The essential difference between SDM and traditional geographic data analysis is that SDM extracts information and discovers knowledge without a definite prior hypothesis. The knowledge obtained should have three characteristics: it should be previously unknown, valid, and practical.
Spatial data mining combines data mining technology with spatial database technology. It can be used to understand spatial data, to discover spatial relationships and relationships between spatial and non-spatial data, to construct spatial knowledge bases, to reorganize spatial databases, and to optimize queries.
2 Main methods and characteristics of spatial data mining technology
Commonly used spatial data mining techniques include sequence analysis, classification analysis, prediction, cluster analysis, association rule analysis, time series analysis, rough set methods, and cloud theory. In terms of mining tasks and mining methods, this paper focuses on three important and commonly used methods: classification analysis, cluster analysis, and association rule analysis.
2.1 Classification analysis
Classification is a very important task in data mining and currently the one with the most commercial applications. The purpose of classification is to learn a classification function or classification model (often called a classifier) that maps data items in a database to one of a given set of categories. Both classification and the familiar regression methods can be used for prediction: each automatically derives a generalized description of the given data from historical records so that future data can be predicted. Unlike regression, whose output is a continuous value, classification outputs a discrete category value. Classifiers are often represented as decision trees: starting from the root, the data values are tested and the matching branches are followed until a leaf is reached, and the leaf determines the category. The essence of spatial classification is the abstraction and generalization of a given set of data objects, which can be represented by a macro-tuple.
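The root-to-leaf walk described above can be sketched in a few lines of Python. This is only an illustrative sketch: the tree, the attribute names (`distance_to_center_km`, `distance_to_park_km`), the thresholds, and the grades are hypothetical and do not come from this paper.

```python
# Hypothetical decision tree for grading housing by location.
# Attribute names, thresholds, and grades are illustrative assumptions.
TREE = {
    "attribute": "distance_to_center_km",
    "threshold": 5.0,
    "below": {
        "attribute": "distance_to_park_km",
        "threshold": 1.0,
        "below": "high",
        "above": "medium",
    },
    "above": "low",
}


def classify(record, node=TREE):
    """Walk from the root to a leaf and return the leaf's category."""
    while isinstance(node, dict):
        # Internal node: compare one attribute value with the threshold
        # and follow the matching branch.
        value = record[node["attribute"]]
        node = node["below"] if value < node["threshold"] else node["above"]
    return node  # a leaf is a plain category string


house = {"distance_to_center_km": 2.0, "distance_to_park_km": 0.4}
print(classify(house))  # prints "high"
```

In a real system the tree would be learned from training data rather than written by hand; the traversal step stays the same.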
To construct a classifier, a training sample dataset is needed as input. The training set consists of a set of database records or tuples. Each tuple is a feature vector made up of field values (fields are also called attributes or features), and each training sample also carries a category label. A sample therefore has the form (v1, v2, ..., vn; c), where vi denotes a field value and c denotes the category.
Classifiers can be constructed by statistical methods, machine learning methods, neural network methods, and others. Statistical methods include Bayesian methods and nonparametric methods (nearest-neighbor learning or case-based learning); the corresponding knowledge representations are discriminant functions and prototype cases. Machine learning methods include the decision tree method and the rule induction method; the former yields a decision tree or discriminant tree, and the latter generally yields production rules. The main neural network method is the back-propagation (BP) algorithm, whose model is a feed-forward neural network (composed of nodes representing neurons and edges representing connection weights); the BP algorithm is essentially a nonlinear discriminant function [3]. In addition, a newer method has recently emerged, the rough set method, whose knowledge representation is production rules.
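As a minimal sketch of the nonparametric (nearest-neighbor) approach mentioned above, the following Python code stores training samples in the form (v1, ..., vn; c) and assigns a new feature vector the category of its closest training sample. The coordinates and category labels are made-up values used only for illustration.

```python
import math

# Training samples in the form (v1, ..., vn; c): a feature vector plus a label.
# The values and labels below are invented for illustration.
TRAINING = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((8.0, 8.0), "B"),
    ((9.0, 7.5), "B"),
]


def nearest_neighbor(features, training=TRAINING):
    """Return the category of the training sample closest to `features`."""
    # Euclidean distance is an assumption; other metrics could be used.
    _, label = min(training, key=lambda s: math.dist(s[0], features))
    return label


print(nearest_neighbor((2.0, 1.0)))  # prints "A"
print(nearest_neighbor((7.0, 9.0)))  # prints "B"
```

Here the "model" is simply the stored prototype cases, which matches the knowledge representation attributed to nonparametric methods above.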
Different classifiers have different characteristics. There are three criteria for evaluating or comparing classifiers: (1) prediction accuracy; (2) computational complexity; (3) simplicity of the model description. Prediction accuracy is the most widely used criterion, especially for predictive classification tasks, and the currently accepted method for estimating it is 10-fold cross-validation. Computational complexity depends on the specific implementation details and the hardware environment; because data mining operates on huge databases, space and time complexity are very important considerations. For descriptive classification tasks, the simpler the model description, the better it is received. For example, classifiers expressed as rules produced by rule induction are easy to use, whereas the results of neural network methods are difficult to understand.
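The cross-validation procedure can be sketched as follows. This is a simplified illustration, not a reference implementation: the data are synthetic, the classifier is a plain nearest-neighbor rule chosen for brevity, and the folds are formed by striding through the list without shuffling (real evaluations usually shuffle or stratify first).

```python
import math


def nearest_neighbor(train, features):
    """Classify `features` with the label of the closest training sample."""
    return min(train, key=lambda s: math.dist(s[0], features))[1]


def cross_validate(samples, k=10, classify=nearest_neighbor):
    """Estimate prediction accuracy by k-fold cross-validation."""
    # Split the samples into k folds; fold i takes every k-th sample from i.
    folds = [samples[i::k] for i in range(k)]
    correct = 0
    for i, test_fold in enumerate(folds):
        # Train on all folds except the i-th, then test on the i-th.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct += sum(classify(train, x) == c for x, c in test_fold)
    return correct / len(samples)


# Synthetic data: two well-separated groups of ten points each.
samples = [((float(i), 0.0), "west") for i in range(10)]
samples += [((float(i) + 100.0, 0.0), "east") for i in range(10)]

accuracy = cross_validate(samples, k=10)
print(f"10-fold accuracy: {accuracy:.2f}")
```

Each sample is used for testing exactly once and for training in the other k - 1 rounds, which is why the estimate is less optimistic than accuracy measured on the training data itself.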
It should also be noted that classification performance generally depends on the characteristics of the data. Some data are noisy, some have missing values, some are sparsely distributed, some have strongly correlated fields or attributes, and attributes may be discrete, continuous, or mixed. It is generally accepted that no single method suits data of every kind.
Classification is very important in practical applications, for example, determining the grade of a house from its location.
2.2 Cluster analysis
Clustering refers to the process of grouping unlabeled samples into different groups according to the principle that "birds of a feather flock together", and of describing each such group. The aim is for samples in the same group to be similar to each other and for samples in different groups to be sufficiently dissimilar. Unlike classification analysis, clustering does not know in advance which groups, or how many groups, the data will be divided into, nor which spatial rules define the groups. The goal is to discover the functional relationships among the attributes of spatial entities, with the mined knowledge expressed as mathematical equations in which the attributes are the variables. Clustering methods include statistical methods, machine learning methods, neural network methods, and database-oriented methods. Spatial data mining algorithms based on cluster analysis include mean approximation algorithms [4], CLARANS, BIRCH, DBSCAN, and others. Cluster analysis of spatial data is currently a research hotspot.
For spatial data, cluster analysis can automatically divide an area into regions according to location and the presence of obstacles. For example, regions can be delineated from the geographic distribution of the residents served by ATMs; with this information ATM placement can be planned effectively, avoiding waste while also avoiding missed opportunities.
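To make density-based clustering of locations concrete, here is a compact, unoptimized Python sketch in the spirit of DBSCAN, one of the algorithms named above. The point coordinates and the parameter values `eps` and `min_pts` are assumptions chosen for illustration, and the neighbor search is a brute-force scan rather than the spatial index a real implementation would use. Obstacles are not modeled.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force search: all points within eps of point i (including i).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Hypothetical resident locations: two dense neighborhoods and one outlier.
residents = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (50, 50)]
print(dbscan(residents, eps=1.5, min_pts=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each cluster found this way could be treated as one candidate service region, while points labeled -1 are isolated locations that belong to no dense region.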
2.3 Association rule analysis
Association rule analysis is mainly used to find relationships between different events, that is, cases where one event tends to occur when another occurs. Its focus is on quickly identifying associated events that have practical value. The main criterion is that the probability of occurrence and the conditional probability must reach a certain level of statistical significance. A spatial association rule has the form X -> Y [s%, c%], where X and Y are sets of spatial or non-spatial predicates, s% is the support of the rule, and c% is its confidence. There are three types of spatial predicates: predicates that represent topological relationships, predicates that represent spatial orientation, and predicates that represent distance [5]. A variety of spatial predicates can make up spatial association rules, for example distance information (such as close_to and far_away), topological relationships (such as intersect, overlap, and disjoint), and spatial orientation (such as right_of and west_of). In practice, most algorithms adapt existing algorithms to the characteristics of spatial data so that they are suitable for mining associations in spatial data. Such rules make it possible to determine the geographic location of one spatial entity from that of another, which benefits spatial location queries and the reconstruction of spatial entities. The general algorithm can be described as follows: (1) find the relevant spatial data according to the query request; (2) use the proximity principle to describe spatial attributes and specific attributes; (3) filter out unimportant data according to the minimum support; (4) further refine the data by other means (for example, overlay); and (5) generate association rules.
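The support s% and confidence c% of a rule X -> Y can be computed directly from their usual definitions: support is the fraction of objects for which both X and Y hold, and confidence is the fraction of objects satisfying X that also satisfy Y. The Python sketch below assumes that the spatial predicates have already been evaluated for each object and stored as sets of strings; the five town records are invented for illustration.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Each record is the set of predicates that hold for one spatial object.
    """
    n_x = sum(antecedent <= r for r in records)                  # X holds
    n_xy = sum((antecedent | consequent) <= r for r in records)  # X and Y hold
    support = n_xy / len(records)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Made-up predicate sets for five towns.
towns = [
    {"is_large", "close_to(highway)", "adjacent_to(water)"},
    {"is_large", "close_to(highway)", "adjacent_to(water)"},
    {"is_large", "close_to(highway)"},
    {"is_large", "adjacent_to(water)"},
    {"close_to(highway)"},
]
x = {"is_large", "close_to(highway)"}
y = {"adjacent_to(water)"}

support, confidence = rule_stats(towns, x, y)
print(f"X -> Y [{support:.0%}, {confidence:.0%}]")  # prints X -> Y [40%, 67%]
```

A rule would be kept only if both values reach user-chosen minimum thresholds, which corresponds to the filtering in step (3) of the procedure above.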
Association rules can usually be divided into two types: Boolean association rules and multi-valued association rules. Multi-valued association rules are more complex, and a natural idea is to convert them into Boolean association rules. Because mining spatial association rules requires computing many spatial relationships among a large number of spatial objects, the cost is very high. A stepwise refinement strategy can be used to optimize spatial association analysis: a fast algorithm first mines the large dataset roughly, and a more expensive algorithm then improves the quality of the results on the reduced dataset. Even so, because of its very high cost, spatial association analysis needs further optimization.
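The stepwise refinement idea can be illustrated with a small filter-and-refine sketch in Python for a close_to predicate. It is a simplification under stated assumptions: objects are reduced to points, the coarse step is a cheap per-axis comparison (a production system would more likely use a spatial index such as an R-tree and bounding rectangles), and all names and coordinates are hypothetical.

```python
import math


def close_to_pairs(towns, lakes, max_dist):
    """Find (town, lake) pairs within max_dist using two passes."""
    # Step 1 (coarse, cheap): keep pairs whose coordinates differ by at most
    # max_dist on each axis, i.e. the lake lies in a square around the town.
    # This never discards a true match but may keep false candidates.
    candidates = [
        (town, lake)
        for town, (tx, ty) in towns.items()
        for lake, (lx, ly) in lakes.items()
        if abs(tx - lx) <= max_dist and abs(ty - ly) <= max_dist
    ]
    # Step 2 (fine, costlier): exact Euclidean distance on the survivors only.
    result = [
        (town, lake)
        for town, lake in candidates
        if math.dist(towns[town], lakes[lake]) <= max_dist
    ]
    return candidates, result


# Hypothetical coordinates in arbitrary map units.
towns = {"T1": (0.0, 0.0), "T2": (10.0, 10.0)}
lakes = {"L1": (2.0, 3.0), "L2": (14.0, 14.0)}

candidates, result = close_to_pairs(towns, lakes, max_dist=5.0)
print(candidates)  # prints [('T1', 'L1'), ('T2', 'L2')]
print(result)      # prints [('T1', 'L1')]
```

The pair (T2, L2) passes the coarse test but fails the exact one, which shows why the expensive computation only needs to run on the reduced candidate set.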
For spatial data, association rule analysis can reveal associations between geographic locations. For example, it may find that 85% of the large towns near highways are adjacent to water, or that parking lots are adjacent to golf courses.
3 Research directions of spatial data mining technology
3.1 Processing different types of data
Most databases are relational, so performing data mining effectively on relational databases is critical. However, different application areas have many kinds of data and databases, which often contain complex data types such as structured data, complex objects, transaction data, and historical data. Because data types are diverse and mining goals differ, no single data mining system can handle all kinds of data. Specific data mining systems therefore need to be built for specific data types.
3.2 Validity and testability of data mining algorithms
Massive databases typically have hundreds of attributes and tables and millions of tuples. GB-level databases are not uncommon and TB-level databases have emerged. High-dimensional large databases not only enlarge the search space but also increase the likelihood of discovering erroneous patterns. It is therefore necessary to use domain knowledge to reduce dimensionality and eliminate irrelevant data in order to improve the efficiency of the algorithms. Algorithms that extract knowledge from a large spatial database must be efficient and testable; that is, the running time of a data mining algorithm must be predictable and acceptable, and algorithms with exponential or even medium-order polynomial complexity are not practical. In addition, when an algorithm uses a finite amount of data to find suitable parameters for a particular model, it can sometimes fit those data too closely, which reduces its effectiveness.
3.3 Interactive User Interface
The results of data mining should accurately reflect the mining requirements and be easy to express. Discovered knowledge should be examined from different perspectives and expressed in different forms, with high-level languages and graphical interfaces used to express both the mining requirements and the results. At present, many knowledge discovery systems and tools lack interaction with users, which makes it difficult to use domain knowledge effectively. Bayesian methods and the interpretive capabilities of the database can be used to help discover knowledge.
3.4 Interactive mining of knowledge on multiple abstraction layers
It is difficult to predict what knowledge will be mined from a database, so a high-level data mining query should serve as a clue for further exploration. Interactive mining lets users define a mining request interactively, deepen the mining process step by step, and flexibly view the results at multiple abstraction levels and from different angles.
3.5 Mining information from different data sources
LANs, WANs, and the Internet link multiple data sources into large distributed and heterogeneous databases, and mining knowledge from formatted and unformatted data with different semantics is a challenge for data mining. Data mining can reveal knowledge in large heterogeneous databases that ordinary queries cannot discover. The large scale and wide distribution of the data and the computational complexity of data mining methods call for parallel and distributed data mining.
3.6 Privacy and security
Data mining can view data from different angles and at different levels of abstraction, which affects the privacy and security of the data. Studying the intrusions on data that data mining can cause makes it possible to improve database security methods and prevent information leakage.
3.7 Integration with other systems
A discovery system built on a single method and a single function is necessarily limited in its scope of application. To discover knowledge across a wider range of domains, a spatial data mining system should integrate databases, knowledge bases, expert systems, decision support systems, visualization tools, and network technology.
4 Issues to be studied
Although great progress has been made in the research and application of spatial data mining technology, some theoretical and practical problems still need to be solved.
4.1 Efficiency and scalability of data access
Given the complexity and sheer volume of spatial data, the emergence of terabyte-scale databases will enlarge the search space of discovery algorithms and make the search more blind. How to remove task-irrelevant data effectively, reduce the dimensionality of the problem, and design more efficient mining algorithms poses a great challenge for spatial data mining.
4.2 Overcoming the lack of temporal attributes and the static storage of some current GIS software
Because data mining applications often depend on time series relationships, static data storage seriously hinders the application of data mining. The layer-based computing model and the complete separation between spatial data at different scales also create many obstacles to spatial data mining. The connection between a spatial entity and its attribute data depends only on an identification code. This one-dimensional connection inevitably loses a great deal of relational information and cannot effectively express implicit internal relationships, which increases the computational complexity of data mining and greatly increases both the workload of the data preparation phase and the degree of human intervention.
4.3 Refining discovered patterns
When the discovery space is large, a large number of results are obtained, some of which are irrelevant or meaningless patterns. Domain knowledge can then be used to further refine the discovered patterns and obtain meaningful knowledge.
In the field of spatial data mining, important directions for research and application include data mining in network environments, mining of integrated raster-vector data, data mining under uncertainty, data mining in distributed environments, data mining query languages, and new efficient mining algorithms.
5 Summary
With the development of GIS, data mining, and related fields, spatial data mining technology is growing in both breadth and depth. In the near future, integrated GIS, GPS, and RS systems that incorporate mining technology will develop toward greater intelligence, networking, globalization, and popularization.
References:
[1] Shelen, et al. Geographic Information Systems: Principles, Methods and Applications. Science Press, 2001.
[2] Di Kechang. Theory and method of spatial data mining and knowledge discovery [D]. Wuhan: Wuhan University of Surveying and Mapping, 1999.
[3] Zai Zixing, Xuguangtian. Artificial Intelligence and Its Applications. Tsinghua University Press, 1999. 206~216.
[4] Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In: Proceedings of the 24th International Conference on Very Large Data Bases. New York, 1998. 428~439.
[5] Ju Jianchuo, Zhangxiaohui, Mushroom Wei Jie, Zhu Yangyang. A brief analysis of data mining language [Z]. http://www.sqlmine.com/warehouse/htm/40.htm.
The Technology and Methods of Spatial Data Mining
Ge Ji-ke
(Information College, Southwest Agricultural University, Chongqing 400716)
Abstract: This paper introduces the theory and characteristics of spatial databases and spatial data mining, analyses the hierarchy, methods, and knowledge classification of spatial data mining, introduces spatial classification rules, spatial clustering rules, and spatial association rules, and points out unsolved questions, trends, and directions.
Key words: spatial data mining, classification, clustering, association rules