1. Data mining refers to the process of extracting useful patterns and knowledge from large amounts of data.
(1) Everyday life and work constantly generate large amounts of data, and this data needs to be turned into useful information and knowledge; the growing demand makes data mining technology increasingly important. Data mining should therefore be seen as the result of the natural evolution of information technology.
(2) Data mining should be regarded as a combination of these technologies rather than a simple extension of any one of them.
(3) Database technology has driven the development of data collection techniques and mechanisms for building databases, together with effective data management, including mechanisms for data storage, retrieval, query, and transaction processing. The large number of database systems providing query and transaction processing naturally creates a need for data analysis and understanding, which is the driving force behind the emergence of data mining.
(4) Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation (a small sketch of these steps as a pipeline follows below).
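As a rough illustration only, the following minimal Python sketch maps each of these steps onto a line of code, using hypothetical pandas tables (`raw`, `profiles`) and scikit-learn for the mining and evaluation steps; all column names and parameters are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical input tables; the column names are assumptions for illustration.
raw = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                    "amount": [10.0, None, 25.0, 5.0, 40.0]})
profiles = pd.DataFrame({"user_id": [1, 2, 3], "age": [23, 35, 52]})

cleaned = raw.dropna()                                     # 1. data cleaning
integrated = cleaned.merge(profiles, on="user_id")         # 2. data integration
selected = integrated[["amount", "age"]]                   # 3. data selection
transformed = StandardScaler().fit_transform(selected)     # 4. data transformation
model = KMeans(n_clusters=2, n_init=10).fit(transformed)   # 5. data mining
score = silhouette_score(transformed, model.labels_)       # 6. pattern evaluation
print(f"clusters: {model.labels_}, silhouette: {score:.2f}")  # 7. knowledge presentation
```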
2. Similarities and differences between a database and a data warehouse
Differences: (1) A database is designed to be transaction-oriented, while a data warehouse is designed to be subject-oriented.
(2) A database generally stores online transactional data, while a data warehouse generally stores historical data. Database design tries to avoid redundancy as far as possible and usually follows normalized schema design rules; data warehouse design deliberately introduces redundancy and adopts a denormalized design.
(3) A database is designed for capturing data, whereas a data warehouse is designed for analyzing data; its two basic elements are dimension tables and fact tables. A dimension is an angle from which to view a problem, such as time or department; dimension tables define these dimensions, and fact tables hold the data to be queried together with the IDs of the dimensions (a small sketch of this idea follows below).
Similarities: both a data warehouse and a database are storage systems for data or information, and both store large amounts of persistent data.
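A minimal pandas sketch of the dimension-table / fact-table (star schema) idea mentioned above; the table and column names (`dim_time`, `dim_dept`, `fact_sales`) are hypothetical.

```python
import pandas as pd

# Dimension tables: the angles from which the facts are viewed.
dim_time = pd.DataFrame({"time_id": [1, 2], "quarter": ["2023Q1", "2023Q2"]})
dim_dept = pd.DataFrame({"dept_id": [10, 20], "dept_name": ["sales", "support"]})

# Fact table: the measures to be queried, keyed by dimension IDs.
fact_sales = pd.DataFrame({"time_id": [1, 1, 2], "dept_id": [10, 20, 10],
                           "revenue": [100.0, 40.0, 120.0]})

# A typical warehouse-style query: join facts to dimensions, then aggregate.
report = (fact_sales.merge(dim_time, on="time_id")
                    .merge(dim_dept, on="dept_id")
                    .groupby(["quarter", "dept_name"])["revenue"].sum())
print(report)
```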
3. Data characterization: a summarization of the general characteristics or features of the target class of data.
Data discrimination: a comparison of the general features of target-class data objects with the general features of objects from one or more contrasting classes. Example: a consumption index computed for each quarter of a user's spending and compared across classes of users.
Association and correlation analysis: if a relationship exists between two or more things, one can be predicted from another; the goal is to mine such associations between data. Example: mining a shopping site for the product demands of users of different ages.
Classification: classification techniques can extract from a data set a function or model (also called a classifier) that describes data classes and assigns each object in the data set to a known class. Example: dividing the users in a credit-card database into high, medium, and low classes.
Regression: the study of the functional relationship between the dependent and independent variables in the data. Example: as the seasons alternate, the sales volume of a commodity is a function of time.
Clustering: clustering is unsupervised learning. In other words, clustering groups information according to the principle of similarity, without classes being defined in advance. The goal of clustering is to make the differences between objects in the same cluster as small as possible, and the differences between objects in different clusters as large as possible. Example: clustering users by their consumption habits and pushing different services to each group (see the sketch after this list).
Outlier analysis: using statistical distributions, density (local outliers, suitable for non-uniformly distributed data), distance (requires parameter setting), deviation, and similar techniques to find data that is inconsistent with the general behaviour of the data set. Example: identifying customers with particularly strong purchasing power.
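A small Python sketch of two of these functionalities, characterization and clustering, on made-up per-user consumption data; the feature names and the cluster count are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-user consumption features; names are illustrative only.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "monthly_spend": [50, 60, 300, 320, 55, 310],
    "orders_per_month": [2, 3, 12, 15, 2, 14],
})

# Characterization: summarize the general features of the group.
print(users[["monthly_spend", "orders_per_month"]].describe())

# Clustering: group users with similar consumption habits, without predefined labels.
km = KMeans(n_clusters=2, n_init=10).fit(users[["monthly_spend", "orders_per_month"]])
users["segment"] = km.labels_
print(users)
```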
1.4 A typical e-commerce site needs to push advertisements based on users' purchase records, and this requires data mining.
This cannot be done simply with database queries and statistics. For example, a user may buy goods for relatives and friends, so in data mining the judgement of gender and age may be refined into natural age versus purchase age, natural gender versus purchase gender, and so on. Only such finer-grained analysis can push more accurate information to users and attract customers (a hypothetical sketch follows).
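As a purely hypothetical illustration of the "purchase gender" idea, one could derive it as a feature from what a user actually buys rather than from profile fields; the category labels and the category-to-gender mapping below are invented for the sketch.

```python
import pandas as pd

# Hypothetical purchase log; the categories and the mapping are assumptions.
purchases = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "category": ["men_shoes", "cosmetics", "cosmetics", "toys", "men_shoes"],
})
category_gender = {"men_shoes": "male", "cosmetics": "female", "toys": "unknown"}

purchases["purchase_gender"] = purchases["category"].map(category_gender)

# "Purchase gender" per user: the dominant gender signal in what they buy,
# which may differ from the natural gender recorded in the user's profile.
dominant = (purchases.groupby("user_id")["purchase_gender"]
            .agg(lambda s: s.value_counts().idxmax()))
print(dominant)
```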
1.5 (1) The difference between discrimination and classification is that the former focuses on comparing the general features of contrasting-class data with those of the target class, while the latter first finds a set of models that describe or distinguish data classes or concepts and then uses those models to predict and estimate the class labels of unknown data. The similarity is that both process and analyse class-labelled data.
(2) The difference between characterization and clustering is that the former finds the general properties or features of target-class data, while the latter focuses on analysing data objects without class labels. The similarity is that both analyse and process highly similar data objects or groups of objects.
(3) The difference between classification and prediction is that the former finds a set of models that describe or distinguish data classes or concepts, while the latter predicts missing or hard-to-obtain, usually numeric, data values. The similarity is that both are predictive tools: classification is used to predict the class labels of data objects, while prediction is mainly used for missing numeric data values (see the sketch after this list).
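A minimal scikit-learn sketch of this contrast on made-up data: the classifier predicts a categorical label, while the regressor (prediction) estimates a numeric value. The feature and label choices are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up training data: [monthly_income, years_as_customer]
X = [[2000, 1], [2500, 2], [8000, 5], [9000, 7], [4000, 3], [7500, 6]]

# Classification: predict a categorical label (credit class).
y_class = ["low", "low", "high", "high", "medium", "high"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[6000, 4]]))   # -> a predicted class label

# Prediction / regression: predict a numeric value (expected yearly spend).
y_value = [300.0, 350.0, 2200.0, 2600.0, 900.0, 2100.0]
reg = DecisionTreeRegressor().fit(X, y_value)
print(reg.predict([[6000, 4]]))   # -> a numeric estimate
```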
1.6 For example, social networks often rank recent hot terms and topics. This requires analysing large numbers of user-submitted blogs, posts, messages, and tweets, and combining supervised and unsupervised methods to produce a top-10 list of hot terms (a toy sketch of the counting step is given below).
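A toy sketch of the unsupervised counting step (ranking frequent terms over user posts); a real system would combine this with supervised filtering of spam and near-duplicates. The posts and stop-word list are invented.

```python
from collections import Counter
import re

posts = [
    "new phone launch today, the phone looks great",
    "phone launch event was crowded",
    "weather is great for a launch party",
]
stop_words = {"the", "a", "is", "was", "for", "today"}

words = []
for post in posts:
    # Lowercase, tokenize, and drop stop words.
    words += [w for w in re.findall(r"[a-z]+", post.lower()) if w not in stop_words]

# Top hot terms by raw frequency.
print(Counter(words).most_common(10))
```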
1.7 (1) Statistical methods. A statistical method is a model-based approach: it builds a model for the data and evaluates objects by how well they fit the model. Most statistical methods for outlier detection construct a probability distribution model and consider how likely an object is under that model. Probabilistic definition of an outlier: an outlier is an object that has low probability under the probability distribution model of the data. The premise is that one must know what distribution the data set follows; if the distribution is estimated incorrectly (for example, the data actually follows a heavy-tailed distribution), the detection results will be wrong. Mixture-model approach to anomaly detection: the data is modelled as a mixture of two distributions, one for normal data and one for outliers.
(2) Proximity-based outlier detection. An object is anomalous if it is far from most other points. This approach is more general and easier to use than statistical methods, because it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. The outlier score of an object is given by the distance to its k nearest neighbours, and the score is highly sensitive to the value of k. If k is too small (for example, 1), a small number of nearby outliers may lead to low outlier scores; if k is too large, all objects in a cluster with fewer than k points may become outliers. To make the scheme more robust to the choice of k, the average distance to the k nearest neighbours can be used (see the sketch after this list).
(3) Density-based outlier detection. From the density-based point of view, outliers are objects in low-density regions. The outlier score of an object is the inverse of the density around it. Density-based outlier detection is closely related to proximity-based detection, because density is usually defined in terms of proximity. Any density definition used for detecting outliers has features and limitations similar to those of the proximity-based schemes.
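A short sketch of the proximity-based score (average distance to the k nearest neighbours) and a density-based score (LOF) on made-up one-dimensional data, using scikit-learn; the value of k and the data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

# Made-up 1-D spending data with one obvious outlier (the last point).
X = np.array([[10.], [12.], [11.], [13.], [12.], [95.]])
k = 3

# Proximity-based score: average distance to the k nearest neighbours.
# k + 1 because each point's neighbour set includes the point itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
knn_scores = dist[:, 1:].mean(axis=1)
print(knn_scores)   # the last point gets a much larger score than the rest

# Density-based score: Local Outlier Factor (higher means more anomalous).
lof = LocalOutlierFactor(n_neighbors=k).fit(X)
print(-lof.negative_outlier_factor_)
```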
1.8 (1) The data mining needs of different users are not the same, and different users may be interested in different kinds of knowledge. Data mining therefore needs to cover a wide range of knowledge discovery tasks.
(2) Mining knowledge at multiple levels of abstraction: the data mining process needs to be interactive, because interaction allows the user to focus the search for patterns and to refine data mining requests based on the returned results.
(3) Pattern evaluation: this refers to the interestingness problem. Patterns that merely represent common knowledge or lack novelty are not interesting, so discovered patterns must be evaluated for interestingness.
1.9 Massive data is not only large in volume but also multi-source, heterogeneous, multi-modal, and complexly interconnected. These features pose huge challenges to data mining: (1) Mining efficiency: faced with the huge volume of heterogeneous data brought by Internet applications (it was expected that by 2020 the explosively growing data volume would exceed 35 ZB, where 1 ZB = 1 billion TB), current parallel mining algorithms are inefficient. (2) Multi-source data: the most notable feature of web data is that it is semi-structured, e.g. documents, reports, web pages, sound, images, and video; how to extract the effective data is a challenge. (3) For different data mining tasks, integrating and mining the various kinds of application data and distilling an infrastructure suitable for efficiently using business information is also a challenge.
1.10 In the field of spatio-temporal data mining, much valuable work is organized by mining task, mainly spatio-temporal pattern discovery, spatio-temporal clustering, spatio-temporal anomaly detection, and spatio-temporal prediction and classification. How to combine spatio-temporal reasoning with data mining, and how to integrate geographic information systems with data mining effectively, are major challenges. Key challenges include: how to use data mining techniques to extract the knowledge and rules hidden in the spatial data of a spatial database; how data mining algorithms obtain data from a spatial database; and how to mine, automatically or semi-automatically, spatial patterns that are previously unknown but potentially useful.