J. H. Friedman
Department of Statistics and Stanford Linear Accelerator Center, Stanford University
Abstract: Data mining (DM) is a discipline that reveals patterns in data and relationships among data, with an emphasis on processing large observational databases. It is a boundary discipline involving database management, artificial intelligence, machine learning, pattern recognition, and data visualization. From a statistical point of view, it can be seen as computer-automated exploratory analysis of large, complex data sets. Although its importance has been somewhat exaggerated, the field is having a major impact on business, industry, and science, and it is also stimulating a great deal of research aimed at developing new methods. Yet, although there is a clear link between data mining and statistical data analysis, most of the methodology used in data mining has so far been developed outside the discipline of statistics. This article explains this phenomenon and argues why statisticians should pay attention to data mining. Statistics could have a major influence on data mining, but this may require statisticians to change some of their basic ideas and operating principles.

1. Preface

The views in this article are those of the author alone. They do not necessarily reflect the views of the editors, the sponsors, Stanford University, or my colleagues. The theme of the 29th Symposium on the Interface (May 1997, Houston, TX) is the mining and analysis of big data sets. The subject of this meeting is the same as that of a meeting on the analysis of large and complex data sets organized by Leo Breiman two decades ago and sponsored by the ASA and IMS. Twenty years later, it is entirely appropriate to review what we have accomplished in the interim. This article discusses the following questions:
What is data mining?
What is statistics?
What is the relationship between them (if any)?
What can statisticians do about it (if anything)?
Should we want to?

2. What is data mining?

The definition of data mining is rather vague; it depends on the viewpoint and background of the definer. Here are some definitions from the DM literature:

Data mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. -- Fayyad

Data mining is the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. -- Zekulin

Data mining is a set of methods used in the knowledge discovery process to distinguish previously unknown relationships and patterns within data. -- Ferruzza

Data mining is the process of discovering advantageous patterns in data. -- John

Data mining is a decision support process in which we search large databases for unknown patterns of information. -- Parsaye

Data mining is...
. Decision trees
. Neural networks
. Rule induction
. Nearest neighbor methods
. Genetic algorithms
-- Mehta

Although these definitions of data mining are somewhat intangible, data mining has become a business. As in the gold rushes of the past, the goal is to 'mine the miners': the largest profits come from selling tools to the miners, rather than from doing the actual mining. The concept of data mining is used as a device to sell computer hardware and software.
Hardware manufacturers emphasize that data mining demands high computing power: very large databases must be stored, read, and written quickly, and computationally intensive methods must be applied to the data. This requires large disks and machines with large amounts of fast RAM, so data mining opens a new market for such hardware. Software providers emphasize competitive advantage ('your competitors are using it, so you had better keep up') as well as the added value it gives to existing databases. Many organizations maintain large databases for processing inventories, billing, and accounting. Creating and maintaining these databases is costly. Now, a comparatively small additional investment in data mining tools promises to uncover the highly profitable 'nuggets' of information hidden in the data. Hardware and software vendors alike are rushing data mining products to market with heavy advertising before it becomes saturated. If a company invests between $50,000 and $100,000 in a data mining package, it may amount to no more than an experiment; customers will not rush to buy new products before those products are shown to hold a clear advantage over the old ones. Here are some current data mining products:
IBM: 'Intelligent Miner'
Tandem: 'Relational Data Miner'
Angoss Software: 'KnowledgeSEEKER'
Thinking Machines Corporation: 'Darwin'
NeoVista Software: 'ASIC'
ISL Decision Systems, Inc.: 'Clementine'
DataMind Corporation: 'DataMind Data Cruncher'
Silicon Graphics: 'MineSet'
California Scientific Software: 'BrainMaker'
WizSoft Corporation: 'WizWhy'
Lockheed Corporation: 'Recon'
SAS Institute: 'SAS Enterprise Miner'
In addition to these integrated packages, there are many specialized products, and many consulting firms specializing in data mining have been established. In this field, the difference between a statistician and a computer scientist is said to be that when a statistician has an idea, he or she writes a paper; a computer scientist starts a company.
Current data mining products share the following features:
-- Attractive graphical user interface
. to a database (query language)
. to a suite of data analysis procedures
-- Windowing environment
. flexible, convenient input
-- point-and-click buttons and dialog boxes
-- Graphical analysis
. sophisticated graphical output
. graphics for massive data sets
. flexible graphical representations (trees, networks, 'flight simulation')
-- Convenient handling of results
These packages are marketed to decision makers as stand-ins for data mining expertise.
The statistical analysis procedures included in current data mining packages are:
. Decision tree induction (C4.5, CART, CHAID)
. Rule induction (AQ, CN2, Recon, etc.)
. Nearest neighbor methods (case-based reasoning)
. Clustering methods (data segmentation)
. Association rules (market basket analysis; see the sketch after these lists)
. Feature extraction
. Visualization
In addition, some include:
. Neural networks
. Bayesian belief networks (graphical models)
. Genetic algorithms
. Support vector machines
. Self-organizing maps (SOM)
. Neuro-fuzzy systems
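To make one of the listed methods concrete, here is a minimal sketch of association-rule mining in the market basket sense. It is written in Python; the transactions, items, and thresholds are all invented for illustration, and real products use far more efficient algorithms (e.g., Apriori) rather than this brute-force enumeration.

```python
from itertools import combinations

# Hypothetical transactions: each is the set of items in one market basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Enumerate rules A -> B over item pairs and report those that clear
# (arbitrary) support and confidence thresholds.
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    for lhs, rhs in ((a, b), (b, a)):
        supp = support({lhs, rhs})
        conf = supp / support({lhs})  # confidence = P(rhs | lhs)
        if supp >= 0.4 and conf >= 0.7:
            print(f"{lhs} -> {rhs}: support={supp:.2f}, confidence={conf:.2f}")
```

With these toy thresholds, the classic 'diapers -> beer' rule is among those reported; rules of this kind are the 'nuggets' the packages advertise.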
Almost none of the packages include:
. Hypothesis testing
. Design of experiments
. Response surface modeling
. ANOVA, MANOVA, etc.
. Linear regression
. Discriminant analysis
. Logistic regression
. Generalized linear models
. Canonical correlation
. Principal components analysis
. Factor analysis
These latter procedures form the core of the standard statistical packages. Thus, most of the methodology in current commercial data mining packages originated and was developed outside the discipline of statistics, while the core methods of statistics have been ignored.

3. Why now? What's the rush?
The idea of learning from data has been around for a long time. Why have people suddenly become so interested in data mining? The main reason is its recent marriage to the field of database management. Data, especially in large quantities, reside in database management systems (DBMS). Traditional DBMS technology has concentrated on OLTP (on-line transaction processing), that is, organizing data so that individual records can be stored and rapidly retrieved. Such systems keep track of inventories, payroll records, billing records, shipping records, and so on.
Recently, the database management community has become increasingly interested in using DBMS for decision support. A decision support system allows statistical queries of data originally collected for transaction processing, for example: 'How many diapers did all of our chain stores sell last month?' Decision support requires a 'data warehouse' structure. A data warehouse unifies, in a common format, the data scattered across an organization's departments into a single central database (often gigabytes in size). Sometimes smaller special-purpose sub-databases, called data marts, are built for particular analyses. Decision support systems of this kind are called OLAP (on-line analytical processing) systems. OLAP is designed for multidimensional analysis: the database is organized by dimensions, which are logical groupings of attributes (variables), so that the data body can be viewed as a high-dimensional contingency table. OLAP supports queries of the following kind:
. Show the total spring sportswear sales of medium-sized stores in large California cities
. Compare them with small-city stores
. Show all items with negative profit margins
If OLAP searching is done manually, the user poses a potentially interesting query; the answer obtained may suggest additional queries, whose answers may in turn suggest further questions. Such an analysis continues until either no more interesting questions come to mind or the analyst runs out of stamina or time. Using OLAP for data mining therefore requires an experienced user who can stay alert and tireless, repeatedly posing wide-ranging queries.
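As a rough illustration of how such dimensional queries look in practice, the following Python sketch uses pandas to answer a toy version of the first two queries above. The table, column names, and figures are all hypothetical; a real OLAP server would answer from precomputed multidimensional aggregates in the warehouse rather than by scanning a flat table.

```python
import pandas as pd

# Hypothetical slice of a sales warehouse: one row per (store, season, department).
sales = pd.DataFrame({
    "state":      ["CA", "CA", "CA", "CA", "TX"],
    "city_size":  ["large", "large", "small", "small", "large"],
    "store_size": ["medium", "medium", "medium", "large", "medium"],
    "department": ["sportswear", "sportswear", "sportswear", "housewares", "sportswear"],
    "season":     ["spring", "spring", "spring", "spring", "fall"],
    "sales":      [120_000, 95_000, 40_000, 70_000, 88_000],
})

# 'Total spring sportswear sales of medium-sized stores in California,
# large cities compared with small cities' as a dimensional query:
mask = (
    (sales["state"] == "CA")
    & (sales["store_size"] == "medium")
    & (sales["department"] == "sportswear")
    & (sales["season"] == "spring")
)
print(sales[mask].groupby("city_size")["sales"].sum())
```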
Data mining can also be performed by a data mining system (software) that needs only vaguely specified directives in order to search automatically for the corresponding patterns and to present the important findings and predictions, or to flag anomalous records (one possible mechanization of this appears in the sketch after the list below). For example:
. What are the characteristics of items with negative profit margins?
. If we decide to market a product, predict its profit margin
. Find the characteristics of items whose profit margins can be accurately predicted
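A sketch of how the first directive might be mechanized: a decision tree searches automatically for the characteristics that separate negative-margin items, and its printed rules are the 'pattern' returned to the user. Here scikit-learn stands in for a package's proprietary engine, and the item features and ground truth are fabricated for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical item table: columns are unit_cost, shelf_days, discount_rate.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
# Invented ground truth: heavily discounted, slow-moving items lose money.
negative_margin = (X[:, 2] > 0.6) & (X[:, 1] > 0.5)

# 'What are the characteristics of items with negative profit margins?'
tree = DecisionTreeClassifier(max_depth=2).fit(X, negative_margin)
print(export_text(tree, feature_names=["unit_cost", "shelf_days", "discount_rate"]))
```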
Not all large databases are commercial. Many arise in science and engineering, usually in connection with automatic data collection by computers, for example:
. Astronomy (sky surveys)
. Meteorology (climate and pollution monitoring stations)
. Satellite remote sensing
. High-energy physics
. Industrial process control
These data could also benefit (in principle) from data mining technology.

4. Is data mining an intellectual discipline?

The current interest in data mining has raised some questions in academia. Data mining is clearly viable as a business, but can it be regarded as an intellectual discipline? It certainly has important connections with computer science. These include:
. Efficient computation of large cross-tabulations (ROLAP)
. Fast multidimensional (k-d tree) searches (see the sketch after this list)
. Off-line precomputation to speed up on-line queries
. Parallelization of on-line queries
. Casting data mining algorithms as DBMS operations
. Disk-based (rather than RAM-based) implementations
. Parallel implementation of basic data mining algorithms
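As an illustration of the k-d search item, here is a minimal sketch using scipy's k-d tree. The data and query are synthetic; the point is that after a one-time build, nearest-neighbor queries run in far less than linear time on average, which is what makes pattern search over large warehouses feasible.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical database of 100,000 records, each a point in 5 dimensions.
rng = np.random.default_rng(1)
points = rng.standard_normal((100_000, 5))

tree = cKDTree(points)          # build the k-d tree once
query = rng.standard_normal(5)  # an incoming record to match

# Find the 3 nearest stored records without scanning all 100,000.
distances, indices = tree.query(query, k=3)
print(indices, distances)
```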
From the perspective of statistical data analysis, we can ask whether data mining methodology itself constitutes an intellectual discipline. So far the answer is: partly yes, partly no. The best-known procedures in the data mining packages come from machine learning, pattern recognition, neural networks, and data visualization. The emphasis has been on 'look and feel' and on quickly capturing market share rather than on actual performance. Most current research in the field concentrates on modifying existing machine learning methods and speeding up existing algorithms.
In the future, however, data mining will almost certainly become an intellectual discipline. Whenever the efficiency of a technology increases by a factor of ten, people are forced to rethink seriously how to apply it. Consider the history of transportation, from walking up through flight: each new mode has been roughly ten times faster than the one before, and each such increase has changed the way we think about travel. Chuck Dickens (former head of computing at SLAC) once said: 'Every time computing power increases by a factor of ten, we should completely rethink how we compute and what we compute.' The corresponding statement might be: 'Every time the amount of data increases by a factor of ten, we should completely rethink how we analyze it.' Both computing power and data volumes have grown by several orders of magnitude since the time when most of the tools in current data mining products were developed. Future data mining methodology will surely be more of an intellectual discipline (and a bigger business).

5. Should data mining be part of statistics?

We have argued above for the intellectual vitality of data mining methodology. But should statistics, as a discipline, be concerned with its development? Should we regard data mining as part of statistics? What would that mean? At the least, it would mean that we should:
. Publish such articles in our journals.
. Teach related topics in our undergraduate courses.
. Direct some of our graduate students toward related research topics.
. Provide rewards (jobs, tenure, prizes) to those who excel in this area.
The answer is not obvious. Historically, statistics has ignored many newly developed methodologies in other data-related fields. The following are examples of such fields; an asterisk (*) marks a methodology that germinated within statistical science but was subsequently ignored by statistics.
1. Pattern recognition* -- CS/engineering
2. Database management -- CS/library science
3. Neural networks* -- psychology/CS/engineering
4. Machine learning* -- CS/AI
5. Graphical models* (Bayesian networks) -- CS/AI
6. Genetic algorithms -- CS/engineering
7. Chemometrics* -- chemistry
8. Data visualization* -- CS/scientific computing
It can be argued that some 'statisticians' have worked in these fields, but it is fair to say that the fields themselves were not embraced (certainly not enthusiastically) by the discipline of statistics.

6. What is statistics?
Since the relationship between statistics and these other subjects that derive knowledge from data is so cold, we have to ask: what, then, is statistics? If being connected with data is not a sufficient condition for a subject to become part of statistics, what is? So far, the field of statistics seems to be defined by a set of tools, namely those we teach in our current graduate courses. Here are some examples:
. Probability theory
. Real analysis
. Measure theory
. Asymptotic theory
. Decision theory
. Markov chains
. Martingales
. Ergodic theory
. ...
The field of statistics thus seems to be defined as the set of problems that can be addressed with these and related tools. Of course, these tools have been very useful in the past and will continue to be in the future. As Brad Efron reminds us: 'Statistics has been the most successful information science. Those who ignore statistics are condemned to reinvent it.'
Some hold that, with data (and its associated applications) growing exponentially while the number of statisticians clearly cannot keep pace with this growth, statistics should concentrate on the highest-quality part of information science, namely probabilistic inference based on mathematics. This is a highly conservative view, and it may well turn out to be the best strategy. However, if we accept it, the role of statisticians in the 'information revolution' will steadily fade (we will be ever fewer actors on that stage). One advantage of this strategy, of course, is that it demands very little innovation from us; we need only carry on as before.
Another view was proposed by John Tukey as early as 1962 [Tukey (1962)]. He argued that statistics should focus on data analysis: the field should be defined by a set of problems rather than a set of tools, namely the problems that pertain to data. If this view were to become mainstream, we would have to make substantial changes in our practice and in our academic programs.
First (and most importantly), we would have to keep up with computing. Where there are data, there is computing. Once we regard computational methodology as a basic statistical tool, rather than as a mere convenience for implementing our existing tools, many data-related fields would no longer be foreign to us; they would become part of our field.
Taking computing seriously means more than simply using statistical packages, important as that is. If computation becomes a basic research tool, our students should clearly learn the relevant computer science: numerical linear algebra, numerical and combinatorial optimization, data structures, algorithm design, machine architecture, programming methodology, database management, and parallel architectures and programming. We would also have to expand our curricula to include the current computationally oriented data analysis methods, most of which were developed outside the discipline of statistics.
If we want to compete with other data-related fields for academic and commercial 'market share', some of our basic paradigms will have to change, and we may have to temper our romance with mathematics. Mathematics, like computing, is only a tool of statistics: a very important one, but not the only one that can legitimize statistical methodology. Mathematics is not equivalent to theory, nor vice versa. Theory has to do with ideas and understanding; mathematics is one very important way of realizing theory, but not the only way. For example, the germ theory of disease contains little mathematics, yet it yields enormous understanding of medical phenomena. We would have to acknowledge that empirical validation, although limited, is nevertheless a form of validation.
We may also have to change our culture. Every statistician who works in other data-related fields is struck by the 'cultural gap' between those fields and statistics. In other fields, 'ideas' count for more than mathematical technique. An inspired idea is presumed valuable, and its ultimate worth is debated only after more detailed validation (theoretical or empirical). The attitude is 'innocent until proven guilty'. This is at odds with the thinking in our field: in the past, if a new method was not proved effective by mathematics, we often disparaged it, or at best did not accept it. That attitude was reasonable when data sets were fairly small and the noise level high. In particular, we should break our habit of disparaging methods that perform well (often in other fields) merely because we do not understand them.

7. Which way to go?

Statistics may now be at a crossroads: we can decide whether to embrace or to reject change. As noted above, both views are persuasive. Although opinions abound, no one knows for sure which strategy will preserve the health and vitality of our field. Most statisticians seem to believe that statistics has less and less influence on information science, but they disagree about what to do about it. The dominant opinion is that we have a marketing problem: our customers and colleagues in other fields do not appreciate our value and importance. This is also the view of the American Statistical Association, our main professional organization. The five-year plan presented by its Strategic Planning Committee (Amstat News, Feb. 1997) contains a section entitled 'Enhancing the prestige and health of our discipline'. It recommends that we:
.
.
(The remainder of the article argues that statistics faces a crisis of both market and talent; that statistics has a role to play in data mining science; and that statistics should engage with data mining rather than abandon it to the computer scientists.)
Reference: Tukey, J. W. (1962). The future of data analysis. Ann. Math. Statist. 33, 1-67.