Differences between data mining and statistical analysis
"Data Mining is based on statistical analysis, and most statistics analysis methods are used," said the instructor ". I have different points of view. Let's write something for your comments. We used to give the vitality of Data Mining Methods intelligence and regard it as an important development direction of business intelligence. But should statistics be concerned about its development as a discipline. Should we regard it as part of the statistics? What does that mean? At the very least, it indicates that we should: post such articles in our magazines; teach some of this content in our undergraduate courses, some related research topics are taught in our graduate students. Our doctoral courses include the "Multi-Dimensional Statistics" Course, which provides some rewards (jobs, titles, and prizes) to those who are excellent in this field ). The answer is not obvious. In the history of statistics, we have neglected many new methods for development in other data processing-related fields. The following are examples of related fields. * Is a field of methods that sprout in the statistical science, but are subsequently ignored by statistics. 1 pattern recognition *-CS/Engineering 2 database management-CS/Library Science 3 Neural Network *-psychology/CS/Engineering 4 machine learning *-CS/AI5 graphics model * (Beyes Network)) -CS/AI6 genetic engineering-CS/Engineering 7 Chemical Statistics *-chemical 8 data visualization **-CS/scientific computing can be affirmed that some statisticians have already worked in these fields, but to be fair, they are not accepted by our academic circle of statistics, but not by mainstream academic circles. At least I didn't hear any statistical teacher studying neural networks. Since the relationship between the subjects that have obtained knowledge from data and statistics is so cold, we have to ask: 'What is not statistics '. If data connection is not a sufficient reason for a subject to become part of statistics, what is sufficient? 
So far, the definition of statistics seems to depend on a set of tools, namely those we teach in our current graduate courses. Some examples: probability theory, real analysis, measure theory, asymptotics, decision theory, Markov chains, and ergodic theory. The field of statistics seems to be defined as the family of problems that can be addressed with these or related tools. Of course, these tools have been useful in the past and will remain useful in the future. As Brad Efron (Department of Statistics, Stanford University) reminds us: "Statistics is the most successful information science. Those who ignore statistics are condemned to reinvent it."
Some people think that, with data (and its applications) growing exponentially and the number of statisticians clearly unable to keep pace, statistics should concentrate on the part of information science it does best, namely probabilistic inference based on mathematics. This is a highly conservative view, and it may well be the best strategy. However, if we accept it, the role of statisticians in the "information revolution" will surely fade (fewer and fewer actors on this stage). One attraction of this strategy is that it demands very little innovation from us: we only need to stick to established practice.
Another view was put forward by John Tukey [Tukey (1962)] as early as 1962. He argued that statistics should concern itself with data analysis. The field should be defined by its problems rather than its tools, that is, by problems related to data. If this view were to become the mainstream one, we would need to make major changes to our practice and our academic programs. First, and most importantly, we should keep pace with computing. Where there is data, there is computing.
Once we regard computational methods as a basic statistical tool (rather than merely a convenient way to implement our ready-made tools), many fields closely related to data will no longer stand apart from ours; they will be part of our field. This means taking computing tools seriously rather than simply using statistical packages, although that is important too. If computing becomes a basic research tool, our students will undoubtedly need to learn the relevant computer science. That includes numerical linear algebra, numerical and combinatorial optimization, data structures, algorithm design, machine architecture, programming methodology, database management, and parallel systems and programming. We will also have to expand our curriculum to cover current computer-oriented data analysis methods, most of which were developed outside the discipline of statistics.
If we want to compete for academic and commercial market space with other data-related fields, some of our basic paradigms will have to change, and we will have to temper our infatuation with mathematics. Mathematics (like computing) is only a tool for statistics. Although it is a very important tool, it is not the only one that can establish the validity of a statistical method. Mathematics is not the same thing as theory, and theory is not the same thing as mathematics. Theory is, at bottom, about creating understanding, and mathematics, important as it is, is not the only way to achieve that. For example, there is little mathematics in the genetic theory of disease, yet it has greatly improved our understanding of many medical phenomena. We will have to acknowledge that empirical validation, although limited, is indeed a form of validation.
We may also have to change our culture. Every statistician who has worked in other data-related fields has been struck by the "cultural gap" between those fields and statistics. In other fields, ideas matter more than mathematical technique.
An inspired idea is considered valuable in itself, and its ultimate worth is debated only once more detailed validation (theoretical or empirical) is available. The attitude is "innocent until proven guilty." This is at odds with the way we think in our field. In the past, if a new method had not been proved effective mathematically, we tended to disparage it; even when we did not, we would not accept it. That attitude was reasonable when datasets were relatively small and the noise was relatively high. In particular, we should break the habit of disparaging methods that perform well (usually in other fields) but that we do not understand.
In my personal opinion, statistics may now be at a crossroads. We can decide whether to accept change or reject it. As discussed above, both views are persuasive. Although there are many opinions, no one can be sure which strategy will keep our field healthy and vital. Most statisticians seem to think that statistics has less and less influence on information science, but they do not agree on what to do about it. The dominant opinion among us is that we have a marketing problem: our customers and our colleagues in other fields do not understand our value and importance. This is also the view of the American Statistical Association, our main professional organization. The five-year plan report of its strategic planning committee (Amstat News, Feb. 1997) contains a section titled "Enhancing the prestige and health of our discipline", which says, in effect, that statistics is facing a crisis of market and of talent.
Statistics can play a role in the science of data mining. Statisticians should cooperate with data mining rather than abandon it to computer scientists. Some statisticians believe that computing is competing with them for the market, but that is only a surface phenomenon.
Take our own course as an example: the instructor teaches conscientiously, but many students lack a grounding in statistics, which seriously hampers their understanding of the analysis process and its results. Analysis software such as SPSS and SAS is excellent, but the results still need to be interpreted, and that is where the value of statisticians lies. Data mining tools have been more successful at visualization than statistical analysis tools. With BI surging, enterprise data warehouses reaching a certain maturity, and the market for data mining growing, the worries of statisticians are becoming a reality. Data mining is aimed at end users, whereas statistical analysis adds an intermediate conversion step that raises the cost of application.