This 2006 paper cites a lot of good papers, so I read it.
Abstract
The paper poses 10 challenging problems in data mining and offers a high-level guide to where the open problems in the field lie.
The authors wrote the article by consulting some of the most active data mining and machine learning researchers (the organizers of the IEEE ICDM and ACM KDD conferences) for their views on the most important and valuable topics for future data mining research.

1. Developing a Unifying Theory of Data Mining
The current state of data mining research is too ad hoc: many techniques are designed for individual problems, such as classification or clustering, but there is no unifying theory. A theoretical framework that unifies the different data mining tasks (clustering, classification, association rules, etc.) and the different data mining approaches (statistics, machine learning, database systems, etc.) would benefit the field and provide a basis for future research.
Data mining researchers also have an opportunity to attack some long-standing problems in statistics, such as the old problem of avoiding spurious correlations. This is sometimes related to the problem of mining for deep knowledge: the hidden causes behind many observations. In Hong Kong, for example, the air time of a particular star's TV show was strongly correlated with small market crashes in Hong Kong, but it would be too rash to draw conclusions about the cause behind this correlation. Another example: could we discover Newton's laws simply by observing the motion of objects?

2. Scaling Up for High-Dimensional Data and High-Speed Data Streams
One challenge is how to design classifiers for ultra-high-dimensional classification problems: how to build a classifier over millions or even billions of features, which arises especially in text mining and drug safety analysis.
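To make the scale concrete: one standard trick for classifiers over millions of sparse features (common in text mining) is feature hashing, which bounds memory regardless of vocabulary size. The paper does not prescribe this; the sketch below is a minimal illustration, and the perceptron learner and all names are my own choices.

```python
import zlib

N_BUCKETS = 2 ** 20  # fixed memory footprint, no matter how many raw features exist

def hash_features(tokens, n_buckets=N_BUCKETS):
    """Map arbitrarily many string features into a fixed-size sparse vector."""
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode()) % n_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec

def predict(weights, vec):
    """Sparse dot product; ties break toward the positive class."""
    score = sum(weights.get(i, 0.0) * v for i, v in vec.items())
    return 1 if score >= 0 else -1

def train(samples, epochs=5):
    """Simple perceptron over hashed features; the weight vector stays sparse."""
    weights = {}
    for _ in range(epochs):
        for tokens, label in samples:
            vec = hash_features(tokens)
            if predict(weights, vec) != label:
                for i, v in vec.items():
                    weights[i] = weights.get(i, 0.0) + label * v
    return weights
```

The point of the sketch is that neither memory nor update cost depends on the total number of distinct features, only on the bucket count and the sparsity of each example; occasional hash collisions are the price paid.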
Another problem is mining data streams against a huge database. On the one hand there is the processing of the streaming data itself; on the other hand, data mining should be a continuous, online process rather than a one-shot batch job. With high-speed streams the volume of data is enormous: how do we mine incrementally and update the model efficiently so that it remains an accurate model of the current stream?

3. Mining Sequence Data and Time Series
How to effectively classify sequence data and time series, and predict their trends, remains an important open problem.
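As a baseline for the trend question, a trailing moving average is the simplest smoother, and it also exhibits the lag problem that makes denoising time series hard: the smoothed curve trails the true signal. This sketch is illustrative only; the function names and the crude trend rule are mine.

```python
def moving_average(series, window):
    """Trailing moving average: damps noise, but lags the underlying
    signal by roughly (window - 1) / 2 time steps."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def trend(series):
    """Crude trend label on a (smoothed) series: +1 rising, -1 falling, 0 flat."""
    if len(series) < 2:
        return 0
    delta = series[-1] - series[0]
    if delta > 0:
        return 1
    if delta < 0:
        return -1
    return 0
```

A larger window removes more noise but worsens the lag, which is exactly the trade-off the open problem asks to overcome.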
Noise pollution in time series is another problem: how to learn meaningful patterns from noisy data. Filtering the data with signal-processing techniques removes noise but introduces lag, and how to overcome that lag is open. For a noisy time-series prediction system, the key issues include:
- Information/search agents that gather information: the search criteria may be wrong, too broad, or too narrow; information from many sources may be inconsistent; the (meta-)semantics of the information must be analyzed; and the information must be assimilated as input to the prediction agent.
- Learner/miner agents that revise the information-selection criteria: assigning credit or blame from feedback, building rules for the search agents to collect information, and building rules for the information agents to assimilate it.
- Prediction agents that forecast trends: combining qualitative information, and multi-objective optimization with no closed-form solution.

4. Mining Complex Knowledge from Complex Data
Complex knowledge in graphs: how to discover graph motifs and structured patterns in massive data. Non-i.i.d. data (not independent and identically distributed): objects are not independent of one another and are not of a single type, so how do we mine the rich relational structure among objects, as in web pages, social networks, or the metabolic networks within cells? Mining non-relational data: most organizations hold their data as text rather than in databases, along with more complex formats such as images, multimedia, and web data, so data mining methods that go beyond classification and clustering are needed, including better automatic text summarization, and identifying the movement of objects and people from web and wireless data logs to discover useful spatial and temporal knowledge. Knowledge reasoning: how to integrate data mining with knowledge inference, and how to incorporate background knowledge into data mining.
How to relate the results of mining to the real-world decisions they affect: all the miner can do is hand the results back to the user. How to discover the topics that actually interest users.

5. Data Mining in a Network Setting
5.1. Community and social networks: identifying the community structure of social networks (such as topology and clusters) and their dynamic behavior (such as growth factors, robustness, and functional efficiency). The same problems also arise in bioinformatics research.
5.2. Mining in and for computer networks: high-speed mining of high-speed streams
Mining problems in computer (communication) networks: to detect anomalies such as traffic spikes caused by DoS (denial-of-service) attacks or disaster events, service providers must capture IP packets at high link speeds and analyze huge volumes of data (hundreds of GB), so highly scalable solutions are required. A DoS attack must be detected and traced back to identify the attacker, and the packets belonging to the attack traffic must be dropped.

6. Distributed Data Mining and Mining Multi-Agent Data
How to mine multiple heterogeneous data sources: multi-database and multi-relational mining.
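One classical idea in distributed mining is count distribution: each site mines its own partition locally, and only aggregate counts, never raw records, cross the network. The paper does not specify any algorithm; the sketch below is a generic illustration for frequent item pairs, with all names my own.

```python
from collections import Counter
from itertools import combinations

def local_pair_counts(transactions):
    """Each site counts co-occurring item pairs within its own partition."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

def merge_counts(site_counts, min_support):
    """Only the count summaries are merged centrally; globally frequent
    pairs are found without shipping any raw transactions."""
    total = Counter()
    for counts in site_counts:
        total.update(counts)
    return {pair: n for pair, n in total.items() if n >= min_support}
```

The design point: communication cost depends on the number of candidate patterns, not on the (much larger) number of records at each site.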
Adversarial data mining: how data mining systems should cope with opponents who deliberately manipulate the data (as in counterterrorism or spam filtering) to mislead them (for example, into producing false negatives), and how to combine data mining with game theory.

7. Data Mining for Biological and Environmental Problems
How to mine biological data, for example applying data mining to HIV vaccine design, or to DNA sequences together with their chemical properties, three-dimensional structures, and functional properties.
How to understand and make use of the natural environment and its resources, for example mining climate data, or autonomous mobile sensor networks.
Pattern recognition and prediction problems for dynamic temporal behavior in the natural environment arise in: 1) very large-scale systems (such as global climate change and potential "avian influenza" epidemics), and 2) human-centric systems (such as user-adapted human-computer interaction or peer-to-peer trading).
Summing up these issues, there are currently three challenging application areas: bioinformatics, CRM/personalization, and security applications.

8. Data Mining Process-Related Problems
How to improve data mining tools and processes through automation, including how to automate the composition of data mining operations, and how to build methodology into data mining systems that helps users avoid common data mining mistakes and reduces labor cost. How to clean data automatically: data preprocessing accounts for a large share of labor cost, so how can it be reduced, and how can data cleaning be documented systematically? How to combine visual, interactive, and automatic data mining techniques: visualization helps users understand the data and define or refine the mining task, so a theory is needed that supports interactive exploration of large, complex data sets.

9. Security, Privacy, and Data Integrity
The issue of privacy protection in data mining.
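A classical building block here (not specific to this paper) is Warner's randomized response: each respondent randomizes locally, so the miner only ever sees noisy answers, yet population-level statistics remain estimable. A minimal sketch, with parameter names of my own choosing:

```python
import random

def randomized_response(truth, p=0.75):
    """With probability p report the true bit, otherwise report its opposite.
    The analyst never sees the raw answer, only the noisy one."""
    return truth if random.random() < p else 1 - truth

def estimate_rate(noisy_answers, p=0.75):
    """Unbiased estimate of the true proportion pi of 1s:
    E[observed] = p*pi + (1-p)*(1-pi)  =>  pi = (observed - (1-p)) / (2p - 1)."""
    observed = sum(noisy_answers) / len(noisy_answers)
    return (observed - (1 - p)) / (2 * p - 1)
```

Each individual answer is deniable (it may well be a coin flip), while the aggregate estimate converges to the true rate as the sample grows, which is the essential tension in privacy-preserving mining.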
The problem of assessing knowledge integrity: we need measures that assess the knowledge integrity not only of data sets but also of individual models. Open problems include algorithms that compare the knowledge content of different versions of a data set, and how to estimate the effect of data modifications on data mining algorithms.

10. Dealing with Non-Static, Unbalanced, and Cost-Sensitive Data
Data is not static: how do we build time into a learning model, or correct for temporal skew? How to handle unbalanced data: how should data sets that are small and highly unbalanced be handled? Cost-sensitive data: given information about costs and benefits, how do we build a model of overall profit and loss? Different examples give rise to different cost matrices, but the full multi-output cost matrix is unknown; how can the overall model be obtained through partial sampling?

Reference
http://www.cs.uvm.edu/~icdm/10Problems/10Problems-06.pdf