With the rapid development of database technology and the widespread use of databases, the volume of data stored in them keeps growing. In the field of data mining, scientific methods are needed to reduce the running time of mining algorithms and so make data mining more efficient.
1 The concept of data mining
Knowledge discovery in databases, also called data mining, is a hot topic in both database research and artificial intelligence. Data mining is the process of finding previously unknown and potentially valuable information in a database that holds a large amount of data. It is a decision support process based on pattern recognition, artificial intelligence, machine learning, databases, visualization, statistics and other technologies. It analyzes enterprise data automatically, reasons over it, and mines out potential patterns that help decision makers adjust their strategies and make the right decisions.
The process of discovering potentially valuable information consists of three steps: the first is data preparation, the second is the mining itself (finding the rules), and the third is the expression and interpretation of the mining results. Data mining can interact with a knowledge base or with the user.
Data mining looks for regularities in a large amount of data through data preparation, rule finding, and rule expression and interpretation. Data preparation selects data from the data sources and integrates it into the data set used for mining. Rule finding discovers the rules contained in that data set. Result expression and interpretation presents the rules that were found.
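The three steps described above (data preparation, rule finding, result expression) can be illustrated with a minimal sketch. The function names `prepare`, `mine_pairs` and `express`, the `min_support` threshold and the toy records are all assumptions made for illustration; the paper does not prescribe any particular algorithm, and simple pair counting stands in here for a real association analysis.

```python
from collections import Counter
from itertools import combinations


def prepare(sources):
    """Step 1: select non-empty records from every source and merge them into one data set."""
    return [frozenset(record) for source in sources for record in source if record]


def mine_pairs(dataset, min_support):
    """Step 2: find item pairs that occur together in at least min_support records."""
    counts = Counter()
    for record in dataset:
        counts.update(combinations(sorted(record), 2))
    return {pair: count for pair, count in counts.items() if count >= min_support}


def express(rules):
    """Step 3: present the discovered regularities in readable form, most frequent first."""
    ordered = sorted(rules.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} and {b} occur together in {count} records" for (a, b), count in ordered]


if __name__ == "__main__":
    # Hypothetical data sources, for illustration only.
    store_a = [["bread", "milk"], ["bread", "butter", "milk"]]
    store_b = [["milk", "bread"], ["butter"], []]
    for line in express(mine_pairs(prepare([store_a, store_b]), min_support=2)):
        print(line)
```

Keeping the three steps as separate functions mirrors the separation in the text: the mining step never touches the raw sources, and the presentation step never touches the data set.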
Data mining tasks include cluster analysis, association analysis, specific group analysis, classification analysis, and evolutionary analysis.
2 Characteristics and nature of data mining
According to the conventional view, the key difference between conventional data analysis and data mining is this: conventional data analysis focuses on cross-tabulated reports, descriptive statistics, hypothesis testing and the like, whereas data mining focuses on four kinds of problems, namely prediction, classification, clustering and association. In the broad view, any extraction of information from a database is called data mining; seen this way, data mining is business intelligence. In the narrow view, data mining presupposes that the data has already been cleaned and transformed into a form suitable for mining. Knowledge extraction is then carried out on this data set of fixed form, and the result is expressed in an appropriate knowledge pattern for the subsequent analysis and decision-making work. From the above analysis, the author defines data mining as the process of mining and refining knowledge from a data set.
3 Sampling methods for data mining
Sampling is a mature statistical technique that has been studied for the last hundred years, and random sampling is one form of it. In the field of data management, the effectiveness of random sampling has been described many times. Random sampling can capture the basic characteristics of the data in a small subset, so that the subset represents the whole. Similar or approximate query results can be obtained from this sample set, and the sample set can also be used for data mining. In recent years sampling has been applied in many fields with very good results, which shows that its use is becoming more and more widespread.
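As a small illustration of obtaining an approximate query result from a sample, the sketch below estimates an average from a simple random sample instead of scanning all values. The function name `approximate_mean`, the synthetic population and the sample size are assumptions chosen for the example, not part of the original paper.

```python
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(values, sample_size)
    return statistics.fmean(sample)


if __name__ == "__main__":
    # Synthetic population, for illustration only.
    population = [float(i % 100) for i in range(100_000)]
    exact = statistics.fmean(population)
    estimate = approximate_mean(population, 1_000)
    print(f"exact mean: {exact:.2f}, estimate from 1% sample: {estimate:.2f}")
```

The estimate is computed from one percent of the data; how close it comes to the exact answer depends on the sample size and on the variability of the data.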
Sampling methods and classification: Depending on whether every data item has the same probability of being selected, sampling methods can be divided into two types, biased sampling and uniform sampling. In biased sampling the selection probabilities of different elements may differ; in uniform sampling every element has the same probability of being selected, so that samples of the same size are produced with the same probability. The two classic designs for uniform sampling are Bernoulli sampling and reservoir sampling, and these two methods are the basis of all other sampling methods.
Bernoulli sampling is a uniform sampling method. Its main features are that it takes little time and is simple to operate. Reservoir sampling produces a uniform sample of size K: when the N-th element of the data stream arrives, it is selected with probability K/N, and when the sample set would grow beyond K, one sample is removed from it at random. The inclusion probability of every element is the same. Reservoir sampling is a very important random uniform sampling method that was carried over from the traditional setting into the database domain. Its space requirement is fixed and its time complexity is linear, O(N), which makes it well suited to mining in a data stream environment.
A successful sampling technique must ensure sampling quality. From the standpoint of improving sampling quality, three types of sampling strategy have been adopted. The first is progressive sampling, which begins with a small sample and then gradually increases the sampling rate or sample size until the sample is accurate enough. The second estimates the characteristics of the data set from an experimental sample set or a pre-evaluation and samples on that basis. The third extracts specific data features for a specific application, rather than producing a sample set that can serve a variety of applications.
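The two classic designs named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `rng` parameter and the one-pass formulation of reservoir sampling (accept the N-th element with probability K/N and evict a randomly chosen resident) are assumptions chosen to match the description in the text.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def bernoulli_sample(stream: Iterable[T], p: float,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep each element independently with probability p; the sample size is not fixed."""
    rng = rng or random.Random()
    return [item for item in stream if rng.random() < p]


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform sample of at most k elements in one pass, using O(k) memory."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        elif rng.random() < k / n:
            # The n-th element is accepted with probability k/n
            # and replaces a resident chosen uniformly at random.
            reservoir[rng.randrange(k)] = item
    return reservoir


if __name__ == "__main__":
    rng = random.Random(42)
    print(reservoir_sample(range(1_000_000), 5, rng))
    print(len(bernoulli_sample(range(1_000_000), 0.001, rng)))
```

Bernoulli sampling needs no memory beyond the sample itself but yields a sample whose size varies from run to run, while reservoir sampling fixes the sample size in advance without needing to know the length of the stream.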
4 Data mining methods in SAS
SAS/EM (Enterprise Miner) can work with data marts, data warehouses and business intelligence reporting tools. It provides data sampling tools, data acquisition tools, data mining tools, data screening tools, data mining processes, data variable conversion tools and data mining evaluation tools.
First, data sampling. Data sampling draws from the enterprise's large volume of data a subset of sample data with which to explore the problem, rather than using all of the data. During data sampling the quality of the data must be guaranteed: the sampled data must be valid, authentic, complete and representative. Only then can the later analysis and research yield results that reflect real regularities.
Second, exploration of data features, with preprocessing analysis and sub-processing analysis. Once a sample data set is available, check whether it matches the earlier assumptions, whether trends and regularities are evident, whether there are data states that had not been anticipated, and whether the factors are related to one another. These questions are the first things to explore. Visualization is the most ideal way to explore and analyze the data.
Third, technology selection and data adjustment. Once the problem to be solved has become clearer, it should be quantified further. With the problem quantified, the data set can be examined against the requirements of the problem to see whether it fits them, and data can be deleted or added as needed. New understanding will arise during the data mining process; generating or combining new variables on that basis allows the state of the data to be described effectively and fully.
5 Conclusion
With the rapid development of computer science, data mining has become an important tool. This paper analyzes the concept, characteristics, nature and sampling methods of data mining in detail, in the hope of contributing to the optimization of the data mining computation process.
Data mining: A study of concepts and sampling methods