What is data mining?

Data mining is the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from large amounts of data. In a broader view, data mining is the process of "digging" interesting knowledge out of large volumes of data stored in databases, data warehouses, or other repositories. Data mining is also referred to as knowledge discovery in databases (KDD), although strictly speaking it is one step of the knowledge discovery process, which consists of the following steps: (1) data cleansing, (2) data integration, (3) data selection, (4) data transformation, (5) data mining, (6) pattern evaluation, and (7) knowledge representation. Data mining can interact with the user or with a knowledge base.
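As a rough illustration of the seven steps listed above, here is a minimal pipeline sketch in Python. It assumes the data sits in a pandas DataFrame, and the cleaning, selection, and transformation policies shown (dropping rows with missing values, dropping a hypothetical record_id column, min-max scaling) are illustrative choices, not prescribed by the text.

```python
import pandas as pd

def kdd_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative skeleton of the knowledge-discovery steps listed above."""
    # (1) Data cleansing: drop records with missing values (one simple policy).
    data = raw.dropna().copy()
    # (2) Data integration: in practice, merge data from several sources;
    #     here a single source is assumed, so there is nothing to join.
    # (3) Data selection: keep only columns relevant to the task
    #     ("record_id" is a hypothetical identifier column).
    data = data[[c for c in data.columns if c != "record_id"]]
    # (4) Data transformation: e.g. scale numeric columns to [0, 1].
    numeric = data.select_dtypes("number")
    data[numeric.columns] = (numeric - numeric.min()) / (numeric.max() - numeric.min() + 1e-9)
    # (5) Data mining: apply a mining algorithm (clustering, classification, ...).
    # (6) Pattern evaluation and (7) knowledge representation would follow,
    #     typically with interestingness measures and visualization.
    return data
```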
Not every information discovery task counts as data mining. For example, looking up individual records with a database management system, or finding particular Web pages through an Internet search engine, are tasks in the field of information retrieval. Although these tasks are important and may involve sophisticated algorithms and data structures, they rely mainly on traditional computer science techniques and on obvious features of the data to build index structures that organize and retrieve information efficiently. That said, data mining techniques have also been used to enhance the capabilities of information retrieval systems.

The origins of data mining

Necessity is the mother of invention. In recent years data mining has attracted great attention in the information industry, mainly because huge amounts of data are available, the data can be used widely, and there is an urgent need to turn that data into useful information and knowledge. The information and knowledge obtained can be applied in many areas, including business management, production control, market analysis, engineering design, and scientific exploration.
Data mining draws on ideas from the following areas: (1) sampling, estimation, and hypothesis testing from statistics; and (2) search algorithms, modeling techniques, and learning theory from artificial intelligence, pattern recognition, and machine learning. Data mining has also quickly adopted ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval. Some further areas play an important supporting role. In particular, database systems are needed to provide efficient storage, indexing, and query processing. Techniques originating in high-performance (parallel) computing are often important for handling massive data sets, and distributed techniques also help, becoming critical when the data cannot be gathered in one place for processing.

What data mining can do

1) Data mining can perform the following kinds of analysis:
· Classification
· Estimation
· Prediction
· Affinity grouping, or association rules
· Clustering
· Description and visualization
· Complex data type mining (text, Web, images, video, audio, etc.)
2) Categories of data mining
The first six analysis methods above can be divided into two categories: direct data mining and indirect data mining.
· Direct data mining
The goal is to use the available data to build a model that describes one specific variable (which can be understood as an attribute of a database table, i.e. a column) in terms of the remaining data.
· Indirect data mining
No specific target variable is singled out to be described by the model; instead, relationships are established among all the variables.
· Classification, estimation, and prediction belong to direct data mining; the latter three (association rules, clustering, and description/visualization) belong to indirect data mining.
3) Introduction to the analysis methods
· Classification
First, select a training set whose records have already been assigned to classes; then use data mining classification techniques to build a model on this training set and apply it to classify records that have not yet been classified.
Example:
A. Classifying credit card applicants as low, medium, or high risk
B. Assigning customers to predefined customer segments
Note: the number of classes is fixed and defined in advance.
· Estimation
Estimation is similar to classification, except that classification deals with discrete outputs while estimation handles continuous output values; the classes in classification are fixed in number and predefined, whereas the quantity produced by estimation is not predetermined.
Example:
A. Estimating the number of children in a household from its purchasing patterns
B. Estimating a household's income from its purchasing patterns
C. Estimating the value of a property
In general, estimation can be used as a preliminary step for classification: given some input data, estimation yields the value of an unknown continuous variable, which is then classified according to a predetermined threshold. For example, in a bank's home loan business, estimation is used to give each customer a score between 0 and 1, and the loan is then assigned to a grade depending on a threshold (see the sketch after this list).
· Prediction
In general, prediction works through classification or estimation: a model is built by classification or estimation and then used to predict unknown variables. In that sense, prediction does not necessarily need to be treated as a separate category. The purpose of prediction is to forecast the future value of an unknown variable, and such a forecast takes time to verify; only after a certain period does it become clear how accurate the prediction was.
· Affinity grouping, or association rules
Determine which things tend to occur together.
Example:
A. Supermarket customers who buy A often also buy B, i.e. A => B (association rules)
B. Customers who buy A will buy B after some period of time (sequence analysis)
· Clustering
Clustering groups records so that similar records fall into the same cluster. The difference between clustering and classification is that clustering does not rely on predefined classes and does not require a training set.
Example:
A. A cluster of particular symptoms may indicate a particular disease
B. Customers who rent dissimilar types of VCDs may belong to different subcultural groups
Clustering is usually the first step in data mining. For example, for a question like "what kind of promotion works best for which customers?", it may be more effective to first cluster the whole customer base into groups and then answer the question separately for each cluster.
· Description and visualization
How the results of data mining are represented.
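The sketch below (referenced in the estimation item above) illustrates how estimation and classification can be combined: a model scores each customer between 0 and 1, and predefined thresholds turn the score into a low/medium/high risk class. The synthetic data, the scikit-learn model, and the 0.33/0.66 thresholds are all illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training set: two features (e.g. income, debt ratio) and a known label
# (1 = defaulted, 0 = repaid). In practice this is the pre-classified training set.
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 1] - X_train[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Estimation step: score new customers with a continuous value in [0, 1].
X_new = rng.normal(size=(5, 2))
scores = model.predict_proba(X_new)[:, 1]

# Classification step: map the continuous score to predefined classes by threshold.
def risk_class(score: float) -> str:
    if score < 0.33:
        return "low"
    elif score < 0.66:
        return "medium"
    return "high"

print([(round(s, 2), risk_class(s)) for s in scores])
```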

Association rules in Data mining

1. What are Association rules
Before going into the details of association rules, let's look at an interesting story: "diapers and beer."
A supermarket noticed an interesting phenomenon: diapers and beer sold strikingly well when placed together, and this odd arrangement increased the sales volume of both. This is not a joke but a real case from Wal-Mart stores in the US, and it has become a classic business story. Wal-Mart runs one of the world's largest data warehouse systems. In order to understand precisely the buying habits of customers in its stores, Wal-Mart performed market-basket analysis of customer purchases, aiming to learn which products customers tend to buy together. The Wal-Mart data warehouse gathers the detailed raw transaction data from each of its stores, and data mining methods were applied to analyze this data. The unexpected finding was: "the item most often bought together with diapers is beer!" Extensive follow-up investigation and analysis revealed the behavioral pattern hidden behind "diapers and beer": in America, young fathers often go to the supermarket after work to buy diapers for their babies, and 30%~40% of them also buy beer for themselves. The reason is that wives often urge their husbands to buy diapers for the children on the way home from work, and the husbands pick up their favorite beer while buying the diapers.
By conventional thinking, diapers and beer have nothing to do with each other; without applying data mining technology to its huge volume of transaction data, Wal-Mart could never have discovered this valuable pattern hidden in the data.
Data association is an important class of discoverable knowledge in databases. If there is some regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, temporal associations, and causal associations. The purpose of association analysis is to uncover the hidden network of associations in a database. The associations among data are often unknown in advance, and even when suspected they are uncertain, so the rules produced by association analysis carry a degree of confidence rather than certainty. Association rule mining discovers interesting associations or correlations between itemsets in large amounts of data. Agrawal et al. first posed the problem of mining association rules among itemsets in customer transaction databases in 1993, and many researchers have since studied it. Their work includes optimizing the original algorithm, for example by introducing random sampling and parallelism to improve mining efficiency, and promoting the application of association rules. Association rule mining is an important topic in data mining that has been researched extensively in recent years.
2. The association rule mining process, classification, and related algorithms
2.1 The process of mining association rules
The association rule mining process consists of two phases: the first phase identifies all frequent itemsets (high-frequency itemsets) in the data, and the second phase generates association rules from these frequent itemsets.
The first phase of association rule mining must identify all frequent itemsets (large itemsets) in the original data set. "Frequent" means that the itemset must appear in a certain proportion of all records. The frequency of occurrence of an itemset is called its support. Taking a 2-itemset containing the two items A and B as an example, the support of {A, B} is given by formula (1): support(A, B) = (number of transactions containing both A and B) / (total number of transactions). If this support is greater than or equal to the chosen minimum support threshold, {A, B} is called a frequent itemset. A k-itemset that satisfies the minimum support is called a frequent k-itemset, generally written large k or frequent k. The algorithm then generates large k+1 from the large k itemsets, until no longer frequent itemsets can be found.
The second phase of association rule mining generates the association rules. Rules are generated from the frequent k-itemsets found in the previous step, subject to a minimum confidence threshold: if the confidence of a rule satisfies the minimum confidence, the rule is accepted as an association rule. For example, for the rule A => B generated from the frequent itemset {A, B}, the confidence is given by formula (2): confidence(A => B) = support(A, B) / support(A). If this confidence is greater than or equal to the minimum confidence, then A => B is an association rule.
In the Wal-Mart case, applying association rule mining to the records in the transaction database requires setting the two thresholds of minimum support and minimum confidence, say min_support = 5% and min_confidence = 70%. An association rule useful to this supermarket must satisfy both conditions. If the rule "diapers => beer" found during mining satisfies them, i.e. support(diapers, beer) >= 5% and confidence(diapers => beer) >= 70%, the rule can be accepted. Here support(diapers, beer) >= 5% means that, among all transactions, at least 5% show diapers and beer being bought together; confidence(diapers => beer) >= 70% means that at least 70% of all transactions containing diapers also contain beer. Therefore, whenever a customer buys diapers in the future, the supermarket can recommend beer at the same time. This recommendation is based on the "diapers => beer" association rule, because the supermarket's past transaction records support the consumption pattern "most transactions that include diapers also include beer."
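To make formulas (1) and (2) concrete, here is a minimal sketch that computes support and confidence directly from a small, made-up transaction list (the transactions are illustrative and are not the Wal-Mart data).

```python
# Each transaction is the set of items bought together.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "bread"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset` (formula 1)."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent) (formula 2)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.75
```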
As the description above suggests, association rule mining is usually applied to records whose attributes take discrete values. If the values in the original database are continuous, an appropriate data discretization should be carried out before association rule mining (in effect, mapping each value interval to a single discrete value). Discretization is an important step before mining, and how reasonably it is done directly affects the quality of the resulting association rules.
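For illustration, here is a minimal sketch of such a discretization step, assuming pandas is used; the income values and bin edges are made up.

```python
import pandas as pd

incomes = pd.Series([1800, 2300, 3100, 4500, 5200, 9800])

# Map each continuous income value to a discrete interval (label),
# so that association rule mining can treat it as a categorical item.
income_bins = pd.cut(
    incomes,
    bins=[0, 2000, 4000, 6000, float("inf")],
    labels=["low", "lower-middle", "upper-middle", "high"],
)
print(income_bins.tolist())
# ['low', 'lower-middle', 'lower-middle', 'upper-middle', 'upper-middle', 'high']
```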
2.2 Classification of association rules
Depending on the point of view, association rules can be classified as follows:
1. Based on the kinds of variables handled in the rules, association rules can be divided into Boolean and numeric.
Boolean association rules handle discrete, categorical values and express relationships between such variables; numeric association rules can be combined with multidimensional or multilevel association rules to handle numeric fields, either by dynamically partitioning them into segments or by operating on the raw data directly. A numeric association rule may of course also involve categorical variables. For example: gender = "female" => occupation = "secretary" is a Boolean association rule; gender = "female" => avg(income) = 2300 involves the numeric attribute income and is therefore a numeric association rule.
2. Based on the level of abstraction of the data in the rules, association rules can be divided into single-level and multilevel association rules.
In single-level association rules, the variables do not take into account that the real data may lie at different levels of abstraction; multilevel association rules take this layering fully into account. For example, "IBM desktop computer => Sony printer" is a single-level association rule on detail data, whereas "desktop computer => Sony printer" is a multilevel association rule relating a higher-level category to a detail-level item.
3. Based on the number of data dimensions involved, association rules can be divided into single-dimensional and multidimensional.
Single-dimensional association rules involve only one dimension of the data, such as the items purchased by customers, whereas multidimensional association rules involve more than one dimension. In other words, a single-dimensional rule deals with relationships within a single attribute, while a multidimensional rule deals with relationships between attributes. For example, "beer => diapers" involves only the items a customer purchases, whereas gender = "female" => occupation = "secretary" involves two fields and is therefore a rule over two dimensions.
2.3 Algorithms for mining association rules
1. Apriori algorithm: finding frequent itemsets using candidate itemsets
The Apriori algorithm is one of the most influential algorithms for mining frequent itemsets for Boolean association rules. Its core is a recursive, level-wise procedure based on the two-phase frequent-set idea; in terms of the classification above, it handles single-dimensional, single-level, Boolean association rules. All itemsets whose support is greater than the minimum support are called frequent itemsets, or frequency sets.
The basic idea of the algorithm is first to find all the frequency sets, i.e. the itemsets that occur at least as often as the predefined minimum support. Strong association rules are then generated from the frequency sets; these rules must satisfy both the minimum support and the minimum confidence. The frequency sets found in the first step are used to produce the desired rules, generating all rules whose items come from a single frequent itemset and which have exactly one item on the right-hand side. Of the rules generated, only those whose confidence exceeds the user-specified minimum are kept. To generate all the frequency sets, a recursive, level-wise method is used (a minimal sketch follows this list of algorithms).
The possible generation of a huge number of candidate sets, and the potential need for repeated scans of the database, are the two main drawbacks of the Apriori algorithm.
2. Partition-based algorithm
Savasere et al. designed a partition-based algorithm. The algorithm first divides the database logically into disjoint blocks. It considers one block at a time and generates all the frequency sets for that block, then merges the resulting frequency sets to produce all possible frequency sets, and finally computes the support of these itemsets. The block size is chosen so that each block fits in main memory, and each phase scans the database only once. The correctness of the algorithm is guaranteed by the fact that every possible frequency set must be frequent in at least one block. The algorithm is highly parallelizable: each block can be assigned to a processor to generate frequency sets. After each round of frequency-set generation, the processors communicate to produce the global candidate k-itemsets. This communication is usually the main bottleneck in the algorithm's execution time; on the other hand, the time each individual processor needs to generate its frequency sets is also a bottleneck.
3. FP-tree frequency set algorithm
To address the inherent drawbacks of the Apriori algorithm, J. Han proposed a method for mining frequent itemsets that does not generate candidates: the FP-tree frequency set algorithm. Using a divide-and-conquer strategy, after the first scan the frequent items in the database are compressed into a frequent pattern tree (FP-tree) that still retains the itemset association information; the FP-tree is then decomposed into a set of conditional databases, each associated with a length-1 frequent itemset, and these conditional databases are mined separately. When the volume of raw data is very large, this can be combined with partitioning so that each FP-tree fits in main memory. Experiments show that FP-growth adapts well to rules of different lengths and offers a substantial efficiency improvement over the Apriori algorithm.
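As referenced above, here is a minimal sketch of the Apriori idea: level-wise candidate generation followed by support counting. It is a simplified illustration rather than a full implementation (for example, it omits the subset-based pruning of candidates), and the example transactions and minimum-support value are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) mapped to their supports."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {i: support(i) for i in items if support(i) >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets whose union has size k.
        prev = list(frequent)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Support counting and pruning against the minimum support.
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"diapers", "beer", "milk"},
                {"diapers", "beer"},
                {"diapers", "bread"},
                {"beer", "chips"},
                {"diapers", "beer", "bread"}]
print(apriori(transactions, min_support=0.4))
```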
3. Applications of this field at home and abroad
3.1 Applications of association rule mining at home and abroad
At present, association rule mining is widely used in Western financial enterprises, where it can successfully predict the needs of bank customers. Once this information is available, banks can improve their marketing. Banks are constantly developing new ways of communicating with customers; for example, they attach to their ATMs information about products that the users of those machines are likely to be interested in. If the database shows that a customer with a high credit limit has changed address, that customer has probably just purchased a larger home and may therefore need a higher credit limit, a higher-end new credit card, or a home improvement loan, and these offers can be mailed to the customer together with the credit card statement. When a customer calls for advice, the database can give the telemarketing representative powerful support: the representative's screen can display the customer's characteristics as well as the products the customer is likely to be interested in.
At the same time, some well-known e-commerce sites also benefit greatly from association rule mining. These online shopping sites use mined association rules to set up bundles of products that users tend to purchase together. Other shopping sites use them to arrange cross-selling, so that customers who buy a particular product are shown an advertisement for a related product.
In China, however, "massive data, scarce information" is the embarrassment commonly faced by commercial banks after large-scale data consolidation. Most of the databases deployed in the financial industry today only support low-level functions such as data entry, queries, and statistics; they cannot yet uncover the useful information hidden in the data, such as analyzing the data to discover its patterns and characteristics, identifying the financial and commercial interests of a customer, consumer group, or organization, or observing trends in the financial market. It is fair to say that the research and application of association rule mining technology in China is not yet deep or widespread.
3.2 Recent research on mining association rules
Because many application problems are more complicated than the supermarket basket problem, a large body of research has extended association rules in different directions and incorporated more factors into the mining methods, enriching the application domains of association rules and broadening their support for management decisions: category hierarchies among attributes, temporal relationships, multi-table mining, and so on. In recent years, research on association rules has focused on two aspects: extending the class of problems that classical association rules can solve, and improving the efficiency of classical association rule mining algorithms and the interestingness of the resulting rules.

How data mining technology is implemented

Technically, the work process can be divided into several key areas: data extraction, data storage and management, and data presentation.
• Data extraction
Data extraction is the entry point for getting data into the warehouse. Because the data warehouse is an independent data environment, data must be imported from online transaction processing systems, external data sources, and offline storage media through the extraction process. Data extraction technology mainly involves interconnection, replication, incremental loading, transformation, scheduling, and monitoring. In the future, development in this area will focus on integrating these functions so that the system can adapt to changes in the data warehouse itself or in the data sources, making it easier to manage and maintain.
• Data storage and management
The way data is organized and managed in the warehouse is what distinguishes it from a traditional database, and it also determines how the data is presented externally. Data warehouse management involves far larger data volumes than traditional transaction processing, and the data accumulates rapidly over time. Data storage and management in the warehouse must solve problems such as how to manage very large amounts of data, how to process them in parallel, and how to optimize queries. At present, the solution offered by many database vendors is to extend the functionality of the relational database, turning an ordinary relational database into a server suitable for acting as a data warehouse.
• Data presentation
The main ways in which data are presented are:
Queries: predefined queries, dynamic queries, OLAP queries, and intelligent decision-support queries;
Reports: relational data tables, complex tables, OLAP tables, and various summary reports;
Visualization: presenting complex data and their interrelations with easy-to-understand scatter plots, histograms, pie charts, network diagrams, interactive visualization, dynamic simulation, and computer animation;
Statistics: averages, maxima, minima, expectations, variances, summaries, sorting, and other statistical analyses;
Mining: using data mining and related methods to obtain information about relationships and patterns in the data.
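As a small illustration of the "statistics" style of presentation listed above (averages, maxima, minima, variances, sorting), here is a minimal sketch using pandas; the sales table and its column names are made-up assumptions.

```python
import pandas as pd

# Hypothetical detail table in the warehouse.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "east"],
    "amount": [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Summary report: average, maximum, minimum, and variance per region, sorted by mean.
report = (sales.groupby("region")["amount"]
               .agg(["mean", "max", "min", "var"])
               .sort_values("mean", ascending=False))
print(report)
```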

The integrated development of data mining and data warehousing

When data mining and the data warehouse work together, on the one hand the warehouse can accommodate and simplify important steps of data mining, improving its efficiency and capability and ensuring the completeness and breadth of the data sources used for mining; on the other hand, data mining has become a very important and relatively independent feature and tool of data warehouse applications.
Data mining and data warehousing are developing in an integrated and interactive way, and the prospects for both academic research and applications are exciting. This integration is the joint result of the work of data mining experts, data warehouse engineers, and industry experts; it is also the path by which large numbers of enterprise end users, eager to move from being "slaves" of their databases to being their "masters", can make that transformation.

Statistics and data Mining

Statistics and data mining share a common goal: discovering structure in data. Indeed, because their goals are so similar, some people (especially statisticians) regard data mining as a branch of statistics. That is not a realistic view, because data mining also draws on ideas, tools, and methods from other fields, especially computer science, such as database technology and machine learning, and some of the problems it focuses on are quite different from those statisticians care about.
  
1. The nature of statistics

  
There is little point in attempting an overly broad definition of statistics. Although it could be done, it would attract many objections. Instead, I want to focus on the characteristics that distinguish statistics from data mining.
The difference relates to the last paragraph of the previous section: statistics is a relatively conservative discipline, with a tendency to become ever more precise. That is not a bad thing in itself; greater precision helps avoid mistakes and find the truth, but taken too far it becomes harmful. This conservative outlook stems from the view that statistics is a branch of mathematics, a view I disagree with: statistics is indeed grounded in mathematics (just as physics and engineering are grounded in mathematics without being branches of it), but it is closely bound to other disciplines as well.
The mathematical background and the quest for precision reinforce the tendency to prove a method before adopting it, rather than relying on experience as computer science and machine learning do. This means that researchers from other fields working on the same problems as statisticians sometimes put forward a clearly useful method that cannot (yet) be proved. Statistical journals tend to publish mathematically proven methods rather than ad hoc ones. As a synthesis of several disciplines, data mining has inherited its experimental attitude from machine learning. This does not mean data mining practitioners do not care about rigor; it simply means that if a method does not produce results, it is discarded.
It is the statistical literature that displays (or perhaps exaggerates) the mathematical precision of statistics, and it also displays statistics' emphasis on inference. Although some branches of statistics focus on description, a glance at the statistical literature shows that its core question is how to infer properties of a population from a sample. Of course, this is often the focus of data mining as well. A particular property of data mining is that it deals with large data sets, which means that for reasons of feasibility we often obtain only a sample yet need to describe the large data set the sample was drawn from. In many data mining problems, however, the entire population is available: all of a company's employee records, all the customers in the database, all of last year's transactions. In that case inference is pointless (for example, about the average of a year's business), because the observed value simply is the parameter being estimated. A statistical model may be built using a series of probability statements (for example, parameters close to 0 are removed from the model), but when the whole population is available such statements become meaningless in data mining. Here we can simply apply an evaluation function: how adequately the model represents the data. The point is that what often matters is the model's fitness for purpose rather than its provable validity, and in many cases this makes model discovery easier. For example, simple structural properties of the fit measure (for example, ones that support a branching search) are often exploited when searching for rules, but these properties are lost once probability statements are required.
A third point of overlap between statistics and data mining concerns the "model", which plays a central role in modern statistics. Perhaps the word "model" has taken on too many meanings. On the one hand, a statistical model is built on the relationships between the analysis variables; on the other hand, such models sometimes serve as overall descriptions of the data without claiming explanatory meaning. A regression model for credit card spending might include income as an explanatory variable, because high income is generally thought to lead to high spending; that would be a theoretical model (albeit one resting on shaky theory). In contrast, one can also search stepwise over variables that might have explanatory value and obtain a model of great predictive power, even though no sensible explanation can be given for it. (The latter is a common concern when models are discovered through data mining.)
There are other ways of distinguishing types of statistical models, but I will not discuss them here. The point I want to stress is that modern statistics is model-based; in computation, the selection criterion is secondary to building a good model. In data mining this is not quite the case: there, the criterion plays the central role. (There are, of course, exceptions within statistics. Gifi's school of nonlinear multivariate analysis is one of them. For example, Gifi writes that for some of the most commonly encountered MVA (multivariate analysis) problems, one can start either from a model or from a technique. As seen in section 1.1, classical multivariate statistical analysis is model-based; however, in many cases the choice of model is not obvious, choosing an appropriate model may not be possible, and the most suitable method of computation may not be feasible. In such situations one takes the other point of view and applies a family of purpose-built techniques to answer the MVA question, without worrying about model choice or optimal discrimination.)
In data mining, by contrast, the criterion plays a more central role, which is not surprising given that data mining has inherited from computer science and related disciplines. The sheer size of the data sets often means that traditional statistical criteria are not suitable and have to be redesigned. In particular, criteria that can be updated adaptively and incrementally are often necessary when estimates must be updated as data points arrive one at a time. Although some such criteria have been developed within statistics, more come from machine learning (hence the word "learning").
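As an illustration of criteria that are updated one data point at a time, here is a minimal sketch of an incremental running-mean update, a generic streaming technique that avoids revisiting the whole data set. It is offered only as an example of the kind of adaptive computation meant above, not something specified by the text.

```python
def running_mean(stream):
    """Update the mean one observation at a time, without storing the data."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update rule
        yield mean

# Usage: process a (potentially huge) stream of values point by point.
for m in running_mean([10, 12, 9, 11]):
    print(m)
```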
  
2. The nature of data mining

  
Because the foundations of statistics were laid before the invention and development of computers, the commonly used statistical tools contain many methods that can be carried out by hand. For many statisticians, therefore, 1,000 records already count as large. But that is nothing compared with, say, a large UK credit card company's 350,000,000 transactions or 200,000,000 long-distance calls per day. Faced with that much data, methods that are merely "implementable by hand in principle" are clearly not enough. This means that the computer (the very thing that makes big data possible) is essential to how the data is analyzed and processed. It becomes impractical for analysts to deal with the data directly; instead, the computer acts as a necessary filter between the analyst and the data. This is another reason why data mining pays particular attention to criteria: although separating the analyst from the data is necessary, it obviously introduces new problems. There is a real danger here, namely that spurious, unintended patterns may mislead the analyst, which I will discuss below.
This is not to say that computers are unimportant in modern statistics. They matter, but not because of data volume: computationally intensive methods such as the bootstrap, randomization tests, iterative estimation, and the fitting of more suitable complex models are only practical because of the computer. The computer has greatly expanded the horizons of traditional statistical modeling and has also driven the rapid development of new tools.
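As a small illustration of one of the computationally intensive methods mentioned (the bootstrap), the sketch below estimates a 95% confidence interval for a mean by resampling; the data and the number of resamples are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)   # an arbitrary sample

# Bootstrap: resample with replacement many times and look at the spread of the statistic.
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(2000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% bootstrap CI = ({low:.3f}, {high:.3f})")
```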
Now consider the possibility that unexpected patterns in the data are distortions. This is a matter of data quality: the conclusions of any data analysis depend on the quality of the data. GIGO means "garbage in, garbage out", and it applies everywhere. No matter how clever a data analyst is, he is unlikely to find gems in garbage. This is especially true for large data sets, above all when one is looking for fine, small, or deviant patterns. When a person is searching for a pattern that occurs once in ten thousand cases, a deviation in the second decimal place matters. An experienced analyst is alert to the most common problems, but there are far too many possible mistakes.
Such problems occur at two levels. The first is the micro level, that of individual records. For example, particular attribute values may be missing or recorded incorrectly. I know of one case in which the miner did not know that missing values had been coded as 99 and treated them as real data. The second is the macro level, where the whole data set is distorted by some selection mechanism. Traffic accidents provide a good example: the more serious or deadly an accident, the more accurately it is recorded, while minor or no-injury accidents are recorded less accurately, and in fact a high proportion of them are not recorded at all. This produces a distorted picture, which may lead to erroneous conclusions.
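Following the anecdote about missing values coded as 99, here is a minimal sketch of making such a sentinel code explicit before analysis; the column and values are invented for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([34, 27, 99, 45, 99, 52])   # 99 is a code for "missing", not a real age

# Make the missing-value code explicit so it cannot be mistaken for data.
clean_ages = ages.replace(99, np.nan)
print(clean_ages.mean())   # 39.5 -- computed over the 4 real values only
```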
Statistics rarely concerns itself with real-time analysis, but data mining problems often require it. For example, banking transactions happen every day, and no one can wait three months for a possible fraud analysis. A similar problem arises when the population changes over time; my own group has seen a clear example of how applications for bank credit vary with time, the competitive environment, and economic fluctuations.
  
3. Discussion

  
Data mining is sometimes treated as a one-off exercise. That is a misunderstanding: it should be seen as an ongoing process (however the data set is defined). One examines the data from one point of view, explains the results, then checks from a related point of view, and so on. The point is that, except in rare cases, we seldom know in advance which kind of pattern will be meaningful. The essence of data mining is to discover unexpected patterns, and such unexpected patterns are equally likely to be discovered in unexpected ways.
Related to the idea of data mining as a process is the question of when results count as novel. Many of the results of data mining are things we expect, and in retrospect can explain. But the fact that a result can be explained after the fact does not negate the value of mining it: without the mining exercise, the idea might never have come up at all. In fact, only structures that can be reconciled with past experience will ultimately be accepted as valuable.
Clearly there are great potential opportunities for data mining. The possibility of discovering patterns in large data sets certainly exists, and the number of large data sets is growing. But the opportunities should not obscure the dangers. All real data sets, even those collected fully automatically, are prone to errors. This is especially true of data sets about people, such as transaction and behavioral data. It goes a long way towards explaining why most of the "unexpected structures" found in data are intrinsically meaningless: they arise from deviations from the ideal process. (Of course, such structures can still be of interest: if something is wrong with the data in a way that interferes with the purpose for which it was collected, it is better to understand it.) Related to this is the question of how to ensure (or at least support the claim) that any observed pattern is "real", i.e. reflects some underlying structure or association rather than a peculiarity of this particular data set that arose by chance, as in a random sample. Scoring methods may be relevant here, but more research by statisticians and data mining practitioners is needed.
