1 What is data mining?
The most commonly accepted definition of "data mining" is the discovery of "models" for data.
1.1 Statistical Modeling
Statisticians were the first to use the term "data mining."
Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution (e.g., a Gaussian distribution) from which the visible data is drawn.
1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.
Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.
The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.
For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them.
Machine learning is suitable when the rules to be mined are this unclear: you simply feed the data to the ML algorithm and let it learn to make the judgment, instead of specifying the process yourself.
1.3 Computational Approaches to Modeling
I talked about two views of modeling before. How do we actually construct a model from data?
There are many different approaches to modeling data. Here we will introduce two types:
1. Summarizing the data succinctly and approximately
2. Extracting the most prominent features of the data and ignoring the rest
The following sections describe these two methods.
1.4 Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. In this model, the entire complex structure of the Web is summarized by a single number for each page.
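The idea can be sketched with a minimal power-iteration PageRank. The tiny three-page web, the damping factor of 0.85, and the iteration count below are all assumptions for illustration, not from the text:

```python
# Minimal PageRank sketch via power iteration (illustrative only).
damping = 0.85
# Adjacency list: page -> pages it links to (a tiny hypothetical web).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

# Start from a uniform distribution over pages.
rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):
    # Each page gets a baseline share, plus contributions from in-links.
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for src, outs in links.items():
        for dst in outs:
            new_rank[dst] += damping * rank[src] / len(outs)
    rank = new_rank
```

After convergence, each page's entire position in the link structure is summarized by one number, its rank.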
Another important form of summarization is clustering. The book presents the example of plotting cholera cases on a map of London. From that simple, hand-drawn plot, the cases formed visible clusters around certain intersections, revealing that people near those intersections (which contained contaminated wells) were more likely to fall ill.
1.5 Feature Extraction
A complex relationship among objects is represented by finding the strongest statistical dependencies among these objects and using only those to represent all statistical connections.
2 Statistical Limits on Data Mining
A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.
However, data mining technology is not always effective. Here we will introduce Bonferroni's principle to help avoid misusing it.
2.1 Total Information Awareness
In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information, in order to track terrorist activity.
The plan was eventually rejected by Congress over privacy concerns, but it serves here as an example for discussing whether data mining technology is actually effective.
2.2 Bonferroni's Principle
Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect that almost anything you find is bogus.
In a situation like searching for terrorists, we expect that there are very few terrorists operating at any one time.
If the mining technique flags a large number of "terrorist events" every day, it is useless: even if a few real events are among them, they are swamped by false positives that investigators cannot tell apart from the real ones.
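Bonferroni-style reasoning can be made concrete with a rough calculation in the spirit of the book's "evil-doers meeting in hotels" example. All the numbers below are illustrative assumptions, not figures from the text:

```python
from math import comb

# Assumed scenario: suspicious behavior = a pair of people who were
# at the same hotel on two different days. How often does this happen
# purely by chance, if everyone behaves randomly?
n_people = 10**9   # people being tracked (assumption)
n_hotels = 10**5   # hotels (assumption)
p_visit = 0.01     # chance a given person visits some hotel on a given day
n_days = 1000      # days of data examined

# Probability two specific people are in the same hotel on a specific day:
# both must go to a hotel (p_visit each) and pick the same one (1/n_hotels).
p_meet = p_visit * p_visit / n_hotels

# Expected number of (pair of people, pair of days) coincidences
# under pure randomness.
expected = comb(n_people, 2) * comb(n_days, 2) * p_meet**2
```

Even with no real conspirators at all, `expected` comes out in the hundreds of thousands, so flagging every such coincidence would bury any real signal.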
3 Things Useful to Know
If you are studying data mining, the following basic concepts are very important:
1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on the running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.
3.1 Importance of Words in Documents
In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.
This is a typical data mining problem: topic-keyword extraction.
The most basic technique is TF.IDF (Term Frequency times Inverse Document Frequency).
Term frequency (TF) refers to the number of times a given word appears in a document. This count is usually normalized (for example, by the count of the most frequent word in the document) to prevent a bias toward long documents.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient.
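The two definitions above can be combined in a few lines. This is a minimal sketch: the toy corpus, the max-count TF normalization, and the base-2 logarithm are all choices made for illustration:

```python
import math

# Toy corpus: each document is a list of words (illustrative data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "power laws appear in web data".split(),
]

def tf_idf(word, doc, corpus):
    # TF: count of the word, normalized by the most frequent word's count.
    counts = {w: doc.count(w) for w in set(doc)}
    tf = counts.get(word, 0) / max(counts.values())
    # IDF: log of (total documents / documents containing the word).
    n_containing = sum(1 for d in corpus if word in d)
    idf = math.log2(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf
```

A common word like "the" scores low despite its high frequency, because its IDF is small; a word concentrated in one document scores high.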
3.5 The Base of Natural Logarithms
The constant e = 2.7182818… has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^x as x goes to infinity.
e is very useful for simplifying approximate computations, for example:
1. Consider (1 + a)^b, where a is small. We can approximate (1 + a)^b as e^(ab).
2. e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + ···
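Both facts are easy to check numerically. The particular values of a, b, and x below are arbitrary:

```python
import math

# 1. (1 + a)^b ≈ e^(ab) when a is small.
a, b = 0.001, 5000
exact = (1 + a) ** b
approx = math.exp(a * b)
rel_err = abs(exact - approx) / exact  # small relative error

# 2. Partial sum of the Taylor series e^x = sum x^k / k!
x = 1.0
taylor = sum(x**k / math.factorial(k) for k in range(10))
```

With a = 0.001 the approximation is accurate to a fraction of a percent, and ten Taylor terms already match e to several decimal places.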
3.6 Power Laws
There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.
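A small sketch of what "linear on a log-log scale" means: for y = c·x^k, the slope between any two points of (log x, log y) equals the exponent k. The constants c and k below are arbitrary:

```python
import math

# Hypothetical power-law data: y = c * x^k.
c, k = 3.0, -2.0
xs = [1.0, 10.0, 100.0, 1000.0]
ys = [c * x ** k for x in xs]

# On a log-log plot the points fall on a line; every pairwise slope
# of (log x, log y) equals the exponent k.
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i]))
    / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
```

This is why power laws are usually spotted by plotting data on log-log axes and looking for a straight line.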
This chapter was a summary in nature:
it covered what data mining is, the common methods and ideas, the limitations of mining technology, and some common basic concepts.