Mining of Massive Datasets: Data Mining

Source: Internet
Author: User
1 What is data mining?

The most commonly accepted definition of "data mining" is the discovery of "models" for data.

 

1.1 Statistical Modeling

Statisticians were the first to use the term "data mining."

Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution (e.g., a Gaussian distribution) from which the visible data is drawn.
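
As a rough illustration of this view, the sketch below fits a Gaussian to a handful of numbers by estimating its mean and standard deviation. The sample values and the function names `fit_gaussian` and `gaussian_pdf` are made up for this example and do not come from the book.

```python
import math
import statistics


def fit_gaussian(samples):
    """Estimate the mean and standard deviation of a Gaussian from samples.

    Uses the population standard deviation, which is the maximum-likelihood
    estimate for a Gaussian.
    """
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu, sigma


def gaussian_pdf(x, mu, sigma):
    """Density of the fitted Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical observations (illustrative values only).
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(samples)
print(f"mean={mu}, std={sigma}, density at mean={gaussian_pdf(mu, mu, sigma):.4f}")
```

Here the "model" is just the two numbers `mu` and `sigma`: once they are known, the whole data set is summarized by the distribution they define.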

 

1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set to train an algorithm of one of the many types they use, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.

For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them.

Machine learning suits problems like this, where the rules to be mined are unclear. You feed the data to an ML algorithm and it makes the judgments for you, so you do not have to spell out the specific process yourself.

 

1.3 Computational Approaches to Modeling

So far I have covered two views of what a model is. How do we actually go about discovering one?

There are many different approaches to modeling data. Here we introduce two:

1. Summarizing the data succinctly and approximately

2. Extracting the most prominent features of the data and ignoring the rest

The following describes the two methods.

 

1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the web is summarized by a single number for each page.
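
To make "a single number for each page" concrete, here is a minimal sketch of PageRank computed by repeated redistribution of rank along links. The four-page graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration; the text above does not specify any of them, and this is not presented as Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to. Every link
    target is assumed to also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web (illustrative only).
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page C ends up with the highest score because three of the four pages link to it, while D, which nothing links to, keeps only the baseline share.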

Another important form of summarization is clustering. The book presents the example of plotting cholera cases on a map of London. With a simple, manual plot, we can see the points form clusters, revealing that people were more likely to get sick near certain intersections.

 

1.5 Feature Extraction

A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.

 

2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.

However, data mining is not always effective. Here we introduce Bonferroni's principle, which helps avoid misuse of the technique.

 

2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and other kinds of information, in order to track terrorist activity.

Of course, the plan was eventually killed by Congress because of privacy concerns. It serves here only as an example for discussing whether data mining is effective.

 

2.2 Bonferroni's Principle

Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.

In a situation like searching for terrorists, we expect that there are few terrorists operating at any one time.
If a data mining technique flags a large number of suspected terrorist events every day, the technique is ineffective, even if a few of those events are real, because the genuine cases are buried among false positives.
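
The calculation that Bonferroni's principle asks for can be sketched in a few lines. The numbers below (one billion people, 1000 days of records, 100,000 hotels, and each person staying in a hotel on 1% of days) are illustrative assumptions in the spirit of the book's hotel example, not figures stated in this summary. The "suspicious event" is a pair of people who were at the same hotel on two different days, and everyone is assumed to behave completely at random.

```python
import math


def expected_coincidences(n_people, n_days, n_hotels, p_visit):
    """Expected number of (pair of people, pair of days) combinations in which
    the two people stayed at the same hotel on both days, purely by chance."""
    # Both people visit some hotel on a given day, and it is the same hotel.
    p_same_hotel_one_day = p_visit * p_visit / n_hotels
    # The same coincidence happens on two given, different days.
    p_two_days = p_same_hotel_one_day ** 2
    pairs_of_people = math.comb(n_people, 2)
    pairs_of_days = math.comb(n_days, 2)
    return pairs_of_people * pairs_of_days * p_two_days


# Assumed, illustrative parameters (not data from a real system).
expected = expected_coincidences(
    n_people=10**9, n_days=1000, n_hotels=10**5, p_visit=0.01
)
print(f"expected purely random coincidences: {expected:,.0f}")
```

Under these assumptions, random behavior alone produces roughly a quarter of a million "suspicious" pairs. If only a handful of real cases exist, almost everything the search reports is bogus, which is exactly the warning of the principle.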

3 Things Useful to Know

If you are studying data mining, the following basic concepts are very important:

1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on the running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.

 

3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.

This is a typical data mining problem: extracting topic keywords.

The most basic technique is TF.IDF (term frequency times inverse document frequency).

Term frequency (TF) refers to the number of times a given word appears in a document. This number is usually normalized so that it is not biased toward long documents.

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word can be obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.
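
The two definitions above combine into a short scoring routine. This is a minimal sketch under two assumptions that the text does not pin down: TF is normalized by the count of the most frequent term in the same document, and the logarithm is taken base 2. The function name `tf_idf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dictionary per document.

    documents is a list of token lists. TF is each term's count divided by
    the largest count in that document (an assumed normalization); IDF is
    log2(number of documents / number of documents containing the term).
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())

    scores = []
    for count in counts:
        max_count = max(count.values()) if count else 1
        scores.append({
            term: (n / max_count) * math.log2(n_docs / doc_freq[term])
            for term, n in count.items()
        })
    return scores


# Toy corpus (illustrative only).
docs = [
    "the pitcher threw the ball".split(),
    "the batter hit the ball".split(),
    "the election was close".split(),
]
scores = tf_idf(docs)
for doc_scores in scores:
    best = max(doc_scores, key=doc_scores.get)
    print(best, round(doc_scores[best], 3))
```

A word such as "the" that appears in every document gets an IDF of zero and so never ranks as a topic keyword, while a word that is frequent in one document and rare elsewhere scores highest.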

 

3.5 The Base of Natural Logarithms

The constant $e = 2.7182818\ldots$ has a number of useful special properties. In particular, $e$ is the limit of $(1 + 1/x)^x$ as $x$ goes to infinity.

The constant e is very useful even without much mathematical background. Here are two ways to use it to simplify computation:

1. Consider $(1 + a)^b$, where $a$ is small. We can approximate $(1 + a)^b$ as $e^{ab}$.

2. $e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + \cdots$
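
A quick numerical check of these identities may help. The sketch below compares the limit expression, the small-a approximation, and a truncated series against Python's built-in exponential. The function names and the specific test values (x = 1,000,000, a = 0.001, b = 500, ten series terms) are arbitrary choices for illustration.

```python
import math


def limit_estimate(x):
    """(1 + 1/x)^x, which approaches e as x grows."""
    return (1 + 1 / x) ** x


def approx_power(a, b):
    """Approximate (1 + a)^b by e^(a*b); reasonable only when a is small."""
    return math.exp(a * b)


def exp_taylor(x, terms=10):
    """Sum the first `terms` terms of 1 + x + x^2/2! + x^3/3! + ..."""
    return sum(x ** k / math.factorial(k) for k in range(terms))


print("limit estimate:", limit_estimate(1e6), "vs e:", math.e)
print("exact (1.001)^500:", 1.001 ** 500, "vs approximation:", approx_power(0.001, 500))
print("series for e^0.5:", exp_taylor(0.5), "vs exp:", math.exp(0.5))
```

The approximation in item 1 degrades as a grows, so it should be treated as a shortcut for small a rather than an identity.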

 

3.6 Power Laws

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.
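
Because a power law $y = c x^a$ becomes the straight line $\log y = \log c + a \log x$ after taking logarithms, its parameters can be recovered with an ordinary least-squares line fit in log-log space. The sketch below does this in plain Python; the function name `fit_power_law` and the synthetic data (generated from c = 1000, a = -2) are assumptions for illustration, not data from the book.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares on (log x, log y).

    All xs and ys must be positive. Returns (c, a).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    a = sxy / sxx                       # slope of the log-log line
    c = math.exp(mean_y - a * mean_x)   # intercept, mapped back from logs
    return c, a


# Synthetic data that follows y = 1000 * x^-2 exactly (illustrative only).
xs = [1, 2, 4, 8, 16]
ys = [1000 * x ** -2 for x in xs]
c, a = fit_power_law(xs, ys)
print(f"c = {c:.2f}, a = {a:.2f}")
```

With real, noisy data the fitted exponent is only an estimate, and a straight-looking log-log plot is a first sanity check rather than proof that a power law holds.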

 

This chapter is introductory in nature.

I covered what data mining is, its common methods and ideas, the limitations of mining techniques, and some common basic concepts.
