1 What is data mining?
The most commonly accepted definition of "data mining" is the discovery of "models" for data.
1.1 Statistical Modeling
Statisticians were the first to use the term "data mining."
Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution (e.g., a Gaussian distribution) from which the visible data is drawn.
1.2 Machine Learning
There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.
Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.
The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.
For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them.
Machine learning is suitable when the rules to be mined are this unclear: you simply feed the data to the ML algorithm and let it learn to make the judgment, instead of specifying the process yourself.
1.3 Computational Approaches to Modeling
I talked about two views of modeling before. How do we actually construct a model from data?
There are many different approaches to modeling data. Here we will introduce two types:
1. Summarizing the data succinctly and approximately
2. Extracting the most prominent features of the data and ignoring the rest
The following sections describe these two methods.
1.4 Summarization
One of the most interesting forms of summarization is the PageRank idea, which made Google successful. In this model, the entire complex structure of the Web is summarized by a single number for each page.
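The idea can be sketched with a minimal power-iteration PageRank. The tiny three-page web, the damping factor of 0.85, and the iteration count below are all assumptions for illustration, not from the text:

```python
# Minimal PageRank sketch via power iteration (illustrative only).
damping = 0.85
# Adjacency list: page -> pages it links to (a tiny hypothetical web).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

# Start from a uniform distribution over pages.
rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(50):
    # Each page gets a baseline share, plus contributions from in-links.
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for src, outs in links.items():
        for dst in outs:
            new_rank[dst] += damping * rank[src] / len(outs)
    rank = new_rank
```

After convergence, each page's entire position in the link structure is summarized by one number, its rank.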
Another important form of summarization is clustering. The book presents the example of plotting cholera cases on a map of London. From that simple, hand-drawn plot, the cases formed visible clusters around certain intersections, revealing that people near those intersections (which contained contaminated wells) were more likely to fall ill.
1.5 Feature Extraction
A complex relationship among objects is represented by finding the strongest statistical dependencies among these objects and using only those to represent all statistical connections.
2 Statistical Limits on Data Mining
A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.
However, data mining technology is not always effective. Here we will introduce Bonferroni's principle to help avoid misusing it.
2.1 Total Information Awareness
In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information, in order to track terrorist activity.
The plan was eventually rejected by Congress over privacy concerns, but it serves here as an example for discussing whether data mining technology is actually effective.
2.2 Bonferroni's Principle
Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect that almost anything you find is bogus.
In a situation like searching for terrorists, we expect that there are very few terrorists operating at any one time.
If the mining technique flags a large number of "terrorist events" every day, it is useless: even if a few real events are among them, they are swamped by false positives that investigators cannot tell apart from the real ones.
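Bonferroni-style reasoning can be made concrete with a rough calculation in the spirit of the book's "evil-doers meeting in hotels" example. All the numbers below are illustrative assumptions, not figures from the text:

```python
from math import comb

# Assumed scenario: suspicious behavior = a pair of people who were
# at the same hotel on two different days. How often does this happen
# purely by chance, if everyone behaves randomly?
n_people = 10**9   # people being tracked (assumption)
n_hotels = 10**5   # hotels (assumption)
p_visit = 0.01     # chance a given person visits some hotel on a given day
n_days = 1000      # days of data examined

# Probability two specific people are in the same hotel on a specific day:
# both must go to a hotel (p_visit each) and pick the same one (1/n_hotels).
p_meet = p_visit * p_visit / n_hotels

# Expected number of (pair of people, pair of days) coincidences
# under pure randomness.
expected = comb(n_people, 2) * comb(n_days, 2) * p_meet**2
```

Even with no real conspirators at all, `expected` comes out in the hundreds of thousands, so flagging every such coincidence would bury any real signal.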
3 Things Useful to Know
If you are studying data mining, the following basic concepts are very important:
1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on the running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.
3.1 Importance of Words in Documents
In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.
This is a typical data mining problem: topic-keyword extraction.
The most basic technique is TF.IDF (Term Frequency times Inverse Document Frequency).
Term frequency (TF) refers to the number of times a given word appears in a document. This count is usually normalized (for example, by the count of the most frequent word in the document) to prevent a bias toward long documents.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient.
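The two definitions above can be combined in a few lines. This is a minimal sketch: the toy corpus, the max-count TF normalization, and the base-2 logarithm are all choices made for illustration:

```python
import math

# Toy corpus: each document is a list of words (illustrative data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "power laws appear in web data".split(),
]

def tf_idf(word, doc, corpus):
    # TF: count of the word, normalized by the most frequent word's count.
    counts = {w: doc.count(w) for w in set(doc)}
    tf = counts.get(word, 0) / max(counts.values())
    # IDF: log of (total documents / documents containing the word).
    n_containing = sum(1 for d in corpus if word in d)
    idf = math.log2(len(corpus) / n_containing) if n_containing else 0.0
    return tf * idf
```

A common word like "the" scores low despite its high frequency, because its IDF is small; a word concentrated in one document scores high.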
3.5 The Base of Natural Logarithms
The constant e = 2.7182818… has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^x as x goes to infinity.
e is very useful for simplifying approximate computations, for example:
1. Consider (1 + a)^b, where a is small. We can approximate (1 + a)^b as e^(ab).
2. e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + ···
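Both facts are easy to check numerically. The particular values of a, b, and x below are arbitrary:

```python
import math

# 1. (1 + a)^b ≈ e^(ab) when a is small.
a, b = 0.001, 5000
exact = (1 + a) ** b
approx = math.exp(a * b)
rel_err = abs(exact - approx) / exact  # small relative error

# 2. Partial sum of the Taylor series e^x = sum x^k / k!
x = 1.0
taylor = sum(x**k / math.factorial(k) for k in range(10))
```

With a = 0.001 the approximation is accurate to a fraction of a percent, and ten Taylor terms already match e to several decimal places.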
3.6 Power Laws
There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.
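A small sketch of what "linear on a log-log scale" means: for y = c·x^k, the slope between any two points of (log x, log y) equals the exponent k. The constants c and k below are arbitrary:

```python
import math

# Hypothetical power-law data: y = c * x^k.
c, k = 3.0, -2.0
xs = [1.0, 10.0, 100.0, 1000.0]
ys = [c * x ** k for x in xs]

# On a log-log plot the points fall on a line; every pairwise slope
# of (log x, log y) equals the exponent k.
slopes = [
    (math.log(ys[i + 1]) - math.log(ys[i]))
    / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
```

This is why power laws are usually spotted by plotting data on log-log axes and looking for a straight line.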
This chapter was a summary in nature:
it covered what data mining is, the common methods and ideas, the limitations of mining technology, and some common basic concepts.