My interest in teaching myself machine learning really grew out of my interest in data mining. At heart I have always believed that there is a pattern behind everything, and that different situations simply correspond to different conditions. Finding such a pattern is the most convenient and fastest way to solve a whole class of problems, and as a lazy person I naturally want to solve problems in the most efficient way possible.

Next, I plan to write a series of posts on data mining, mainly following my recent reading of the book "Introduction to Data Mining". On the one hand these will be reading notes; on the other hand I will mix in some of my own thinking and impressions. Having skimmed the book, I found that most of the algorithms it covers were already written up in my earlier machine learning series, so through this book I hope to master the overall data mining process and the practical application of the various algorithms. I also hope to exchange ideas with all of you, make progress together, and become a qualified little miner ~

OK, enough digression; let's get to the subject. Today I mainly want to talk about the concept of data mining and some topics concerning data itself. The content is, relatively speaking, quite theoretical, but a deeper understanding of these things will make your data mining much more purposeful.

==================================================================== Data Mining

What does data mining really do?

The more official definition is: the process of automatically discovering useful information in large data repositories. In other words, as I said above, it is about finding the patterns we are looking for in a large amount of data.

The general process of data mining includes the following aspects:

1. Data preprocessing

2. Data mining

3. Post-processing

First, data preprocessing. This step is needed because data mining usually involves a relatively large amount of data, which may come from different sources in different formats, and some values may be missing or invalid. If such 'dirty' data is fed directly into our model, it can easily cause the computation to fail or produce a model of poor quality, so data preprocessing is an indispensable step in every data mining process. It must be said that preprocessing usually takes up a large share of the time in a data mining project, but it is well worth it; we will discuss it in more detail below.

Data mining and post-processing are relatively easy to understand. After preprocessing the data, we usually construct features and feed them into specific models, use some criterion to judge the performance of the different models or model combinations, and finally settle on the model most suitable for us. Post-processing then takes the patterns we have discovered and applies them, or presents them in an appropriate form.

Finally, let's talk about the task of data mining.

I have always used one word to represent the goal of data mining: 'pattern'. So what does that mean concretely?

The first is what we call a predictive task.

Here we are given certain attributes of an object and use them to predict a specific target attribute. If the target attribute is discrete, we usually call the task 'classification'; if it is continuous, we call it 'regression'.

The other is what we call a descriptive task.

This means identifying potential patterns of association between data. For example, two items may be strongly correlated. Here I have to mention the beer-and-diapers story that big data articles love to tell: by analyzing sales data, a retailer found that men who bought diapers usually also bought some beer, so the merchant could sell the two items together to improve performance. I personally suspect the story is fabricated, but it does help illustrate a strong correlation between two items. Another important descriptive task is cluster analysis, which we use very, very frequently in everyday data mining: its aim is to find closely related groups of observations so that, without any labels, the data can be divided into suitable categories for analysis or dimensionality reduction. Descriptive tasks also include anomaly detection, which is roughly the inverse of clustering: where clustering gathers similar data together, anomaly detection picks out the points that lie too far from the crowd.

These are the basic concepts of data mining, including its definition, process, and tasks. A clear understanding of them will help us carry out future data mining work in a standardized way while keeping a very clear sense of purpose.

==================================================================== Data

Next, let's talk about data itself.

First, the types of data.

A so-called data set is usually a collection of data objects, while a so-called data object is described by a set of attributes that capture the object's basic characteristics.

Let's look first at data objects. An attribute is a property or characteristic of an object, and it may vary from object to object or over time. How we describe an attribute depends on its type; the most common attribute types are the following four:

1. Nominal type. The values of this attribute serve only to distinguish different objects and carry no other meaning, such as a name or ID.

2. Ordinal type. The values of this attribute provide enough information to order the objects, such as a score or street number.

3. Interval type. Differences between values of this attribute are meaningful, for example temperatures in Celsius or Fahrenheit.

4. Ratio type. Both differences and ratios of values of this attribute are meaningful, such as absolute temperature or age.

Nominal and ordinal attributes are commonly referred to as categorical attributes, while interval and ratio attributes are called quantitative attributes. Since a nominal attribute is used only to distinguish objects, any one-to-one transformation can be applied to it. Since an ordinal attribute's values carry ordering information, any transformation of it must preserve that order. For an interval attribute, only differences between values are meaningful, so any linear transformation is allowed. For a ratio attribute, ratios of values are meaningful, so the only acceptable transformations are multiplications by a constant, which leave ratios unchanged.

Looking now at data sets, the most important general characteristics of a data set are dimensionality, sparsity, and resolution, all of which are relatively easy to understand.

Dimensionality can be understood as the number of attributes each data object has. Higher dimensionality often means more concentrated information, but when it is too high it puts great pressure on our computation, so if a data set's dimensionality is too high we need to perform dimensionality reduction.

Sparsity: some data sets have many attributes, yet for most objects most attribute values are 0; we call such data sparse. Sparsity is not necessarily a disadvantage, because we then only need to store and process the non-zero entries, and some algorithms exploit this; the SVM, for example, is also known as a sparse kernel machine.

Resolution usually refers to the fact that observing at different scales can yield different results. For example, observing at a scale of 10^-10 m you can only see that matter is composed of molecules and atoms, but at 10^-15 m you find that atoms can be further divided into nuclei and electrons, and going further down you reach quarks; so you need to choose the right resolution for your purpose.

Second, the quality of data

We all know that in the process of data acquisition, measurement errors arise for all sorts of reasons. The random part of the error is what we often call noise, and the systematic part is what we call an artifact. These errors can have a significant impact on our analysis, so a high-quality data source is often a prerequisite for success in data mining. When we don't have one, we can try to improve the quality of the data by certain means, which is what we will talk about next.

Third, data preprocessing

The main purpose of data preprocessing is to improve data quality, and thereby improve our data mining results, reduce cost, and increase efficiency. The main approaches fall into two kinds: selecting the data objects and attributes needed for the analysis, and creating or changing attributes. Next we introduce several common preprocessing methods in turn.

1 Aggregation

Aggregation is easy to understand: it means combining related or similar data objects, and it is often used in the data exploration phase. For example, suppose you enter a competition to predict customer purchase behavior and are given each customer's behavior over some preceding period. Analyzing every single day separately is usually not very meaningful; instead you would typically summarize each customer's behavior within a certain time window, or aggregate the customer's actions on the target product. In addition, aggregation can change the resolution of the data to suit different data mining goals.
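As a minimal sketch of window-based aggregation (the event log, customer names, and the choice of weekly windows are all hypothetical), summing each customer's spending per ISO week might look like:

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy event log: (customer_id, day, amount_spent).
events = [
    ("alice", date(2024, 1, 1), 20.0),
    ("alice", date(2024, 1, 3), 15.0),
    ("bob",   date(2024, 1, 2), 40.0),
    ("alice", date(2024, 1, 9), 5.0),
]

def aggregate_weekly(events):
    """Sum spending per customer per ISO week (a coarser resolution)."""
    totals = defaultdict(float)
    for customer, day, amount in events:
        week = day.isocalendar()[1]  # ISO week number of that day
        totals[(customer, week)] += amount
    return dict(totals)

print(aggregate_weekly(events))
# alice's two week-1 purchases collapse into one row of 35.0
```

The same pattern scales to any window (day, month, session): the window id becomes part of the grouping key.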

2 Sampling

Sampling means selecting some of the data objects from the data set, according to certain rules, and working with those instead. The common case is that the amount of data is so large that it is hard to handle all of it, so you sample a manageable subset to verify the feasibility of your model. The most common method is random sampling, but if the data we are dealing with is imbalanced, we usually have to use stratified sampling, because plain random sampling is likely to drown out the rare samples.
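Stratified sampling can be sketched in a few lines of pure Python; the function name, class labels, and toy data here are all hypothetical:

```python
import random
from collections import defaultdict

def stratified_sample(items, labels, fraction, seed=0):
    """Sample the same fraction from every class, so rare classes survive."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(fraction * len(group)))  # keep at least one per class
        sample.extend(rng.sample(group, k))
    return sample

# Imbalanced toy data: 95 "normal" objects, 5 "rare" ones.
items = list(range(100))
labels = ["normal"] * 95 + ["rare"] * 5
subset = stratified_sample(items, labels, fraction=0.1)
# Unlike a plain 10% random sample, the rare class is guaranteed to appear.
```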

3 Dimensionality reduction

The goal of dimensionality reduction is to reduce the number of dimensions of the data set so as to lower our computational complexity. The simplest method is to remove invalid or irrelevant features. Beyond that, we also have mathematical methods for dimensionality reduction, such as principal component analysis (PCA) and singular value decomposition (SVD).
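As an illustration of PCA via SVD, here is a minimal sketch (assuming NumPy, on synthetic data whose variance is concentrated along one direction):

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components.

    Minimal PCA sketch: center the data, take the SVD, and keep the
    directions of largest variance.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
# 200 points in 3-D that mostly vary along the direction (3, 2, 1).
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 2.0, 1.0]]) \
    + 0.1 * rng.normal(size=(200, 3))
Z = pca(X, n_components=2)
print(Z.shape)  # (200, 2)
```

The first retained component captures almost all of the variance here, which is exactly why dropping the remaining dimensions costs little.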

4 Feature subset selection

Dimensionality reduction does help us remove some redundant features, but often redundant features are not something we can filter out by experience alone. In that case, on the one hand, we rely on algorithms that compute feature importance to filter features, such as some of the tree-based algorithms. On the other hand, if computational resources allow, we can try different feature combinations and select the combination that works best for our final data mining task. There are also algorithms that assign weights to features and can be used to filter them that way, such as support vector machines.
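The brute-force "try different combinations" idea can be sketched as follows; the scoring function here is a hypothetical stand-in for whatever criterion (validation accuracy, information gain, ...) your actual task uses:

```python
from itertools import combinations

def best_subset(features, score):
    """Exhaustively try every non-empty feature combination and keep the
    one the (hypothetical) score function likes best. Only feasible for
    a handful of candidate features."""
    best, best_score = None, float("-inf")
    for r in range(1, len(features) + 1):
        for combo in combinations(features, r):
            s = score(combo)
            if s > best_score:
                best, best_score = set(combo), s
    return best

# Toy score: pretend "age" and "income" are informative, and every extra
# feature carries a small complexity penalty.
useful = {"age", "income"}
score = lambda combo: len(useful & set(combo)) - 0.1 * len(combo)
print(best_subset(["age", "income", "id", "zip"], score))  # {'age', 'income'}
```

In practice the exhaustive loop is replaced by greedy forward/backward search once the number of features grows.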

5 Feature creation

I believe anyone doing data mining regards features as the most important thing in data mining. To be honest, the right features and feature combinations often matter more than a so-called more advanced algorithm, and can improve your results very intuitively and quickly. Feature creation of course includes the feature selection process mentioned above. Beyond that, we sometimes create new features, for example by processing existing ones: using the square of a current feature value as a new feature can expose a quadratic relationship between the data and the target variable. There is also mapping the data into a new space; the most common example is Fourier analysis, which maps a time-domain signal to the frequency domain, letting you find regularities in seemingly chaotic data.
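To see how mapping to a new space exposes structure, here is a naive discrete Fourier transform sketch (the signal and its frequency are made up for illustration): in the frequency domain, the hidden period of a sampled sine wave shows up as a single sharp peak.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive discrete Fourier transform: magnitude of each frequency bin."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A signal completing exactly 5 cycles over 64 samples.
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)
peak = max(range(1, n // 2), key=lambda k: mags[k])
print(peak)  # 5 -- the hidden frequency pops out in the new space
```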

6 Discretization and binarization

Discretization and binarization are among the most common operations in everyday data mining. First, discretization: some continuous attributes can be converted to categorical attributes according to certain criteria. For example, for a numeric attribute like age, we might define under 18 as minor, 18 to 30 as young, 30 to 50 as middle-aged, and over 50 as elderly. What needs attention here is the number of groups and the criteria used to split them; common choices are equal-width and equal-frequency discretization, or a scheme chosen to fit the actual situation. Binarization is even easier to understand: a binary attribute needs no explanation, and a categorical attribute with more than two values can be represented by a combination of several binary variables.
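A sketch of the age discretization described above, plus binarization of the resulting category (the cut-points follow the text; the helper names are my own):

```python
def discretize_age(age):
    """Map a continuous age to a category using the cut-points from the
    text: <18 minor, <30 young, <50 middle-aged, otherwise elderly."""
    bins = [(18, "minor"), (30, "young"), (50, "middle-aged")]
    for upper, label in bins:
        if age < upper:
            return label
    return "elderly"

def binarize(category, categories):
    """Represent a categorical value as a tuple of 0/1 indicator variables."""
    return tuple(int(category == c) for c in categories)

cats = ["minor", "young", "middle-aged", "elderly"]
print(discretize_age(25))                  # young
print(binarize(discretize_age(25), cats))  # (0, 1, 0, 0)
```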

7 Variable transformation

Variable transformation involves two situations. The first is a simple functional transformation of the values; here, as long as the attribute is ordinal, whatever transformation you apply must preserve order. The second is normalization and standardization: normalization usually means your algorithm requires variables to fall within a certain interval, so values need to be scaled into that interval; standardization is used to prevent attributes with larger values from dominating the result, converting the data to a new variable with mean 0 and standard deviation 1.
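Both rescalings can be sketched in a few lines (the income values are toy data for illustration):

```python
import math

def min_max_normalize(values):
    """Scale values linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

incomes = [20_000, 35_000, 50_000, 95_000]
print(min_max_normalize(incomes))  # smallest maps to 0.0, largest to 1.0
z = standardize(incomes)
# mean(z) is ~0 and std(z) is ~1, so income no longer dwarfs small-valued
# attributes in distance computations
```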

Fourth, similarity and dissimilarity measures

Similarity and dissimilarity are very important measures in data mining, especially for clustering algorithms and anomaly detection, which rely on them to assign classes and judge anomalies. Let us now introduce some common similarity and dissimilarity measures.

The most common dissimilarity measure is distance. First there is the Minkowski distance, defined for two n-dimensional points x and y as

d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^(1/r)

Obviously r = 2 gives our most commonly used Euclidean distance; r = 1 gives the Manhattan distance, i.e. the sum of the distances along each dimension; and as r tends to infinity we get the supremum distance, the largest difference across all dimensions. Distances under these different norms can serve as a standard of dissimilarity between data: the greater the distance, the greater the dissimilarity.
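A minimal implementation covering the three special cases of r:

```python
def minkowski(x, y, r):
    """Minkowski distance between two equal-length vectors: r=1 is
    Manhattan, r=2 is Euclidean, r=float('inf') is the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))             # 7.0  (Manhattan)
print(minkowski(x, y, 2))             # 5.0  (Euclidean)
print(minkowski(x, y, float("inf")))  # 4    (supremum)
```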

There are also two common similarity measures, the Jaccard coefficient and cosine similarity. The Jaccard coefficient is defined as

J = f_11 / (f_01 + f_10 + f_11)

where f_11 counts the attributes on which both objects take the value 1, and f_01 and f_10 count those on which exactly one of them does.

The Jaccard coefficient is typically used for objects with asymmetric binary attributes, because it considers only the attributes where at least one object is non-zero; this prevents sparse objects from being judged similar merely because they share zeros on almost all attributes.
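A sketch for binary vectors (toy market-basket style data):

```python
def jaccard(x, y):
    """Jaccard coefficient of two binary vectors: 1-1 matches divided by
    the positions where at least one vector is 1 (0-0 matches ignored)."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)  # f01 + f10
    return f11 / (f11 + mismatches)

# Two sparse market baskets over 10 items.
x = (1, 0, 0, 1, 0, 0, 0, 0, 1, 0)
y = (1, 0, 0, 0, 0, 0, 0, 0, 1, 0)
print(jaccard(x, y))  # 2 / 3 -- the seven shared zeros do not inflate it
```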

Since similarity usually falls within the range 0 to 1, it is natural to think of a trigonometric function to characterize it, and cosine similarity is defined as

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

That is, the dot product of the two vectors divided by the product of their norms, which gives the cosine of the angle between them: when the two vectors point in the same direction the similarity reaches its maximum of 1, and when they are perpendicular the similarity is at its minimum of 0.
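And a corresponding sketch of cosine similarity:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot product divided by
    the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity((1, 2), (2, 4)))  # ~1.0 (same direction)
print(cosine_similarity((1, 0), (0, 1)))  # 0.0  (perpendicular)
```

Note that only direction matters: (1, 2) and (2, 4) are maximally similar even though their magnitudes differ, which is why cosine similarity is popular for comparing documents of different lengths.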

That is the whole of our first article, mainly covering some basic concepts of data mining along with the types, quality, and preprocessing of data and the measurement of similarity. With a more intuitive picture of the work ahead and of the objects we will face, and a very clear understanding of the data, we will be able to take to data mining like a duck to water.