Discussion on Pattern Recognition Technology


Discussion on Pattern Recognition Technology (1)

------ Introduction

In the field of Artificial Intelligence (AI), pattern recognition is perhaps the most challenging technology. Pattern recognition is also called classification, because to recognize a pattern is to classify data. When recognition is mentioned, the most familiar examples are imitations of human perception: visual image recognition and, of course, speech recognition. You may think this is simple; we feel we can effortlessly identify all kinds of things with our eyes. But when you try to implement it as a program on a computer, you will feel very frustrated, no matter how fast your computer is. Indeed, today's computer intelligence, that is, artificial intelligence, is still far inferior to that of a cockroach. The most fundamental reason is that pattern recognition is still at a relatively early stage of development; many recognition problems remain unsolved, and some even assert that there will be no essential leap within 30 years. Of course, the world is always unpredictable, and we need not be so pessimistic: science and technology keep moving forward, and no one can block them. Here I will share my experience in learning and researching pattern recognition. My only purpose is to bring pattern recognition down from its technical altar, so that everyone can learn about it and more people will study it. My knowledge and ability are limited, so this may also help me correct my own mistakes.

 

Pattern recognition has a long history. Before the 1960s it was mainly limited to theoretical research in statistics and lacked the support of a strong mathematical theory. In the 1980s, neural networks and other recognition technologies made breakthroughs, computer hardware advanced by great strides, and pattern recognition came into wide use. Optical character recognition (OCR) was the first successful application of pattern recognition; later applications include DNA sequence analysis, chemical odor recognition, image understanding, face detection, facial expression recognition, gesture recognition, speech recognition, image information retrieval, and data mining.

 

Pattern recognition is a science closely integrated with mathematics, and it applies a great deal of mathematical knowledge. The most basic is probability theory and mathematical statistics; pattern recognition is full of probabilistic and statistical ideas. The recognition rate we so often speak of is really a statement of probability: the probability of successful recognition in tests over a large data volume (strictly speaking, an infinite data volume). The commonly used Bayesian decision classifier is a direct application of the probability formulas. Pattern recognition also uses linear algebra, because linear algebra conveniently expresses things with many features: we generally use a vector to express the characteristics of a thing, and vector calculations naturally rely on linear algebra. A more advanced piece of mathematics is functional analysis, which studies functions and operator theory in infinite-dimensional linear spaces; the Support Vector Machine (SVM) is built on theory from functional analysis, and SVM technology also applies the mathematics of optimization theory. Recently, the multi-dimensional space bionic pattern recognition proposed by Academician Wang Shoujue of the Chinese Academy of Sciences is based on topology.

Pattern recognition is therefore one of the disciplines that applies mathematics most widely. When studying pattern recognition we run into one piece of mathematics after another; sometimes we must pick up our college mathematics books again, and sometimes we must hunt down mathematics we never learned. At such moments you will feel that you are truly doing research, as if you were back in college, and you will realize that learning pattern recognition well takes years, which can make one impatient. Of course, the longer you stick to it, the more valuable you become, because this is a technology that accumulates. It is unlike studying upper-layer applications, where years of experience do not necessarily make you more capable: there, if you do not keep up you are eliminated, and latecomers easily surpass those who studied earlier. Pattern recognition is thus a good choice for people who like to do research.

 

Discussion on Pattern Recognition Technology (2)

------- Probability and statistical analysis methods are applied extensively

Pattern recognition can be divided into statistical pattern recognition and syntactic pattern recognition. Statistical pattern recognition collects statistics on, or learns from, a large number of samples and finally produces a classifier; Bayesian classifiers, neural networks, SVM, and k-NN are all statistical pattern recognition methods. Syntactic pattern recognition is based on definite logical rules, such as shape judgment, syntax-type judgment, and address subdivision. Syntactic pattern recognition can also be called structural pattern recognition; it is generally used in recognition applications whose logic is clear and hard to confuse, and its methods are relatively simple. Most current research therefore concerns statistical pattern recognition, with a focus on machine learning, because people believe that, just as humans need a learning process to recognize new things, computers can likewise learn to recognize as humans do. Neural network technology is based on imitating human learning. All of this is to stress the importance of statistical methods in pattern recognition. In this section we mainly discuss the application of probability theory and statistics in pattern recognition.

 

When it comes to probability and statistics, we must mention Bayesian decision theory, a basic statistical method for solving pattern classification problems. Its basic formula can be described as follows:

 

Probability that a feature is judged to belong to a certain class =

Probability of this feature within that class * Probability of that class / Probability of this feature

In symbols, this is Bayes' formula: P(class | feature) = P(feature | class) * P(class) / P(feature).

 

The formula above is a rearrangement of the conditional probability formula; it is described in words here for easier understanding. To go deeper, consult any theoretical book on pattern recognition: the first part of almost every one covers this material. I once read a lecture by Dr. Lang Xianping, and one sentence impressed me deeply. It roughly says that successful business people always arrange matters so that their probability of success is high, instead of taking risks to speculate on a small probability. The basic principle of Bayesian decision-making is exactly to select the judgment with the higher probability: given a certain feature, whichever class has the higher probability is the class we decide on, so as to minimize the error rate. Real applications are much more complex: with many features and many classes the formulas become complicated, and many parameters must be estimated statistically, so applying Bayesian decision theory is basically a process of computing probabilities and doing statistical analysis. Here we have a basic premise: all statistics must be gathered over large data volumes, because probability presupposes a large amount of data. Statistical pattern recognition is inseparable from this precondition: the sample size used for analysis must be sufficiently large, otherwise the final result is very likely to be worthless.
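To make the formula concrete, here is a minimal sketch in Python of a Bayesian decision for one feature and two classes. All the probability values are hypothetical; in practice they would be estimated statistically from a large sample:

    # Minimal Bayesian decision sketch: pick the class with the highest
    # posterior probability P(class | feature). All numbers are hypothetical
    # estimates that would normally come from statistics over a large sample.
    priors = {"A": 0.7, "B": 0.3}         # P(class)
    likelihood = {"A": 0.2, "B": 0.9}     # P(feature | class)

    # P(feature) = sum over classes of P(feature | class) * P(class)
    evidence = sum(likelihood[c] * priors[c] for c in priors)

    posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}
    decision = max(posterior, key=posterior.get)
    print(posterior, "->", decision)      # posterior: A ~ 0.34, B ~ 0.66 -> decide B

Even though class A is more common a priori, the feature is so much more probable under class B that the decision goes to B, exactly the "choose the higher probability" principle described above.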

 

Two probabilistic models also commonly used in theory are the Markov model and the hidden Markov model (HMM). These are basic theoretical tools in word segmentation and speech recognition, where word frequency statistics supply the basic counts. Both the Markov model and the hidden Markov model are applications of conditional probability over multiple conditions, and both pursue the highest-probability result. Markov models can be divided into the first-order Markov model (bigram model), the second-order Markov model (trigram model), and in general the n-gram model; the higher the order, the more data is needed and the greater the computational complexity. For HMMs, dynamic-programming decoding (the Viterbi algorithm) greatly reduces the computational complexity, so HMMs are widely used; today's speech recognition algorithms are implemented on HMM theoretical models.
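To make the Viterbi idea concrete, here is a small sketch in Python with NumPy. The two-state HMM and all its probabilities are made-up toy values; a real speech or word-segmentation system would estimate them from frequency statistics:

    import numpy as np

    # Hypothetical 2-state HMM; every number here is toy data for illustration.
    states = ["Rain", "Sun"]
    start = np.array([0.6, 0.4])              # P(state at t=0)
    trans = np.array([[0.7, 0.3],             # P(state_t | state_t-1)
                      [0.4, 0.6]])
    emit = np.array([[0.1, 0.4, 0.5],         # P(observation | state)
                     [0.6, 0.3, 0.1]])
    obs = [0, 1, 2]                           # observed symbol indices

    # Viterbi: dynamic programming over the most probable state path.
    v = start * emit[:, obs[0]]               # best path probability ending in each state
    back = []
    for o in obs[1:]:
        scores = v[:, None] * trans           # extend every path by one step
        back.append(scores.argmax(axis=0))    # remember the best predecessor
        v = scores.max(axis=0) * emit[:, o]

    # Trace back from the best final state to recover the full path.
    path = [int(v.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    path.reverse()
    print([states[i] for i in path])

Instead of enumerating all state sequences (exponential in the sequence length), the algorithm keeps only the best path into each state at each step, which is what makes HMM decoding tractable.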

 

Statistical analysis gives us the covariance matrix, which can be applied in the PCA (Principal Component Analysis) dimensionality-reduction method. This is easy to understand: the more features there are, the more complicated the computation and the less accurate the results, so we always look for ways to reduce the feature dimension. The common method is PCA (vector quantization, VQ, is another good dimensionality-reduction method). PCA uses statistics over a large number of samples to find the features with the smallest variance: the smaller a feature's variance, the less it separates the classes and the less it helps classification, so those features can be removed to reduce the feature dimension.
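A minimal PCA sketch with NumPy, under the usual convention that samples are the rows of a matrix; it keeps the directions of largest variance and discards the low-variance ones (the toy data is illustrative):

    import numpy as np

    def pca_reduce(X, k):
        """Project samples (rows of X) onto the k directions of largest variance."""
        Xc = X - X.mean(axis=0)                    # center the data
        cov = np.cov(Xc, rowvar=False)             # covariance matrix of the features
        vals, vecs = np.linalg.eigh(cov)           # eigen-decomposition (ascending order)
        top = vecs[:, np.argsort(vals)[::-1][:k]]  # eigenvectors of the k largest variances
        return Xc @ top                            # reduced k-dimensional features

    # Toy data: 100 samples with 5 features, reduced to 2 dimensions.
    X = np.random.randn(100, 5)
    print(pca_reduce(X, 2).shape)                  # (100, 2)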

 

Machine learning methods such as neural networks are also statistical pattern recognition methods. Machine learning greatly simplifies our statistical work on sample data: an automatic procedure generates a classifier from a large number of samples. In these methods the statistical analysis is so well hidden that you may not recognize them as statistical pattern recognition, but learning from a large number of samples can itself be considered a statistical method. For example, the coefficients of each node in a neural network are formed by repeated correction over a large number of samples according to some algorithm (such as the LMS algorithm), and this correction process can be regarded as a statistical analysis process.

 

Since pattern recognition is inseparable from probability and statistical analysis, before designing a classifier you must first prepare a large, comprehensive set of training samples and test samples covering the various cases. Then statistically analyze the training samples: analyze the samples' characteristics and the distribution of their feature values, obtain the various statistics, and only then determine the pattern recognition method. The test samples are used to check the soundness of the classifier, which is modified according to the problems the test samples expose. This is an iterative process that continues until the classifier reaches its performance goal, as the sketch below illustrates.
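As a sketch of that workflow in Python with NumPy (the data and the deliberately simple nearest-mean classifier are illustrative assumptions, not a method from this article):

    import numpy as np

    # Toy two-class data; large, representative samples are assumed.
    rng = np.random.default_rng(0)
    train_x = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
    train_y = np.array([0] * 200 + [1] * 200)
    test_x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    test_y = np.array([0] * 100 + [1] * 100)

    # "Statistical analysis" step: estimate one mean per class from training data.
    means = np.array([train_x[train_y == c].mean(axis=0) for c in (0, 1)])

    # Test step: classify by the nearest class mean and measure the error rate.
    pred = np.argmin(((test_x[:, None, :] - means) ** 2).sum(axis=2), axis=1)
    print("test error rate:", (pred != test_y).mean())

If the measured error rate misses the performance goal, one would revise the features or the classifier and repeat, which is the iteration described above.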

Discussion on Pattern Recognition Technology (3)

------- High-dimensional space

 

When we represent the features of a thing, there are generally more than three of them, even hundreds. For convenience, feature values are usually expressed as vectors, so the study of pattern recognition involves a great deal of matrix arithmetic. We can think of computations on feature values as operations in a high-dimensional space, and matrix operations express high-dimensional operations conveniently; linear algebra is therefore the mathematical basis for studying pattern recognition. A higher level of mathematical theory is functional analysis, which studies the geometry and analysis of infinite-dimensional spaces.

 

We can easily picture spaces of up to three dimensions, but spaces above three dimensions are beyond our perceptual capability; many computations are worked out in spaces of three or fewer dimensions and then extended to high-dimensional spaces. The so-called "curse of dimensionality" arises from the sparsity and emptiness of high-dimensional space: data distributed in a high-dimensional space is very sparse, and yet isolated pockets of high density can appear inside otherwise empty regions. The curse of dimensionality was first proposed by Bellman. It refers in general to the series of problems caused by too many variables in data analysis, and it is somewhat like an exponential explosion: as the dimension increases, the data required expands rapidly to an unimaginable size.
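The sparsity can even be felt numerically. The sketch below (Python with NumPy, a toy experiment of my own) estimates the fraction of uniformly drawn points in the cube [-1, 1]^d that fall inside the inscribed unit ball; the fraction collapses toward zero as the dimension grows:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 5, 10, 20):
        pts = rng.uniform(-1, 1, (100_000, d))    # points in the cube [-1, 1]^d
        inside = (np.linalg.norm(pts, axis=1) <= 1).mean()
        print(f"dim {d:2d}: fraction inside unit ball = {inside:.4f}")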

SVM pattern recognition uses the kernel method to perform its transformations in a high-dimensional space, cleverly sidestepping the curse of dimensionality; many experiments have shown SVM classifiers performing better than other classification algorithms. Even with such a good method, we still want to reduce the dimension: doing so not only lowers the computational complexity but also eliminates unnecessary, interfering features. Among many features, some may be useless, that is, some "features" may not really be features at all, and removing them can improve the classifier's performance. Currently the main dimensionality-reduction method is PCA. Many presentations describe it through the covariance matrix, which leads to lengthy formula derivations, but the core idea is to exclude the features with the smallest variance. Once you know this, you can apply the PCA idea directly by computing the sample feature variances, without working through the covariance matrix.
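The kernel idea itself is easy to state in code: instead of mapping points into the high-dimensional space, we evaluate inner products there directly through a kernel function. A minimal sketch with the common RBF (Gaussian) kernel, assuming NumPy and random toy data:

    import numpy as np

    def rbf_kernel(X, Y, gamma=1.0):
        """K[i, j] = exp(-gamma * ||x_i - y_j||^2): an inner product in an
        implicit, very high-dimensional feature space, computed without
        ever visiting that space."""
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq)

    X = np.random.randn(5, 3)
    K = rbf_kernel(X, X)
    print(K.shape, np.allclose(K, K.T))   # (5, 5) True: a symmetric kernel matrix

An SVM works entirely with such kernel values, which is why it can separate classes "in a high-dimensional space" at the cost of ordinary low-dimensional arithmetic.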

 

The distance between two feature vectors can be expressed in many ways: Euclidean distance, absolute (city-block) distance, Chebyshev distance, Mahalanobis distance, Minkowski distance, similarity coefficients, and distances for qualitative indices. We are most familiar with the Euclidean distance, but in fact it is not the most commonly used in high-dimensional spaces, not only because of the computational load but also because different features have different units and scales, so we cannot treat every feature equally. The choice of distance in pattern recognition is very important and is tied to the design of the classifier; it must be applied flexibly according to the actual situation. Sometimes you can design a distance measure yourself, as long as the four conditions of a distance are met (a code sketch follows the list):

1. The distance is equal to 0 only when the two points coincide;

2. The distance value must be greater than or equal to 0;

3. Symmetry: the distance from point A to point B is equal to the distance from point B to point A;

4. Triangle inequality: among the three distances formed by any three points, the sum of any two is not less than the third.
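For illustration, here is a sketch of several of these distances in Python with NumPy (the sample data is made up); note how the Mahalanobis distance divides out the covariance of the data, so features with different units and spreads are no longer treated equally:

    import numpy as np

    a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

    euclidean = np.linalg.norm(a - b)               # sqrt of the sum of squared differences
    manhattan = np.abs(a - b).sum()                 # "absolute" (city-block) distance
    chebyshev = np.abs(a - b).max()                 # largest single-coordinate difference

    # Mahalanobis distance needs the covariance of the sample population.
    samples = np.random.randn(500, 2) * [1.0, 10.0] # feature 2 has a much larger spread
    inv_cov = np.linalg.inv(np.cov(samples, rowvar=False))
    mahalanobis = np.sqrt((a - b) @ inv_cov @ (a - b))

    print(euclidean, manhattan, chebyshev, mahalanobis)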

Discussion on Pattern Recognition Technology (4)

------------ About Machine Learning

When talking about machine learning, the first thing you think of is probably neural networks, but in fact there are many machine learning methods. The table below, from the machine learning summary in "Learning OpenCV" (Gary Bradski and Adrian Kaehler), describes the algorithms used in machine learning:

ML Algorithm: Comment

Mahalanobis: A distance measure that accounts for the "stretchiness" of the data space by dividing out the covariance of the data. If the covariance is the identity matrix (identical variance), this measure is identical to the Euclidean distance measure.

K-means: An unsupervised clustering algorithm that represents a distribution of data using K centers, where K is chosen by the user. The difference between this algorithm and expectation maximization is that here the centers are not Gaussian, and the resulting clusters look more like soap bubbles, since the centers (in effect) compete to "own" the closest data points. These cluster regions are often used as sparse histogram bins to represent the data. Invented by Steinhaus [Steinhaus56], as used by Lloyd.

Normal/Naive Bayes classifier: A generative classifier in which features are assumed to be Gaussian distributed and statistically independent from each other, a strong assumption that is generally not true. For this reason, it is often called a "naive Bayes" classifier. However, this method often works surprisingly well.

Decision trees: A discriminative classifier. The tree finds one data feature and a threshold at the current node that best divides the data into separate classes. The data is split, and the procedure is repeated recursively down the left and right branches of the tree. Though not often the top performer, it is often the first thing you should try because it is fast and has high functionality.

Boosting: A discriminative group of classifiers. The overall classification decision is made from the combined weighted classification decisions of the group of classifiers. In training, the classifiers are learned one at a time. Each classifier in the group is a "weak" classifier (only just above chance performance). These weak classifiers are typically composed of single-variable decision trees called "stumps". In training, the decision stump learns its classification decisions from the data and also learns a weight for its "vote" from its accuracy on the data. Between training each classifier, the data points are re-weighted so that more attention is paid to data points where errors were made. This process continues until the total error over the data set, arising from the combined weighted vote of the decision trees, falls below a set threshold. This algorithm is often effective when a large amount of training data is available.

Random trees: A discriminative forest of many decision trees, each built down to a large or maximal splitting depth. During learning, each node of each tree is allowed to choose splitting variables only from a random subset of the data features. This helps ensure that each tree becomes a statistically independent decision maker. In run mode, each tree gets an unweighted vote. This algorithm is often very effective and can also perform regression by averaging the output numbers from each tree.

Face detector / Haar classifier: An object detection application based on a clever use of boosting. The OpenCV distribution comes with a trained frontal face detector that works remarkably well. You may train the algorithm on other objects with the software provided. It works well for rigid objects and characteristic views.

Expectation maximization (EM): A generative unsupervised algorithm that is used for clustering. It fits N multidimensional Gaussians to the data, where N is chosen by the user. This can be an effective way to represent a more complex distribution with only a few parameters (means and variances). Often used in segmentation. Compare with K-means listed previously.

K-nearest neighbors: The simplest possible discriminative classifier. Training data are simply stored with labels. Thereafter, a test data point is classified according to the majority vote of its K nearest other data points (in a Euclidean sense of nearness). This is probably the simplest thing you can do. It is often effective, but it is slow and requires lots of memory.

Neural networks / Multilayer perceptron (MLP): A discriminative algorithm that (almost always) has hidden units between output and input nodes to better represent the input signal. It can be slow to train but is very fast to run. Still the top performer for things like letter recognition.

Support vector machine (SVM): A discriminative classifier that can also do regression. A distance function between any two data points in a higher-dimensional space is defined. (Projecting data into higher dimensions makes the data more likely to be linearly separable.) The algorithm learns separating hyperplanes that maximally separate the classes in the higher dimension. It tends to be among the best with limited data, losing out to boosting or random trees only when large data sets are available.

 

If you want to learn all these algorithms well, the open-source OpenCV library is indeed a good choice. Of course, before studying an algorithm you should still study the relevant theory carefully; otherwise you may not be able to understand the implementation code.

 

Worth mentioning here is the "no free lunch" theorem: there is no single best algorithm. Every algorithm has its advantages and disadvantages, so we should not be superstitious about the absolute superiority of any one of them; when using an algorithm, you must understand what it gains and what it loses. A single algorithm often cannot meet practical needs, and multiple algorithms are frequently combined to improve recognition performance. Another useful principle is Occam's razor: do not complicate the problem, and try to eliminate the useless factors that would complicate it. The same applies to recognition algorithms: a more complex algorithm is not necessarily more useful, and sometimes a simple algorithm achieves better performance.

 

In machine learning, most algorithms are used to find (or fit) a separating curve. If the data is linearly separable, the separator is a straight line (in a high-dimensional space, a hyperplane); if it is nonlinear, the separator is a curve (in a high-dimensional space, a hypersurface). Understanding this, you may better appreciate why the functions used at neural network nodes are often related to sine and cosine functions: by the principle of the Fourier transform, fitting a curve with sines and cosines is an excellent choice.

 

As for the research and development of pattern recognition algorithms, I personally think innovative thinking and solid technical accumulation matter most. For a given recognition problem there may be many methods to choose from, yet none may suit you; in that case you need to apply a recognition method innovatively, or even create a new and unique one. Once the algorithmic direction is established, what remains is long-term technical accumulation: turning a good recognition algorithm into a usable product generally requires a long process of improvement, and no impetuous, short-sighted mentality can succeed.

Basic Principles of Neural Networks

 

For learning pattern recognition, I personally think starting with neural networks may be a good choice. On the one hand, it avoids plunging straight into complicated formula derivations; on the other hand, it lets us quickly experience what pattern recognition feels like, because we can practice with Matlab or OpenCV (when learning a technology, plenty of practice is very helpful for understanding the theory). Neural network technology approaches pattern recognition from a bionic perspective, and exploring intelligence that imitates humans has always been a goal of scientific research; neural networks are founded on this idea, yet they can be applied to solve concrete mathematical problems. The gradient descent method from optimization theory is the core of the neural network's implementation principle, and the gradient descent algorithm is a cyclic computing process (a minimal code sketch follows the steps below):

1. Choose initial values for the model parameters, or select some initial values at random;

2. Compute the gradient of the loss function with respect to each parameter;

3. Change the parameter values along the negative gradient direction so that the error becomes smaller;

4. Repeat steps 2 and 3 until the gradient value is close to 0.
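The four steps translate directly into a loop. A minimal sketch in Python, with a made-up one-parameter loss:

    # Gradient descent on the toy loss L(w) = (w - 3)^2, whose gradient is 2*(w - 3).
    w = 0.0           # step 1: initial parameter value
    lr = 0.1          # learning rate
    for _ in range(100):
        grad = 2 * (w - 3)        # step 2: gradient of the loss
        w -= lr * grad            # step 3: move against the gradient
        if abs(grad) < 1e-6:      # step 4: stop when the gradient is near zero
            break
    print(w)          # converges to 3, the minimum of the loss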

 

The neural network method uses training samples to learn to fit a separating line (in two dimensions a line or curve, in three dimensions a plane or surface, and in higher dimensions a hyperplane or hypersurface). If the separator is a straight line (or plane, or hyperplane), the network is called a linear neural network; otherwise it is a nonlinear neural network. Linear neural networks are easier to understand, and once you understand them, nonlinear neural networks become easier too, so here we use a linear neural network to explain the principles. Picture a two-dimensional feature distribution chart with a straight line through the middle as the separator (the original figure is omitted here); what concerns us now is how this separating line is computed. With some mathematics we know it can be computed by the least squares method, but here we will obtain it with the neural network learning method:

 

 

Writing the separating line as w1 * x1 + w2 * x2 + b = 0, we see that as long as we can obtain the values of w1, w2, and b, we can find this line. Accordingly, we construct a neural network topology with two inputs weighted by w1 and w2 and a threshold b (the original topology figure is omitted here):

 

 

We call w1 and w2 the weights and b the threshold. The learning process of a neural network is to keep adjusting the weights and threshold until the error is minimized. For a linear neural network we can use the LMS algorithm, that is, the Least Mean Squares algorithm, to obtain the weights and the threshold. The LMS algorithm is described as follows:

Principle: minimize the mean squared error by adjusting the weights (w) and the threshold (b) of the linear neural network.

Given a sample set {p1, t1}, {p2, t2}, {p3, t3}, ..., {pn, tn} (if the sample features are multidimensional, each p is a vector):

Mean squared error: mse = sum(e(i)^2)/n = sum((t(i) - a(i))^2)/n, where i = 1..n and a(i) = w * p(i) + b.

Suppose at step k we have obtained the weight gradient Gw and the threshold gradient Gb. The weights and threshold at step k+1 are then:

w(k+1) = w(k) - Gw * α;

b(k+1) = b(k) - Gb * α; where α is the learning rate.

The next question is how to compute the gradients. If each change of the weights and threshold reduces the mean squared error, we achieve our goal; therefore we take the partial derivatives of the error with respect to the weights and the threshold. These derivatives are the gradients we want: they express how a change in a weight or the threshold changes the error. The derivation is as follows:

∂e(i)^2/∂w = 2e(i) * ∂e(i)/∂w = 2e(i) * ∂(t(i) - a(i))/∂w = 2e(i) * ∂[t(i) - (w * p(i) + b)]/∂w = -2e(i) * p(i);

∂e(i)^2/∂b = 2e(i) * ∂e(i)/∂b = 2e(i) * ∂(t(i) - a(i))/∂b = 2e(i) * ∂[t(i) - (w * p(i) + b)]/∂b = -2e(i);

The average error at step k is e(k) = sum(e(i))/n;

Then we obtain the update equations for the weights and the threshold:

w(k+1) = w(k) - Gw * α = w(k) + 2 * e(k) * p * α;

b(k+1) = b(k) - Gb * α = b(k) + 2 * e(k) * α;
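The algorithm above translates almost line for line into code. Below is a minimal sketch in Python with NumPy, using toy two-class data (all names and numbers are illustrative, not from the original article); the two update lines implement w(k+1) = w(k) + 2*α*e*p and b(k+1) = b(k) + 2*α*e:

    import numpy as np

    rng = np.random.default_rng(0)
    P = rng.uniform(-1, 1, (200, 2))     # samples p(i), two features each
    # Toy targets t(i) = +1 or -1, defined by a hidden linear rule to be recovered.
    T = (0.8 * P[:, 0] - 0.5 * P[:, 1] + 0.3 > 0).astype(float) * 2 - 1

    w, b, lr = np.zeros(2), 0.0, 0.05
    for _ in range(100):                 # repeated passes over the sample set
        for p, t in zip(P, T):
            e = t - (w @ p + b)          # error e(i) = t(i) - a(i)
            w += 2 * lr * e * p          # w(k+1) = w(k) + 2*alpha*e*p
            b += 2 * lr * e              # b(k+1) = b(k) + 2*alpha*e
    print(w, b)                          # the learned separating line w . x + b = 0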

 

The neural network described above is in fact a single-layer neural network. As early as 1969, M. Minsky and S. Papert, in their book "Perceptrons", analyzed single-layer neural networks in depth and proved mathematically that their capabilities are limited: they cannot even solve simple logical operations such as XOR. They also found that many patterns cannot be trained with a single-layer network. What made neural networks widely applicable was the development of the BP (back-propagation) learning algorithm in 1985, which realized the multilayer networks Minsky had envisioned. A BP network is a multilayer feed-forward neural network whose neuron transfer function is an S-shaped (sigmoid, nonlinear) function; it can realize any nonlinear mapping from input to output. Because the weights are adjusted with the back-propagation learning algorithm, it is called a back-propagation network. Currently, most applications of artificial neural networks use BP networks and their variants; they are the core of feed-forward networks and embody the essence of artificial neural networks. BP neural networks can be used not only for pattern recognition but also for function approximation and data compression.

 

The BP algorithm is very similar to the algorithm described above: it also determines the adjustment direction of the weights and thresholds from the mean squared error, obtaining their gradient directions by taking partial derivatives with respect to the weights and thresholds. One difference is that a BP neural network has several layers, and the weight and threshold changes are computed layer by layer starting from the output layer (hence the name back-propagation learning algorithm); another difference is that some neurons use the log-sigmoid nonlinear function, which must be differentiated as well.
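As an illustration of these two differences, here is a compact two-layer sketch in Python with NumPy (the toy data and all names are my own, not from the original article): the output-layer error is propagated backwards through the derivative of the sigmoid, and each layer's weights move along their negative gradient. The task is XOR, the very problem a single-layer network cannot solve:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # XOR: the classic problem a single-layer network cannot solve.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden
    W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output
    lr = 1.0

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)                     # forward pass
        y = sigmoid(h @ W2 + b2)
        d2 = (y - T) * y * (1 - y)                   # output-layer delta (sigmoid derivative)
        d1 = (d2 @ W2.T) * h * (1 - h)               # error propagated back to the hidden layer
        W2 -= lr * (h.T @ d2)
        b2 -= lr * d2.sum(axis=0)
        W1 -= lr * (X.T @ d1)
        b1 -= lr * d1.sum(axis=0)

    print(y.round(3).ravel())                        # should approach [0, 1, 1, 0]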

 

The main disadvantages of the BP algorithm are slow convergence, multiple local extrema, and the difficulty of determining the number of hidden layers and hidden-layer nodes. In practical applications, therefore, the plain BP algorithm is often not up to the task, and it has been improved along two lines: heuristic learning algorithms (improving the algorithm by analyzing the gradient of the performance function) and more effective optimization algorithms (training algorithms based on numerical optimization theory). Heuristic learning algorithms include gradient descent with momentum, gradient descent with an adaptive learning rate, gradient descent with both momentum and adaptivity, and resilient BP training; the optimization-theory algorithms include the conjugate gradient method, the Gauss-Newton method, and the Levenberg-Marquardt method. These improved algorithms can all be found in Matlab, which provides a rich set of neural network algorithms: besides BP networks there are radial-basis-function networks (such as generalized regression neural networks and probabilistic neural networks), feedback networks (such as Hopfield and Elman networks), and competitive networks (such as self-organizing feature map networks and learning vector quantization networks). Matlab is a very good tool. If you want to see a concrete implementation, OpenCV provides an implementation of the BP algorithm; unfortunately, at present BP is the only neural network algorithm OpenCV implements, and we hope more neural network algorithms will appear in OpenCV.

 

We should not be superstitious about neural networks either. With many samples and many network nodes, convergence becomes slow and learning takes a long time; and because of the many local extrema, different initial values and different training samples can give different learning results, so training often has to be repeated several times to reach a good result. Designing a network topology appropriate to the complexity of the problem is also a very difficult task. Neural networks are the result of humans imitating the principles of biological neural networks, but they are still far from achieving the functions of biological ones: today's artificial intelligence is inferior even to a cockroach, not even as good as a little ant. Artificial intelligence research still has a long way to go.

Source: http://blog.csdn.net/dznlong/category/383067.aspx

http://www.cnblogs.com/skyseraph/
