Brief History of Machine Learning
My subjective ML timeline
Since the earliest days of science, technology and AI, scientists following Blaise Pascal and Von Leibniz have pondered a machine that is as intellectually capable as humans. Famous writers like Jules
Pascal's machine performing subtraction and summation, 1642
Machine learning is one of the important lanes of AI and a very hot subject in research and industry. Companies and universities devote many resources to advancing their knowledge. Recent advances in the field yield very solid results on different tasks, comparable to human performance (98.98% on traffic sign recognition, higher than human performance).
Here I would like to share a crude timeline of machine learning and mark some of the milestones; it is by no means complete. In addition, you should add "to the best of my knowledge" to the beginning of every claim in the text.
The first step toward prevalent ML was proposed by Hebb in 1949, based on a neuropsychological learning formulation. It is called Hebbian learning theory. Put simply, it pursues correlations between the nodes of a recurrent neural network (RNN). It memorizes any commonalities on the network and later serves like a memory. Formally, the argument states that:
Let us assume that the persistence or repetition of a reverberatory activity (or "trace") tends to induce lasting cellular changes that add to its stability. ... When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. [1]
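To make the "cells that fire together wire together" idea concrete, here is a minimal sketch of a Hebbian weight update. The learning rate, the zero self-connections and the toy activity pattern are my own assumptions for illustration; they are not taken from Hebb's text.

```python
def hebbian_update(weights, activity, learning_rate=0.1):
    """Strengthen the connection between every pair of co-active nodes.

    Assumed rule: w[i][j] += learning_rate * activity[i] * activity[j],
    with self-connections (i == j) left at zero.
    """
    n = len(activity)
    return [
        [
            weights[i][j]
            + (learning_rate * activity[i] * activity[j] if i != j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy pattern (hypothetical): nodes 0 and 1 fire together, node 2 stays silent.
pattern = [1.0, 1.0, 0.0]
weights = [[0.0] * 3 for _ in range(3)]

# Repeated presentation of the same pattern keeps reinforcing the 0 <-> 1 link.
for _ in range(5):
    weights = hebbian_update(weights, pattern)

print(weights[0][1], weights[0][2])
```

After a few presentations, the link between the two co-active nodes has grown while the links to the silent node are unchanged, which is the correlation "memory" described above.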
Arthur Samuel
In 1952, Arthur Samuel at IBM developed a program that played checkers. The program was able to observe positions and learn an implicit model that gave better moves in later cases. Samuel played many games against the program and observed that it played better over time.
With this program Samuel refuted the general belief that machines cannot go beyond their written code and learn patterns like human beings. He coined the term "machine learning," which he defined as:
A field of study that gives computers the ability to learn without being explicitly programmed.
F. Rosenblatt
In 1957, Rosenblatt's Perceptron was the second model proposed, again with a neuroscientific background, and it is more similar to today's ML models. It was a very exciting discovery at the time and was practically more applicable than Hebb's idea. Rosenblatt introduced the Perceptron with the following lines:
The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms. [2]
Three years later, Widrow [4] introduced the delta learning rule, which was then used as a practical procedure for Perceptron training. It is also known as the least mean squares (LMS) rule. The combination of those ideas creates a good linear classifier. However, the excitement around the Perceptron was checked by Minsky [3] in 1969. He posed the famous XOR problem and showed the inability of perceptrons to handle such linearly inseparable data distributions. This was Minsky's tackle on the NN community. Thereafter, NN research would lie dormant until the 1980s.
The XOR problem, a data layout that is not linearly separable
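The limitation is easy to reproduce. The sketch below trains a single perceptron with the classic error-driven update on AND (linearly separable) and on XOR (not separable). The function name, the epoch budget and the learning rate are my own choices for illustration, not anything from the cited papers.

```python
def train_perceptron(samples, epochs=100, learning_rate=1.0):
    """Train a two-input perceptron; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - predicted
            if delta != 0:
                # Move the decision boundary toward the misclassified sample.
                w[0] += learning_rate * delta * x1
                w[1] += learning_rate * delta * x2
                b += learning_rate * delta
                errors += 1
        if errors == 0:
            # A full pass without mistakes: the data is separated.
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND converged:", and_converged)
print("XOR converged:", xor_converged)
```

AND is learned within a handful of passes, while XOR never reaches an error-free pass no matter how large the epoch budget is, because no single line separates its two classes.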
There was not much effort until the idea of the multi-layer perceptron (MLP) was suggested by Werbos [6] in 1981 with an NN-specific backpropagation (BP) algorithm, although the BP idea had been proposed earlier by Linnainmaa [5] in 1970 under the name "reverse mode of automatic differentiation". BP is still the key ingredient of today's NN architectures. With those new ideas, NN research accelerated again. In 1985–1986, NN researchers successively presented the idea of the MLP with practical BP training (Rumelhart, Hinton, Williams [7]; Hecht-Nielsen [8]).
From Hecht-Nielsen [8]
At the other end of the spectrum, a very well-known ML algorithm was proposed by J. R. Quinlan [9] in 1986: decision trees, more specifically the ID3 algorithm. This was the spark for another mainstream branch of ML. Moreover, ID3 was also released as software, and it found more real-life use cases thanks to its simple rules and clear inference, in contrast to the still black-box NN models.
After ID3, many different alternatives and improvements have been explored by the community (e.g. ID4, regression trees, CART ...), and it is still one of the active topics in ML.
From Quinlan [9]
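The core of ID3 is choosing, at each node, the attribute with the highest information gain. The sketch below shows only that selection step, not the full recursive tree builder; the tiny weather-style dataset and the helper names are hypothetical and purely illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "outlook" decides the label, "windy" is pure noise.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["play", "play", "stay", "stay"]

gains = {name: information_gain(rows, labels, name) for name in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

The split on "outlook" removes all uncertainty, so it would become the root test, and the resulting rule ("if outlook is sunny then play") is exactly the kind of readable inference that made decision trees attractive.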
One of the most important ML breakthroughs was the support vector machine (network) (SVM), proposed by Vapnik and Cortes [10] in 1995 with very strong theoretical standing and empirical results. That was the time the ML community split into two crowds, NN advocates and SVM advocates. However, the competition between the two communities was not easy for the NN side after the kernelized version of SVM appeared near the 2000s (I wasn't able to find the first paper on the topic); SVM got the best of many tasks that had been occupied by NN models before. In addition, SVM was able to exploit all the profound knowledge of convex optimization, generalization margin theory and kernels against NN models. Therefore, it received a large push from different disciplines, leading to very rapid theoretical and practical improvements.
From Vapnik and Cortes [10]
NN took another blow from Hochreiter's thesis [+] in 1991 and Hochreiter et al. [11] in 2001, which showed the gradient loss after the saturation of NN units as we apply BP learning. Simply put, it is redundant to train NN units after a certain number of epochs owing to saturated units; hence NNs were very inclined to over-fit in a short number of epochs.
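A rough numerical illustration of that gradient loss: backpropagation multiplies one activation derivative per layer, and the sigmoid derivative is never larger than 0.25 and is close to zero for saturated units. The sketch below ignores the weight terms entirely (an assumption that keeps it short), so it only shows the trend, not an exact gradient.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backpropagated_factor(pre_activations):
    """Product of sigmoid derivatives along a chain of units.

    Assumption: all weights are 1, so only the activation derivatives
    contribute to the factor that scales the gradient.
    """
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_derivative(z)
    return factor


shallow = backpropagated_factor([0.0] * 2)   # best case, 2 layers
deep = backpropagated_factor([0.0] * 20)     # best case, 20 layers
saturated = sigmoid_derivative(5.0)          # one strongly saturated unit

print(shallow, deep, saturated)
```

Even in the best case (every unit at z = 0), the factor shrinks geometrically with depth, and a single saturated unit already cuts the signal by more than two orders of magnitude.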
A little earlier, another solid ML model was proposed by Freund and Schapire in 1997: a boosted ensemble of weak classifiers called AdaBoost. This work also earned the authors the Gödel Prize at the time. AdaBoost trains a set of weak classifiers that are easy to train, giving more importance to hard instances. This model is still the basis of many different tasks such as face recognition and detection. It is also a realization of PAC (Probably Approximately Correct) learning theory. In general, the so-called weak classifiers are chosen to be simple decision stumps (single decision tree nodes). They introduced AdaBoost as:
The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting ... [11]
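Here is a minimal sketch of the reweighting idea with decision stumps on one-dimensional data. It follows the standard discrete AdaBoost update as I understand it; the ten-point toy dataset, the function names and the choice of thresholds at the data points are assumptions made for the example.

```python
from math import exp, log


def stump_predict(x, threshold, polarity):
    """Decision stump: predicts `polarity` right of the threshold, else the opposite."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified (hard) instances gain weight, easy ones lose it.
        weights = [
            w * exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy labels that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_errors = sum(predict(single, x) != y for x, y in zip(xs, ys))
boosted_errors = sum(predict(boosted, x) != y for x, y in zip(xs, ys))
print(single_errors, boosted_errors)
```

One stump alone misclassifies part of this data, while the weighted vote of three stumps fits it, because each round concentrates on the instances the previous stumps got wrong.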
Another ensemble model was explored by Breiman [12] in 2001; it ensembles multiple decision trees, where each of them is built from a random subset of instances and each node is selected from a random subset of features. Owing to its nature, it is called Random Forests (RF). RF also has theoretical and empirical proofs of endurance against over-fitting. Whereas even AdaBoost shows weakness to over-fitting and outlier instances in the data, RF is a more robust model against these caveats. (For more detail on RF, refer to my old post.) RF has also shown its success in many different tasks, such as Kaggle competitions.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. [12]
As we come closer to today, a new era of NN called deep learning has commenced. The phrase simply refers to NN models with many wide successive layers. The third rise of NN began roughly in 2005 with the conjunction of many different discoveries from the past and present by recent mavens Hinton, LeCun, Bengio, Andrew Ng and other valuable older researchers. I have listed some of the important headings (I guess I'll dedicate a complete post to deep learning specifically):
- GPU programming
- Convolutional NNs [18][20][40]
- Deconvolutional networks [+]
- Optimization algorithms
- Stochastic gradient descent [19][22]
- BFGS and L-BFGS [23]
- Conjugate gradient descent []
- Backpropagation [40][19]
- Rectifier units
- Sparsity [15][16]
- Dropout nets []
- Unsupervised NN models [+]
- Deep belief networks []
- Stacked auto-encoders [16][39]
- Denoising NN models [+]
With the combination of all those ideas and others not listed, NN models are able to beat the state of the art on very different tasks such as object recognition, speech recognition, NLP, etc. However, it should be noted that this absolutely does not mean the end of other ML streams. Even though deep learning success stories grow rapidly, there are many criticisms directed at the training cost and the tuning of exogenous parameters of these models. Moreover, SVM is still used more commonly owing to its simplicity. (Said, but it may cause a huge debate.)
Before finishing, I need to touch on another relatively young ML trend. After the growth of the WWW and social media, a new term, big data, emerged and affected ML wildly. Because of the large problems arising from big data, many strong ML algorithms are useless for reasonable systems (not for giant tech companies, of course). Hence, people came up with a new set of simple models dubbed bandit algorithms [27–38] (formally predicated on online learning) that make learning easier and adaptable to large-scale problems.
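As a flavor of how simple such online models can be, below is a sketch of an epsilon-greedy multi-armed bandit. It is only one textbook strategy and is not taken from any of the cited papers; the reward rates, the exploration rate and the seed are made-up values for illustration.

```python
import random


def epsilon_greedy(true_rates, pulls, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit on Bernoulli arms.

    Returns (counts, values): how often each arm was pulled and the
    running estimate of its reward rate.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: one pass over the data, constant memory per arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Hypothetical click-through rates of three options.
counts, values = epsilon_greedy([0.2, 0.5, 0.8], pulls=5000)
best_arm = max(range(len(values)), key=lambda a: values[a])
print(counts, [round(v, 2) for v in values], best_arm)
```

Each observation is processed once and then discarded, which is what makes this family of methods attractive when the data stream is too large to store or revisit.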
I would like to conclude this infant sheet of ML history here. If you find something wrong (you should), insufficient or non-referenced, please don't hesitate to warn me by any means.
References
[1] Hebb, D. O. The Organization of Behavior. New York: Wiley & Sons.
[2] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65.6 (1958): 386.
[3] Minsky, Marvin, and Seymour Papert. "Perceptrons." (1969).
[4] Widrow, Hoff. "Adaptive switching circuits." (1960): 96-104.
[5] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.
[6] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8–4.9, NYC, pages 762–770, 1981.
[7] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. No. ICS-8506. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[8] Hecht-Nielsen, Robert. "Theory of the backpropagation neural network." Neural Networks, 1989. IJCNN., International Joint Conference on. IEEE, 1989.
[9] Quinlan, J. Ross. "Induction of decision trees." Machine Learning 1.1 (1986): 81-106.
[10] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine Learning 20.3 (1995): 273-297.
[11] Freund, Yoav, Robert Schapire, and N. Abe. "A short introduction to boosting." Journal-Japanese Society for Artificial Intelligence 14.771-780 (1999): 1612.
[12] Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
[13] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural Computation 18.7 (2006): 1527-1554.
[14] Bengio, Lamblin, Popovici, Larochelle. "Greedy layer-wise training of deep networks", NIPS 2006.
[15] Ranzato, Poultney, Chopra, LeCun. "Efficient learning of sparse representations with an energy-based model", NIPS 2006.
[16] Olshausen B. A., Field D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res. 1997;37(23):3311–25. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9425546.
[17] Vincent, H. Larochelle, Y. Bengio and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML '08), pages 1096–1103, ACM, 2008.
[18] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
[19] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
[20] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The Handbook of Brain Theory and Neural Networks 3361 (1995).
[21] Zeiler, Matthew D., et al. "Deconvolutional networks." Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2010.
[22] S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In International Conference on Machine Learning (ICML '06), 2006.
[23] Nocedal, J. (1980). "Updating quasi-Newton matrices with limited storage." Mathematics of Computation 35 (151): 773–782. doi:10.1090/s0025-5718-1980-0572855-
[24] S. Yun and K.-C. Toh, "A coordinate gradient descent method for l1-regularized convex minimization," Computational Optimization and Applications, vol, no. 2, pp. 273–307, 2011.
[25] Goodfellow I, Warde-Farley D. Maxout networks. arXiv prepr arXiv.... Available at: http://arxiv.org/abs/1302.4389. Accessed March 20, 2014.
[26] Wan L, Zeiler M. Regularization of neural networks using DropConnect. Proc.... 2013; (1). Available at: http://machinelearning.wustl.edu/mlpapers/papers/icml2013_wan13. Accessed March 13, 2014.
[27] Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford. A Reliable Effective Terascale Linear Learning System, 2011.
[28] M. Hoffman, D. Blei, F. Bach. Online Learning for Latent Dirichlet Allocation. In Neural Information Processing Systems (NIPS) 2010.
[29] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic Active Learning Without Constraints. NIPS 2010.
[30] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR & COLT 2010.
[31] H. Brendan McMahan, Matthew Streeter. Adaptive Bound Optimization for Online Convex Optimization. COLT 2010.
[32] Nikos Karampatziakis and John Langford. Importance Weight Aware Gradient Updates. UAI 2010.
[33] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg. Feature Hashing for Large Scale Multitask Learning. ICML 2009.
[34] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash Kernels for Structured Data. AISTATS 2009.
[35] John Langford, Lihong Li, and Tong Zhang. Sparse Online Learning via Truncated Gradient. NIPS 2008.
[36] Leon Bottou. Stochastic Gradient Descent, 2007.
[37] Avrim Blum, Adam Kalai, and John Langford. Beating the Holdout: Bounds for KFold and Progressive Cross-Validation. COLT99, pages 203-208.
[38] Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
[39] D. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.
[40] Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.