Machine Learning System Design (Building Machine Learning Systems with Python) - Willi Richert, Luis Pedro Coelho
General remarks
The book was published in 2014; only after finishing it did I discover that an updated second edition came out in 2016. I recommend reading the latest edition, in English if you can, since the Chinese translation is awkward in places (though the English edition is admittedly rather expensive).
My purpose in reading: extensive reading is mainly a way to glimpse how other people think.
The author's goal in writing the book: it is aimed at beginners, but it is still worth a look if you have the time. As the author puts it, "I hope it sparks your curiosity and is enough to keep you eager and constantly exploring this interesting field." In my opinion the book achieves this goal: compared with heavily theoretical books, a book like this is much easier to stick with and far less obscure.
What I liked:
- It focuses on the engineering practice of machine learning systems. There is no obscure theory, yet the content is enough to explain the modeling and solving process clearly. Readers with time can follow along step by step. I did not work through the code myself, since lab work keeps me busy, but some of the ideas carry over to my own work. (Much of the value of reading is seeing how others approach the same problem, and letting that diverge your own thinking.)
- You can feel the author teaching us how to learn. Unlike the many books that hand you the best solution directly, this one starts from the most basic baseline, then gradually uncovers problems and tunes the model, a process that matches real engineering practice. You cannot get fat on one bite; knowing how to decompose a problem and optimize step by step is the key. For example: (1) At the very beginning, the author addresses what to do when you get stuck and shares his learning methods: build your own list of machine-learning blogs and read them regularly, and remember that Kaggle matters! This conveys his attitude toward problems: learn to accumulate and grow. (2) Chapter 2 upgrades from a simple classifier to a more complex one: a model structure (a threshold partitioning a feature), a search procedure (try as many feature and threshold combinations as possible), a loss function (to decide which possibilities are not too bad), and repeated iteration. (3) Chapter 5 detects poor-quality answers and, through bias-variance analysis, finds where the model can be optimized and tunes it. (4) Chapter 6 analyzes sentiment, and through feature analysis and feature engineering keeps discovering valuable features to tune the model. (5) Chapter 9 classifies music genres, moving the audio features from the fast Fourier transform (FFT) to Mel-frequency cepstral coefficients (MFCC), information the author gleaned from papers in music information retrieval. Visibly, tuning requires understanding the domain, researching it broadly, and being able to sift information. (6) Finally, the author shares some learning resources he considers good.
Additional notes:
The book's intended audience means it is weaker on systematic theory. So if you want to dig deep into ML, excavate the theories and concepts it touches on; the books the author recommends at the end are a good choice. As for tools, I see them only as means to solve problems; anyone with a computer-science background should have their own quick way of picking up a new tool. It is hard to find the time to practice everything in every book (the Amazon cloud, for example); try it if you have time, but if time is tight it is enough to know that such a tool exists and what kinds of problems it solves.
It is also worth emphasizing that the information age moves quickly, and many books, especially ones about tools, are soon overtaken. In real applications you should still do enough research to understand the latest developments.
The main route of a machine learning system: problem abstraction, data acquisition, data exploration, data cleansing, feature extraction, model selection and tuning. Take care to evaluate results with sound evaluation methods, and research the domain thoroughly so you can identify problems and adjust the model better. Most noteworthy is feature engineering: designing features is often more of an art. In general: accumulate more, think more divergently, get hands-on, reflect and summarize, and improve step by step.
Review of each chapter
1. Getting started with Python machine learning:
This chapter introduces the book's orientation and some learning tips. It briefly introduces the required Python libraries: NumPy, SciPy, matplotlib, etc., and demonstrates a small application, a regression problem (a minimal sketch follows). The engineering practice of machine learning: collect data, read and cleanse data, explore and understand data, engineer features, select the right model and algorithm, and evaluate correctly. Every step is critical, and in engineering practice the data-handling steps (cleaning, exploration, understanding, feature engineering) play an especially important role; in fact, most of the time is spent before modeling. Also introduced: the concept of overfitting, training set and validation set.
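A minimal sketch of that kind of toy regression with NumPy's polyfit, assuming made-up data rather than the book's web-traffic dataset:

```python
import numpy as np

# Toy data standing in for the chapter's web-traffic example
# (the values here are made up).
x = np.linspace(0, 10, 100)
y = 3 * x + 2 * np.sin(x) + np.random.randn(100)

# Fit polynomials of increasing degree and compare the training error.
for degree in (1, 2, 5):
    model = np.poly1d(np.polyfit(x, y, degree))
    error = np.sum((model(x) - y) ** 2)
    print("degree %d -> squared training error %.1f" % (degree, error))

# Training error always drops as the degree grows; a held-out
# validation set is what reveals where overfitting begins.
```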
2. Classifying with real-world examples:
- The Iris dataset, a classic, and one of the basic sample datasets used throughout scikit-learn tutorials. The focus here is cross-validation (Zhou Zhihua's Machine Learning has a good summary of model evaluation); see the sketch after these notes. Errors: training error, test error, generalization error. Our ultimate goal: reduce the generalization error. A more complex classifier consists of: a model structure (a threshold partitioning a feature), a search procedure (try as many feature and threshold combinations as possible), and a loss function (to decide which possibilities are not too bad), iterated again and again.
- The Seeds dataset, which brings more complex data and a more complex classifier. The focus is features and feature engineering: you need background knowledge and intuition to judge what makes a good feature. Fortunately, in many fields there is plenty of literature on the kinds of features that might be useful. (This again illustrates the need for domain knowledge and broad research.) Feature engineering is often where the biggest accuracy gains come from, because better feature data can often beat a fancier method (the core of CNNs is feature extraction). There are many options to mix and match. Binary vs. multiclass classification.
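A minimal cross-validation sketch on Iris. Note it uses today's scikit-learn API (sklearn.model_selection, which postdates the 2014 book) and a kNN classifier instead of the chapter's hand-rolled threshold model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation: each fold serves once as the held-out set,
# giving a far better estimate of generalization error than one split.
iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("mean accuracy %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```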
3. Clustering: Find Related Posts
- A brief introduction to the background of text processing. Terminology: bag-of-words, similarity measures (cosine, Pearson, Jaccard), frequency-vector normalization, removing unimportant words (stop words), stemming, term frequency-inverse document frequency (TF-IDF).
- Tool: NLTK
- Steps: (1) extract the features of each post and vectorize it, mapping the post to a vector; (2) cluster those vectors; (3) find the cluster of the post in question; (4) from that cluster, fetch a few posts that differ from it, to promote diversity.
- Preprocessing goal: cut the text down to what matters. Throw away words that appear too often and do not help detection (stop words); throw away words that appear too rarely and have only a small chance of appearing in future posts; count the remaining words; then, over the whole corpus, compute TF-IDF from the word-frequency statistics. (Nowadays representation learning with deep learning dominates; recurrent neural networks have achieved good results on text, and the TensorFlow tutorial is good: https://www.tensorflow.org/tutorials/recurrent)
- Clustering: K-means. Noise handling. Parameter tuning. (A minimal end-to-end sketch follows.)
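A minimal sketch of the chapter's pipeline with scikit-learn, assuming a made-up toy corpus and omitting the NLTK stemming step the chapter adds:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus standing in for the chapter's posts.
posts = [
    "imaging databases store images",
    "imaging databases can get huge",
    "most imaging databases save images permanently",
    "how to safely store money",
    "where to invest money safely",
]

# Stop-word removal and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

# Cluster the vectors with K-means, as the chapter does.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment per post

# Related posts for a new query live in the cluster it gets assigned to.
print(km.predict(vectorizer.transform(["store images in databases"])))
```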
4. Topic models
- An extension of the previous chapter: a more advanced way of grouping text. The task: take a collection of text and reverse-engineer it, discovering which topics it contains and which topics each document belongs to. What are these topics? Technically, they are multinomial probability distributions over words.
- LDA applied to the whole of Wikipedia: on average each document involves only 6.5 topics, and 93% of documents involve 10 topics or fewer.
- Choosing the number of topics. Removing stop words.
- Tool: gensim (a minimal sketch follows).
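A minimal gensim LDA sketch, assuming a made-up four-document corpus (the chapter trains on the AP news corpus and later on Wikipedia; results on a corpus this tiny are unstable):

```python
from gensim import corpora, models

# Tiny tokenized documents (made up for illustration).
docs = [
    "cats and dogs are pets".split(),
    "dogs chase cats".split(),
    "stocks and bonds are investments".split(),
    "investors buy stocks and bonds".split(),
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# The number of topics is a parameter you must choose yourself.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

print(lda[corpus[0]])        # topic weights for the first document
print(lda.show_topic(0, 4))  # top 4 words of topic 0
```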
5. Classification: detecting poor-quality answers (!!)
- There is no perfect answer. Tuning route: kNN -> logistic regression -> a model that is good enough for this data.
- Two questions:
- How to represent data samples: how to extract features;
- Which model or structure to use: logistic regression, decision trees, SVMs, naive Bayes. This chapter: kNN vs. logistic regression.
- Preprocessing: get the data; cut it down to a processable size; understand the role of each attribute and select accordingly (business background); define what a good answer is: if score > 0 it is a positive example, if score <= 0 a negative example.
- Classifier tuning: kNN + number of hyperlinks, accuracy 49% -> more features: add the number of code lines, accuracy 0.583 -> keep adding features: AvgSentLen, AvgWordLen, NumAllCaps, NumExclams; adding these 4 features makes things worse, accuracy 0.5765 -> so how to improve (4 directions): get more data, reconsider model complexity, modify the feature space, change the model.
- Picking among these at random is not a good approach. Better: analyze the bias-variance tradeoff, the balance between underfitting and overfitting. We want low bias and low variance at the same time, but in practice we must trade the two off, since reducing one tends to increase the other. (Andrew Ng's course and Zhou Zhihua's Machine Learning both summarize this well.)
- Strategies for high bias: add more features, make the model more complex, or try another model.
- Strategies for high variance: get more data, lower the model's complexity, use fewer features.
- In practice: plot train/test error against dataset size. (The book's figure contains an error: the train-error and test-error curves are swapped.) Plotting train/test error for different k shows that increasing k works better here, i.e., reducing complexity has some positive effect.
- kNN's drawbacks: it has to store all the training data, which is expensive in space, and prediction is time-consuming.
- Baseline: kNN, k=90, accuracy 0.628 -> logistic regression, choosing C=0.1, accuracy 0.631 -> bias-variance analysis; observation: high bias (test and train error are very close). Conclusion: the data is too noisy, and this feature set is not suited to separating the classes. -> Look behind the accuracy: precision and recall, the PR curve, AUC. Classifying poor-quality answers (A) vs. classifying high-quality answers (B): A's precision and recall are both very low, so drop it; B works well, and adjusting the decision threshold further yields 80% precision at 37% recall; can you tolerate a recall that low? (See the sketch after this list.) -> Slimming the classifier: judge feature importance by the logistic-regression coefficients and remove the unimportant features.
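A sketch of that threshold adjustment with scikit-learn. The 80%-precision target mirrors the chapter; the data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Noisy synthetic data standing in for the answer-quality features.
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(C=0.1, max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep the decision threshold and find the first point reaching
# 80% precision; the price is whatever recall remains there.
precision, recall, thresholds = precision_recall_curve(y_te, proba)
idx = np.argmax(precision >= 0.8)
print("precision %.2f at recall %.2f" % (precision[idx], recall[idx]))

# "Slimming": coefficients near zero mark features that can be dropped.
print(np.round(clf.coef_, 2))
```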
6. Classification II: sentiment analysis
- Background: companies need to closely monitor public attitudes toward important events such as product launches or press releases. With Twitter, we can classify the sentiment of tweets. This is sometimes also called opinion mining.
- Goals: (1) introduce a classification algorithm, naive Bayes; (2) explain part-of-speech (POS) tagging; (3) show a few handy tricks from the scikit-learn toolbox.
- Get Twitter data (text + labels: positive, negative, neutral).
- Naive Bayes:
- It handles irrelevant features very robustly;
- The source of "naive": features are assumed to be conditionally independent given the class. Posterior probability = prior * likelihood / evidence.
- Unseen words (features that never occurred in training): Laplace smoothing, i.e., add-one (additive) smoothing.
- Beware of arithmetic underflow: multiplying many small probabilities yields values that are far too small, so work in log space: log(x*y) = log(x) + log(y).
- Classifiers: GaussianNB, MultinomialNB, BernoulliNB.
- The easy problem: handle only the positive and negative classes; P/R AUC 0.88.
- Using all classes (positive, negative, neutral), P/R AUC: sentiment (pos or neg) vs. rest: 0.68; pos vs. rest: 0.31; neg vs. rest: 0.51.
- Tuning the classifier: TfidfVectorizer + MultinomialNB, with GridSearchCV to choose the parameter combination (a pipeline sketch follows this list). Evaluation metric: F1-score. pos vs. rest: 0.52, neg vs. rest: 0.64.
- Cleaning the tweets: sent vs. rest: 70.7 (an improvement; the cleaning is hooked into TfidfVectorizer's preprocessor).
- Taking word types into account: linguistic information: nouns, verbs, adjectives. Determining a word's type is part-of-speech (POS) tagging. Tools: NLTK; SentiWordNet assigns most English words a positive score and a negative score. Word sense disambiguation.
- Mixing everything together with FeatureUnion: TfidfVectorizer + word-type features + naive Bayes. pos vs. neg: 0.808, pos vs. neg: 0.794, pos vs. rest: 0.886, neg vs. rest: 0.881.
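A sketch of the tuning setup: TfidfVectorizer and MultinomialNB chained in a Pipeline, parameters chosen jointly by GridSearchCV. The six tweets and the parameter grid are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up tweet set; 1 = positive, 0 = negative.
tweets = ["I love this phone", "horrible battery life",
          "best purchase ever", "worst service ever",
          "great camera", "terrible screen"]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Search vectorizer and classifier parameters together, scored by F1.
params = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}
grid = GridSearchCV(pipeline, params, scoring="f1", cv=2)
grid.fit(tweets, labels)
print(grid.best_params_, grid.best_score_)
```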
7. Regression: recommendations
- Regression to predict house prices: ordinary least squares (OLS), which stops working when many features are involved.
- Better approaches: Lasso (L1 regularization), Ridge (L2 regularization), ElasticNet (Lasso + Ridge).
- Multidimensional regression. Penalized regression, L1 vs. L2. The Netflix Challenge.
- The p-greater-than-n problem: with p the number of features and n the number of samples, p > n, and OLS no longer applies.
- Setting hyperparameters wisely: estimating generalization requires two levels of cross-validation, e.g., 10-fold with 1 fold as the test set and 9 as the training set, the training set itself being split into train and validation (see the sketch after this list). Finally, before putting the model to use, train it once more on all the training samples.
- A good algorithm is a fine thing, but you must personally tune your method to the characteristics of the data.
- Modeling choices: classification vs. regression (ratings are coarse-grained, and a fractional prediction such as 1.5 is meaningful); two options: movie-specific or user-specific models. (Collaborative filtering, the user-movie matrix.)
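A sketch of penalized regression in a p > n setting, using scikit-learn's ElasticNetCV so an inner cross-validation chooses the regularization strength; the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# p > n: 50 samples, 200 features; OLS would be hopeless here.
X, y = make_regression(n_samples=50, n_features=200, n_informative=5,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inner 5-fold CV picks alpha; the outer held-out set measures
# generalization: the two-level scheme described above.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X_tr, y_tr)
print("chosen alpha: %.3f" % model.alpha_)
print("test R^2: %.2f" % model.score(X_te, y_te))
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```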
8. Regression: improving recommendations (recommended book: Recommendation System Practice by Xiang Liang)
- Shopping basket analysis.
- Improving the recommendations:
- Data: a 0/1 matrix, with no ratings or reviews.
- Steps: (1) compute the similarity between users and rank the other users by it; (2) when a (user, movie) score must be estimated, walk that user's nearest neighbors in order and output the first rating found for the movie (see the sketch after this list).
- Movie-based similarity: estimate the score from the scores of similar movies.
- Combining multiple approaches: ensemble learning / stacked learning.
- Shopping basket analysis: Apriori association-rule mining. Tool: pymining.
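A sketch of the neighbor-walking estimate from the steps above. The matrix, the overlap-based similarity, and the fallback are all illustrative choices, not necessarily the book's exact ones:

```python
import numpy as np

# Made-up user-movie matrix; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 0, 0, 1],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 4],
], dtype=float)

def estimate(user, movie):
    # (1) Rank the other users by a crude similarity: how many movies
    # they have rated in common with this user.
    seen = ratings > 0
    overlap = (seen & seen[user]).sum(axis=1).astype(float)
    overlap[user] = -1  # exclude the user themselves
    # (2) Walk the neighbors from most to least similar and output the
    # first rating found for this movie.
    for neighbor in np.argsort(-overlap):
        if overlap[neighbor] < 0:
            continue
        if ratings[neighbor, movie] > 0:
            return ratings[neighbor, movie]
    return ratings[ratings[:, movie] > 0, movie].mean()  # fallback

print(estimate(0, 2))  # estimate user 0's rating for movie 2
```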
9. Classification III: music genre classification
- So far we have assumed that any training sample can easily be described by a feature vector. Such a representation is a luxury.
- How do you represent a 3-minute song?
- Goal: to build an excellent classifier outside our comfort zone we must use sound-based features. A multiclass problem: jazz, classical, country, pop, rock, metal.
- Get the music data -> look at the music: matplotlib's specgram() draws the spectrogram, and the fast Fourier transform (FFT) decomposes the music into sine-wave components -> build a classifier on FFT features (a sketch follows): the confusion matrix (its figure clearly shows where to focus optimization) and the ROC curve (receiver operating characteristic) -> improve the classification with Mel-frequency cepstral coefficients (MFCC).
After reading a few papers on AMGC (automatic music genre classification, a subfield of music information retrieval), the author found a great deal of prior work on the problem. (We need to take the initiative to gather information.)
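A sketch of the FFT baseline feature; the file path is hypothetical. (For the MFCC upgrade the book uses the talkbox scikit; librosa.feature.mfcc is a modern alternative.)

```python
import numpy as np
import scipy.io.wavfile

def fft_features(wav_path, n=1000):
    """Return the first n FFT magnitude components of a song."""
    sample_rate, data = scipy.io.wavfile.read(wav_path)
    if data.ndim > 1:            # mix stereo down to mono
        data = data.mean(axis=1)
    return np.abs(np.fft.fft(data))[:n]

# Usage (hypothetical path into a GTZAN-style genre dataset):
# features = fft_features("genres/jazz/jazz.00000.wav")
```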
10. Computer Vision: Pattern recognition
- Tool: mahotas, a computer vision package of traditional image-processing functions: data preprocessing, noise removal, image cleanup, contrast stretching, and more.
- Introduction to image processing: the scale-invariant feature transform (SIFT).
- Reading and displaying images: subtracting the pixel mean from an image is often useful; it helps normalize images taken under different lighting.
- Image processing basics:
- Thresholding: a very simple operation: 1 if x > threshold else 0. rgb2gray.
- Gaussian blur: often used for noise reduction; it removes details that are irrelevant to the overall layout, with different filter widths giving different effects.
- Adding salt-and-pepper noise: simulating scan noise. Putting the focus on the center.
- Pattern recognition: for historical reasons, image classification is also called pattern recognition.
- Computing image features: Haralick texture features (see the mahotas sketch after this chapter's notes). Features are computed not only for classification but also to reduce dimensionality.
- Designing your own features: one of the advantages of machine learning is that we only need to write down some candidate ideas and let the system figure out which are good and which are not.
- This example embodies a principle: a good algorithm is the easy part. You can always pick up a state-of-the-art classification method. The real secret and the added value usually lie in feature design and feature engineering; this is where knowledge of the data itself pays off.
- Classifying on a harder dataset.
- Local feature representations:
- Computed at random positions, on a grid, or at detected interest points in the image (keypoint detection).
- Bag-of-words model with visual words: a group of image regions that look alike is called a visual word.
- Every image can then be represented by a feature vector of the same length.
- (Currently the strongest results in image processing come from CNNs.)
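A small mahotas sketch of the operations above; the image path is hypothetical, and the rgb2gray/otsu/haralick calls follow mahotas' API:

```python
import numpy as np
import mahotas as mh

# Hypothetical input image; any RGB photo will do.
image = mh.imread("scene.jpg")
gray = mh.colors.rgb2gray(image, dtype=np.uint8)

# Thresholding: 1 if pixel > threshold else 0; Otsu picks the threshold.
thresh = mh.thresholding.otsu(gray)
binary = gray > thresh

# Gaussian blur: a wider sigma removes more layout-irrelevant detail.
blurred = mh.gaussian_filter(gray, sigma=8)

# Haralick texture features: 13 numbers per direction; averaging the
# 4 directions gives one compact descriptor to feed a classifier.
haralick = mh.features.haralick(gray).mean(axis=0)
print(haralick.shape)  # (13,)
```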
11. Dimensionality reduction
- Why dimensionality reduction is necessary:
- Superfluous features can hinder or mislead the classifier.
- More features mean more parameters to tune and a greater risk of overfitting.
- The dimensionality of the actual problem may be artificially inflated.
- Fewer dimensions mean faster training and more things you can try.
- Visualization.
- Dimensionality reduction methods: feature selection and feature extraction. Generate features, analyze them, then discard some.
- Principal component analysis (PCA), linear discriminant analysis (LDA), and multidimensional scaling (MDS).
- Two common approaches to feature selection:
- Filters:
- Correlation (linear and nonlinear); mutual information (which depends not on the data sequence but on the data's distribution).
- Drawback: filters throw away features that are useless on their own. In reality, some features look completely independent of the target variable, yet become useful once combined with others.
- Wrappers:
- Recursive feature elimination (RFE).
- Drawback: you must set the number of features to keep, though you can simply try several values.
- Other feature selection methods:
- Embedded methods: e.g., decision trees, L1 regularization.
- Feature Extraction:
- Linear: PCA:
- Core idea: retain the maximum variance, which makes the final reconstruction error minimal.
- Limitations: (1) it struggles with nonlinear data; KernelPCA addresses nonlinear problems. (2) it is unsupervised; consider LDA instead, which maximizes the separation between samples of different classes.
- Why PCA is usually preferred over LDA: as the number of classes grows, the samples per class become sparse and LDA degrades; also, PCA is not as sensitive as LDA to the choice of training set. (It depends on the situation!)
- Nonlinear: multidimensional scaling (MDS):
- Reduce the dimensionality while preserving the relative distances between samples as much as possible; useful when you have a high-dimensional dataset and want a visual impression.
- MDS does not care about the data points themselves; it is interested in the similarities between pairs of points.
- To use MDS well you must understand each feature; the right distance may well not be the Euclidean one.
- MDS is a useful tool for revealing similarities that are hard to see in the original feature space.
- MDS is not one algorithm but a family of algorithms.
- Feature selection and extraction are more of an art. (A small sketch of PCA and RFE follows.)
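A small sketch contrasting feature extraction (PCA) with a wrapper-style selection (RFE) on Iris; the choice of two components/features is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Extraction: project 4-D Iris onto the 2 directions of maximum
# variance (equivalently, minimum reconstruction error).
pca = PCA(n_components=2)
X2 = pca.fit_transform(iris.data)
print(pca.explained_variance_ratio_)  # variance kept per component

# Selection (wrapper): recursively drop the features an inner model
# finds least useful until 2 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(iris.data, iris.target)
print(rfe.support_)  # mask of the retained features
```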
12. Big Data
- "Big Data" does not refer to the exact amount of data, neither the number of samples nor the number of G-bytes, T-bytes, or P-bytes occupied by the data. What it means:
- Data size is faster than capacity to handle it
- In the past, some of the best-performing methods and techniques need to be re-made, because their ability to scale is not good
- Your algorithm cannot assume that all data can be loaded into memory
- Managing the data itself becomes a major task
- Using a computer cluster or multi-core processor is a necessity, not a luxury.
- Tool
- Python jug, a small Python framework that manages computations that take advantage of multicore or host computers.
- Cloud service platform, Amazon Web services platform, AWS.
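A minimal jug sketch; the task functions are made up. (jug itself is by Luis Pedro Coelho, one of the book's authors.)

```python
# jugfile.py
from jug import TaskGenerator

@TaskGenerator
def double(x):          # a stand-in for an expensive computation
    return 2 * x

@TaskGenerator
def add_all(values):
    return sum(values)

# This only builds the task graph; nothing is computed yet.
results = [double(i) for i in range(100)]
total = add_all(results)

# Run with:  jug execute jugfile.py
# Launch the same command in several shells (or cluster jobs): jug
# memoizes each task's result and splits the pending work among them.
```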
13. More machine learning knowledge:
- Online courses: Andrew Ng's Machine Learning.
- Reference books:
- Pattern Recognition and Machine Learning (Christopher M. Bishop, Springer)
- Machine Learning: A Probabilistic Perspective (K. Murphy, The MIT Press)
- Q&A sites:
- MetaOptimize
- Cross Validated, a statistics Q&A site that often covers machine learning as well.
- The book's companion site: http://www.twotoreal.com/ (there is a new edition).
- Blogs:
- Machine Learning (Theory): http://hunch.net/
- John Langford's blog (he is the driving force behind Vowpal Wabbit, http://hunch.net/~vw/). Pace: about one post per month.
- Text and data mining by practical means: http://textanddatamining.blogspot.de
- Edwin Chen's blog: http://blog.echen.me
- About one post per month, on quite practical topics.
- Machined Learnings: http://www.machinedlearnings.com
- About one post per month, practical topics, usually centered on learning from big data.
- FlowingData: http://flowingdata.com
- About one post per day, mostly working through statistics problems.
- Normal Deviate: https://normaldeviate.wordpress.com/
- About one post per month, discussing the theoretical side of practical problems.
- Simply Statistics: http://simplystatistics.org
- Several posts per month, focusing on statistics and big data.
- Statistical Modeling, Causal Inference, and Social Science: http://andrewgelman.com
- About one post per day; the author's use of statistical principles to call out the failings of the popular media is great fun.
- Data resources:
- UCI Machine Learning Repository
- Competitions: Kaggle