1. Scenario analysis: A. Data exploration (size of the data, missing or garbled values, ETL operations, field types, and whether the target column is present)
B. Scenario abstraction (mining the existing data for business scenarios the algorithms can be applied to). Machine learning is mainly used for four kinds of problems: binary classification, multi-class classification, clustering, and regression.
C. Algorithm selection (narrowing down the candidate algorithms, then trying several of them and analyzing the results from multiple angles to find the one best suited to the business)
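Step A starts with getting to know the data. A minimal data-exploration sketch with pandas (the toy columns, including the `churn` target, are purely illustrative):

```python
import pandas as pd
import numpy as np

# Toy dataset standing in for raw business data (hypothetical columns).
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 52],
    "income": [3200, 4100, 3900, np.nan, 6100],
    "churn":  [0, 0, 1, 1, 0],          # the target column
})

print(df.shape)                # size of the data
print(df.dtypes)               # field types
missing = df.isna().sum()      # missing values per column
print(missing)
has_target = "churn" in df.columns
print(has_target)
```

In a real project this stage would also cover garbled values and the ETL operations mentioned above; the point is simply to know the data before abstracting a scenario from it.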
2. Data preprocessing: sampling, de-noising, normalization to (0, 1), and data filtering. If data mining is a dish, preprocessing is washing and chopping the vegetables; doing this step badly spoils the whole dish.
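Min-max normalization to the (0, 1) range, for example, is a simple linear rescaling (a NumPy sketch; the sample values are arbitrary):

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array linearly into the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

scaled = min_max_scale([10, 20, 30, 50])
print(scaled)  # all values now lie between 0 and 1
```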
3. Feature engineering: feature abstraction (turning source data into data an algorithm can understand), feature importance evaluation, feature derivation (deriving new, more valuable features from existing ones), and feature dimensionality reduction. Principal component analysis (PCA) maps high-dimensional data into a low-dimensional space through a linear projection; linear discriminant analysis (LDA) is another common linear method.
Common feature-abstraction cases: timestamps, binary-valued features, multi-valued ordered features (encoded as ordered integers), multi-valued unordered features (one-hot encoding, since integer codes would impose a false order and lose information), text, and image or speech data (first converted into a matrix representation).
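These encoding choices can be sketched with pandas (all column names and category orders here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2021-01-03", "2021-02-14"]),
    "size":   ["small", "large"],   # multi-valued, ordered
    "city":   ["Paris", "Tokyo"],   # multi-valued, unordered
})

# Timestamp -> numeric components an algorithm can use.
df["signup_month"] = df["signup"].dt.month

# Ordered categories -> integer codes that preserve the order.
order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(order)

# Unordered categories -> one-hot columns (no artificial order imposed).
df = pd.get_dummies(df, columns=["city"])
print(df.columns.tolist())
```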
4. Model building, evaluation, and tuning
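A minimal sketch of this step with scikit-learn on synthetic data (the model choice, dataset, and parameter grid are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data stands in for real business data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Tuning: grid-search the regularization strength with cross-validation.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(X_tr, y_tr)                 # model building

acc = grid.score(X_te, y_te)         # evaluation on held-out data
print(grid.best_params_, acc)
```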
5. Results output and analysis
General algorithms
Deep learning
Backpropagation, also known as the BP algorithm, is the core supervised learning algorithm for neural networks, and its central idea is the chain rule of differentiation. BP is typically used to solve the optimization problem in neural networks: unlike shallow models, which can often be solved for directly, a deep network relies on BP to compute the gradient of each layer iteratively via the chain rule.
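The chain rule at the heart of BP can be illustrated on a two-weight toy network, checking the hand-derived gradient against a finite-difference estimate (all numbers here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny two-layer net: x -> w1 -> sigmoid -> w2 -> output.
x, t = 0.5, 1.0            # input and target
w1, w2 = 0.8, -0.3         # weights

# Forward pass.
h = sigmoid(w1 * x)
y = w2 * h
loss = 0.5 * (y - t) ** 2

# Backward pass: each gradient is a product of local derivatives (chain rule).
dL_dy  = y - t
dL_dw2 = dL_dy * h
dL_dh  = dL_dy * w2
dL_dw1 = dL_dh * h * (1 - h) * x   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

# Sanity check: dL/dw1 against a finite-difference estimate.
eps = 1e-6
loss_p = 0.5 * (w2 * sigmoid((w1 + eps) * x) - t) ** 2
numeric = (loss_p - loss) / eps
print(dL_dw1, numeric)
```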
The core idea of the autoencoder is to train a function f such that f(x) is approximately equal to x, i.e., to obtain a function that makes the output as close to the input as possible, forcing the network to learn a compressed representation in between.
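A minimal linear autoencoder sketch in NumPy, trained by gradient descent so that the reconstruction approaches the input (the sizes, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy data with 4 features

# Linear autoencoder: encode 4 -> 2, decode 2 -> 4, train f so f(X) ~ X.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
lr = 0.01

def loss(W_enc, W_dec):
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

first = loss(W_enc, W_dec)
for _ in range(500):
    H = X @ W_enc                      # code (compressed representation)
    R = H @ W_dec                      # reconstruction
    G = 2 * (R - X) / X.size           # gradient of the mean-squared error
    grad_dec = H.T @ G                 # chain rule, decoder weights
    grad_enc = X.T @ (G @ W_dec.T)     # chain rule, encoder weights
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(first, loss(W_enc, W_dec))       # reconstruction error should drop
```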
Below is a systematic summary of common machine learning algorithms and common deep learning structures:
Machine Learning Algorithms:
Classification algorithms: KNN, NB, LR, RF, SVM, etc.
Clustering algorithms: K-means, DBSCAN
Regression algorithms: linear regression
Text analysis algorithms: word segmentation (HMM), keyword extraction (TF-IDF), topic model (LDA)
Recommendation algorithms: collaborative filtering, CF (user-based UCF / item-based ICF)
Graph algorithms: label propagation, shortest path
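As one example from the list above, the shortest-path problem can be solved with Dijkstra's algorithm (a minimal pure-Python sketch; the toy graph is illustrative):

```python
import heapq

def dijkstra(graph, start):
    """Shortest distances from start in a weighted graph (adjacency dict)."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                       # stale heap entry, skip
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

g = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 6}, "C": {"D": 3}}
d = dijkstra(g, "A")
print(d)   # {'A': 0, 'B': 1, 'C': 3, 'D': 6}
```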
Commonly used dimensionality-reduction methods aim to keep the remaining features as independent as possible: they reduce correlation (and with it computational noise) and remove fields that carry little or no meaning, cutting unnecessary interference.
Common deep learning structures: the deep neural network (DNN); the convolutional neural network, CNN (convolution, down-sampling/pooling, fully connected layers), mainly used for spatial data, with a uniform input-layer format; and the recurrent neural network (RNN), commonly used for sequential behavior, whose input-layer format need not be uniform.
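PCA, the linear-projection method mentioned above, can be sketched as a centered SVD in NumPy (the synthetic data is illustrative: 3-D points lying near a 2-D plane):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-D data that actually lies close to a 2-D plane (correlated features).
Z = rng.normal(size=(200, 2))
X = Z @ np.array([[1.0, 0.5, 0.2],
                  [0.0, 1.0, 0.8]]) + 0.01 * rng.normal(size=(200, 3))

# PCA: center the data, then project onto the top right-singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                      # linear map into 2-D space

# Fraction of total variance kept by the top two components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(X2.shape, round(explained, 4))
```

Because the data was built to be nearly planar, the top two components retain almost all of the variance, which is exactly the situation where dropping the third dimension removes noise rather than signal.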
In summary: the machine learning process, common algorithms, and dimensionality-reduction methods.