Machine learning / data mining / algorithm positions: summary of written-test and interview questions

Source: Internet
Author: User

1. How do you diagnose overfitting and underfitting, and how do you fix them?

A: Compare the training error with the test error. If the training error is very low but the test error is high, the model is probably overfitting; if both the training error and the test error are high, it is usually underfitting. To address overfitting, start by increasing the sample size, reducing the number of features, or reducing model complexity. A practical example is linear regression: there is no need to fit a few dozen data points with a few dozen variables. Underfitting is the opposite: check whether the model has converged, whether there are too few features, and whether the model is too simple. In addition, L1 and L2 regularization can be used to constrain the weights, and dropout can be used in neural networks so that each training pass uses a structurally different network. L1 regularization adds the sum of the absolute values of the weights to the loss function, which drives more weights to exactly 0, so the weights become sparse. L2 regularization adds the sum of the squared weights to the loss function, which spreads the weight more evenly, so the weights are smoother.
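The rule of thumb and the two penalty terms can be sketched in a few lines of Python. This is only an illustration: the function names and the thresholds `high_error` and `max_gap` are assumptions made for the example, not standard values, and real thresholds depend on the task.

```python
def diagnose_fit(train_error, test_error, high_error=0.2, max_gap=0.1):
    """Classify a model's fit from its training and test error.

    high_error and max_gap are illustrative thresholds (assumptions).
    """
    if train_error >= high_error and test_error >= high_error:
        # Both errors are high: the model cannot even fit the training data.
        return "underfitting"
    if test_error - train_error > max_gap:
        # Low training error but much higher test error.
        return "overfitting"
    return "reasonable fit"


def l1_penalty(weights, lam):
    """L1 term added to the loss: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)


print(diagnose_fit(train_error=0.02, test_error=0.30))
print(diagnose_fit(train_error=0.35, test_error=0.38))
print(l1_penalty([1.0, -2.0, 0.0], lam=0.5), l2_penalty([1.0, -2.0, 0.0], lam=0.5))
```

The regularized objective is then the data loss plus one of these penalties, with `lam` controlling how strongly the weights are constrained.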

2. How do you construct features?

A: Features are mainly constructed around the business, because the data reflects the business. For example, time features may be effective for traffic prediction but useless for text mining. So start with a statistical analysis of the data, construct features that fit the business scenario, and later consider refining features or combining them.
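As an illustration of the time-feature example, the sketch below derives a few calendar features from a timestamp. The feature names and the rush-hour window are hypothetical choices made for this example; which features actually help depends on the business scenario.

```python
from datetime import datetime


def time_features(timestamp):
    """Derive simple calendar features from a datetime (illustrative set)."""
    return {
        "hour": timestamp.hour,
        "weekday": timestamp.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": timestamp.weekday() >= 5,
        # Assumed rush-hour window; tune it to the real traffic pattern.
        "is_rush_hour": timestamp.hour in (7, 8, 9, 17, 18, 19),
    }


features = time_features(datetime(2017, 5, 1, 8, 30))
print(features)
```

Combination features can then be built from these, for example crossing `is_weekend` with `hour`.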

3. What is the meaning and derivation of logistic regression? What is the difference between logistic regression and linear regression?

A: There is not much to say about the meaning; this is one of the most basic algorithms. The derivation can start either from minimizing the loss function or from maximum likelihood. I was asked about the difference between the two in an Alibaba interview and blurted out that one does classification and the other does regression, but the deeper point may be that logistic regression is solved iteratively while linear regression can be solved directly. Corrections are welcome.
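The "direct versus iterative" contrast can be sketched for the one-feature case. The code below is a minimal illustration, not a production implementation: `fit_linear` uses the closed-form least-squares solution, while `fit_logistic` runs plain gradient descent on the average negative log-likelihood. The learning rate, step count, and toy data are assumptions.

```python
import math


def fit_linear(xs, ys):
    """Least squares for y = a*x + b, solved directly in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, steps=2000):
    """Fit p(y=1|x) = sigmoid(a*x + b) by gradient descent on the
    average negative log-likelihood (no closed form exists)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(a * x + b) - y for x, y in zip(xs, labels)]
        grad_a = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


print(fit_linear([0, 1, 2, 3], [1, 3, 5, 7]))
a, b = fit_logistic([-2, -1, -0.5, 0.5, 1, 2], [0, 0, 1, 0, 1, 1])
print(sigmoid(a * 2 + b), sigmoid(a * -2 + b))
```

The gradient `(sigmoid(a*x + b) - y) * x` is what falls out of differentiating the log-likelihood, which is why the maximum-likelihood and loss-minimization derivations lead to the same update.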

4. How do you optimize a model? How do you evaluate whether a model is good or bad?

A: Model optimization works on two fronts, the data and the model, and depends on the specific problem; for example, if the model overfits and there is too little data, consider collecting more data. Evaluation metrics differ for classification and regression. For classification, use accuracy, AUC, or a business-specific weighted formula. The AUC of the ROC curve is especially important, and you need to know how the ROC curve is actually drawn. For regression, use MSE, RMSE, or a business-specific weighted formula.
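To make "how the ROC curve is drawn" concrete, here is a minimal sketch: sort the samples by predicted score, lower the threshold one distinct score at a time, and record the false positive rate and true positive rate at each step. AUC is then the area under those points by the trapezoidal rule. The function names and toy data are assumptions, and the code assumes labels are 0 or 1 with both classes present.

```python
def roc_curve(labels, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve(labels, scores)
print(points)
print(auc(points))
```

In this toy data, 8 of the 9 positive-negative pairs are ranked correctly, and the trapezoidal area comes out to the same 8/9, which is the pairwise-ranking interpretation of AUC.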

5. How do you clean data, and how do you handle missing values?

A: Data cleaning relies mainly on analyzing the data's statistics, distributions, and missingness, with the aim of retaining as much data as possible while keeping data quality good. There are many ways to handle missing values, and the choice depends on the specific feature and the business: fill with random values, fill with the mean, or fill using a simple algorithm such as KNN or clustering. If the missing rate of a feature or a sample is too high, consider dropping it outright, depending on the case.
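A minimal sketch of two of these options, mean filling and dropping features with a high missing rate, is shown below. Missing values are represented as `None`, and the 0.5 threshold and column names are assumptions for the example.

```python
def missing_rate(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def mean_fill(column):
    """Replace missing entries with the mean of the observed entries."""
    observed = [value for value in column if value is not None]
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]


def clean(columns, max_missing_rate=0.5):
    """Drop columns whose missing rate exceeds the (assumed) threshold
    and mean-fill the remaining ones."""
    cleaned = {}
    for name, column in columns.items():
        if missing_rate(column) > max_missing_rate:
            continue  # too sparse to be worth imputing
        cleaned[name] = mean_fill(column)
    return cleaned


data = {
    "age": [20, None, 40, 30],
    "income": [None, None, None, 5000],
}
print(clean(data))
```

Mean filling only makes sense for numeric features; categorical features would need a different strategy, such as filling with the most frequent value.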

6. What do bagging and boosting mean?

A: Bagging is mainly associated with random forests. It samples with replacement, so a given sample may appear in the training sets of several trees or in none at all, and the trees can be trained in parallel. In addition, each tree draws a random subset of the original feature set to use as its candidate split features. Boosting is mainly associated with AdaBoost. Each tree is trained on the residuals of the previous trees, so training generally runs serially; each tree's training set is the entire sample set, and the features are not subsampled either. (Strictly speaking, fitting residuals describes gradient boosting; AdaBoost instead reweights the samples that earlier learners misclassified.)
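The sampling side of bagging can be sketched as follows. This only shows how each tree's rows and features might be drawn; it does not train any trees. The function names are hypothetical, and drawing the feature subset once per tree is a simplification, since standard random forests draw candidate features at each split.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement.

    Returns the drawn indices and the out-of-bag indices (rows never drawn).
    """
    chosen = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(chosen))
    return chosen, out_of_bag


def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features at random as split candidates."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the run is repeatable
features = ["f1", "f2", "f3", "f4"]
for tree_id in range(3):
    rows, oob = bootstrap_sample(10, rng)
    cols = random_feature_subset(features, 2, rng)
    print(tree_id, rows, oob, cols)
```

Because each tree's sample is drawn independently, the loop body could run in parallel, whereas a boosting loop needs the previous learner's output before it can start the next one.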

In addition, here is what some of the related positions focus on:

1. Algorithm Engineer

The work in this role depends on the company, but it always revolves around models and algorithms, which may target images, speech, text, or other business products. So the specific direction to prepare in depends on the job requirements. Take image algorithms: deep learning is hot right now, so you should know the basic convolutional neural network algorithms and read the better-known papers of recent years on image classification and image detection. If you have the means, try a framework such as Caffe or TensorFlow.

2. Machine Learning Engineer

This role is essentially the same as the algorithm engineer role and mainly serves the company's internal business, for example modeling transaction data or traffic data. You need to understand the basic machine learning algorithms, optimization methods, and related theory, and some project or competition experience on top of that is better. Experience with Spark is a plus.

3. Big Data Platform Engineer

This role focuses on platform development: the company builds a platform, and its machine learning engineers run their models on it. It involves more distributed systems work and not much algorithm work.

4. Data Mining Engineer

This role depends heavily on the company: some companies do modeling work, while others do data analysis or ETL work, so be sure to ask clearly during the interview.

5. Data Analysis Engineer

As the title suggests, this role mainly does statistical analysis of data. To be honest, a very important job before modeling is gaining a full understanding of your data, and a machine learning engineer can generally do the data analysis too; otherwise, handling one problem involves too many handoffs and becomes a real hassle. To prepare, you can start with statistical analysis and visualization in, for example, the R language. Little algorithm work should be involved.

6. ETL Engineer

Many companies need this role. It mainly handles data preprocessing, including data cleaning, collation, and calibration, which is tedious but very important. You can start with languages such as SQL.
