[Editor's introduction] This popular Kaggle article was written by data scientist Abhishek Thakur. The author distills his experience from more than 100 machine learning competitions, walking through the difficulties one may run into in the machine learning process, framed around his model pipeline, and offering his own solutions. He also lists the libraries, algorithms, and machine learning frameworks he typically uses, which makes the piece a useful reference. "The article covers almost all the problems machine learning faces," the author says.
After being posted on LinkedIn, the article was quickly reposted to Kaggle and Hacker News, where it sparked heated discussion. On Hacker News, some commenters argued that the author approaches machine learning purely from a data scientist's point of view, so the method has its limitations. Others noted that actually using the machine learning framework the author proposes requires a very large amount of data.
The full text of the translated article follows:
Abhishek Thakur: Data scientists deal with data loading problems every day. Some researchers say that 60-70% of their time goes into data cleaning, processing (filtering), and conversion, so that machine learning models can use the data. This article focuses on the second part: applying the data to machine learning models, including the preprocessing steps.
The pipelines discussed in this article are the sum of the hundreds of machine learning competitions I have taken part in. It is worth emphasizing that although the discussion is general, it is very useful; at the same time, the article also covers some complex methods that exist and are used by professionals.
Disclaimer: We use Python.
Before a machine learning model can be used, the data must be converted into tabular form. This is the most time-consuming and most difficult part; the process is as follows:
The machine learning model is then trained on this tabular data. Tabular data is the most common way of representing data in machine learning and data mining: we have a data table whose rows are the different samples, and we denote the data by X and the labels by Y. The labels can occupy a single column or multiple columns, depending on the type of problem being solved. Here, we will use X to represent the data and Y to represent the labels.
Type of label
These tags define the problem to be solved and can have different forms:
Single column, binary values (classification: each sample belongs to one class, and there are only 2 classes)
Single column, real values (regression: predicting a single value)
Multiple columns, binary values (classification: each sample belongs to one class, but there are 2 or more classes)
Multiple columns, real values (regression: predicting multiple values)
Multiple labels (classification: one sample can belong to several classes at once)
For any machine learning problem, we need to know how we will evaluate our results, i.e., what the evaluation metric and objective are. For skewed binary classification problems, we usually choose the area under the Receiver Operating Characteristic curve (ROC AUC, or simply AUC).
For multi-label and multiclass classification problems, we generally choose categorical cross-entropy, i.e., multiclass log loss; for regression problems, mean squared error.
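These metrics are all available in scikit-learn. A minimal sketch with tiny made-up numbers (the data here is illustrative, not from the article):

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error, roc_auc_score

# Toy binary classification: true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# AUC for (possibly skewed) binary classification.
auc = roc_auc_score(y_true, y_prob)

# Log loss (cross-entropy), which also generalizes to multiclass problems.
ll = log_loss(y_true, y_prob)

# Mean squared error for regression.
mse = mean_squared_error([2.0, 3.0], [2.5, 2.5])

print(round(auc, 2), round(mse, 2))  # 0.75 0.25
```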
Viewing and processing data: pandas
Various machine learning models: scikit-learn
The best gradient boosting library: XGBoost
Neural networks: Keras
Plotting data: Matplotlib
Monitoring progress: tqdm
I don't use Anaconda. It's easy to use, but I want more freedom.
Machine Learning Framework
In 2015, I designed a framework for automatic machine learning. It is still under development, but will be released soon. The basic framework is shown below:
In the framework shown, pink represents the most commonly used path. After we have extracted and reduced the data to tabular form, we can proceed to building machine learning models.
The very first step is to identify the problem, which can be determined from the labels. You first need to be clear whether the problem is binary classification, multiclass or multi-label classification, or regression. Once the problem is identified, we split the data into two parts: a training set and a validation set, as described below.
The split of the data into training and validation sets must be done according to the labels. For any classification problem, use a stratified split. In Python, this is easy to do with scikit-learn.
For regression tasks, a simple k-fold split should suffice, although some more complex methods exist that try to keep the label distribution consistent between the training and validation data.
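The author's original snippet is not reproduced here; a minimal sketch of both splits using scikit-learn's built-in splitters (the variable names are mine):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: X holds features, y_class binary labels, y_reg continuous targets.
X = np.arange(20).reshape(10, 2)
y_class = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_reg = np.linspace(0.0, 1.0, 10)

# Classification: stratified folds preserve the label distribution per fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y_class):
    # Each validation fold keeps the 50/50 class balance of y_class.
    assert y_class[valid_idx].mean() == 0.5

# Regression: a plain k-fold split is usually enough.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = [(tr, va) for tr, va in kf.split(X)]
print(len(folds))  # 5
```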
In the example above, I chose eval_size, the fraction of the data used for validation, to be 10%, but you can choose this value based on the amount of data you have.
Once the split is done, put the validation set aside and do not touch it. Any operation applied to the training set must be saved and then applied to the validation set. The validation set should never, under any circumstances, be mixed with the training set. If you stick to this, you will get back a trustworthy evaluation score; otherwise, you may be building a fairly useless, overfitted model.
The next step is to identify the different types of variables in the data. Usually we deal with three kinds: numerical variables, categorical variables, and variables containing text.
Let's take the popular Titanic dataset as an example:
Here, the label is survival. In the previous step we already separated the labels from the training data. Then we have pclass, sex, and embarked: these variables have different levels, so they are categorical variables. Other variables, such as age, sibsp, and parch, are numerical variables. Name is also a variable, but based on previous studies, I don't think it is useful for predicting survival.
First, separate out the numerical variables. These variables do not require any processing, and standard machine learning models can use them directly.
We have two ways to deal with categorical variables:
Convert the categorical data to integer labels (label encoding)
Convert the labels to binary variables (one-hot encoding)
Remember to use LabelEncoder to convert the categories to numbers before applying one-hot encoding.
Since the Titanic data does not have a good example of a text variable, let's formulate a general rule for handling text variables: concatenate all the text variables into one, and then apply some algorithm to convert the text into numbers.
The text variables can be concatenated as follows:
We can then apply CountVectorizer or TfidfVectorizer to the result:
TfidfVectorizer has always performed better than CountVectorizer for me, and in my experience the following parameters work almost every time:
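The author's exact parameter snippet is not reproduced above; a sketch in the same spirit (word n-grams, sublinear tf, smoothed idf), with a toy corpus and illustrative values of my own choosing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (the Titanic data has no rich text column).
docs = [
    "machine learning competitions are fun",
    "feature engineering wins machine learning competitions",
    "deep learning needs a lot of data",
]

# Illustrative parameters, not the author's verbatim settings.
tfv = TfidfVectorizer(
    min_df=1,
    analyzer="word",
    ngram_range=(1, 3),   # unigrams up to trigrams
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=True,    # 1 + log(tf) instead of raw term frequency
)
X_text = tfv.fit_transform(docs)
print(X_text.shape[0])  # 3 documents
```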
If you fit these vectorizers only on the training set, make sure you save them to disk so you can apply them to the validation set later.
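One common way to do this is with joblib; a sketch (the file path and corpus here are made up):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on training text only.
train_docs = ["save the fitted transformer", "reuse it later on held-out data"]
tfv = TfidfVectorizer().fit(train_docs)

# Persist the fitted vectorizer so held-out data sees the same vocabulary.
path = os.path.join(tempfile.mkdtemp(), "tfv.pkl")
joblib.dump(tfv, path)

# Later (at validation time), load it back and transform the new text.
tfv_loaded = joblib.load(path)
new_features = tfv_loaded.transform(["reuse the transformer"])
print(new_features.shape[0])  # 1
```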
Next comes the stacker module. The stacker here is not a model stacker but a feature stacker: after the processing steps above, the different features can be combined in the stacker module.
Before proceeding to the next step, you can stack all the features horizontally with numpy's hstack or scipy's sparse hstack, depending on whether the features are dense or sparse.
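A minimal sketch of both cases (the feature blocks are random placeholders standing in for, say, numerical columns and tf-idf output):

```python
import numpy as np
from scipy import sparse

# Two dense feature blocks, e.g. numerical features and one-hot columns.
num_feats = np.random.rand(5, 3)
cat_feats = np.random.rand(5, 2)

# Dense features: stack horizontally with numpy.
X_dense = np.hstack((num_feats, cat_feats))
print(X_dense.shape)  # (5, 5)

# Sparse features (e.g. tf-idf output): use scipy's sparse hstack instead,
# which keeps the result sparse.
text_feats = sparse.random(5, 100, density=0.1, format="csr")
X_sparse = sparse.hstack((sparse.csr_matrix(num_feats), text_feats)).tocsr()
print(X_sparse.shape)  # (5, 103)
```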
This can also be achieved with the FeatureUnion module, which is handy when the pipeline contains other processing steps such as PCA or feature selection.
Once we have all the features stacked together, we can start applying them to machine learning models. At this stage, the only models you should reach for are ensembles of trees, such as RandomForestClassifier, RandomForestRegressor, and the XGBoost classifier and regressor.
Since the features have not been normalized, we cannot apply linear models to them yet. To use linear models, first normalize the data with Normalizer or StandardScaler from scikit-learn. These normalization methods work only on dense features and do not give good results when applied to sparse features.
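A minimal sketch of the two situations. One caveat worth knowing: StandardScaler can also be applied to sparse input if centering is disabled, since subtracting the mean would destroy sparsity (the data below is a toy example):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Dense features: standard z-scoring (zero mean, unit variance per column).
scaled = StandardScaler().fit_transform(X)
assert np.allclose(scaled.mean(axis=0), 0.0)

# Sparse features: scale without subtracting the mean (with_mean=False),
# which preserves sparsity.
X_sp = sparse.csr_matrix(X)
scaled_sp = StandardScaler(with_mean=False).fit_transform(X_sp)
print(scaled_sp.shape)  # (3, 2)
```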
If the steps above yield a "good" model, we can move on to hyperparameter optimization. If the model is not good enough yet, we can improve it with the following steps:
For the sake of simplicity, I will leave out LDA and QDA transformations. For high-dimensional data, PCA is commonly used for decomposition. For other types of data, we select 50-60 components.
For text data, after converting the text to a sparse matrix, apply Singular Value Decomposition (SVD). A TruncatedSVD implementation can be found in scikit-learn.
In general, 120-200 SVD components work well for TF-IDF features. Using more components may improve performance, but not substantially, while the cost in computing power keeps increasing.
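A sketch of TruncatedSVD on tf-idf features. The corpus below is tiny and synthetic, so it uses far fewer components than the 120-200 the article suggests for real data:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Synthetic corpus standing in for a real text column.
docs = [f"document number {i} about topic {i % 5}" for i in range(50)]
X_tfidf = TfidfVectorizer().fit_transform(docs)

# n_components=10 only because the toy vocabulary is small; on real tf-idf
# matrices, values around 120-200 are the article's suggestion.
svd = TruncatedSVD(n_components=10, random_state=42)
X_svd = svd.fit_transform(X_tfidf)
print(X_svd.shape)  # (50, 10)
```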
After evaluating the model's performance, we scale the dataset so that linear models can be evaluated as well. The normalized or scaled features can then be fed to machine learning models or to feature selection modules.
Feature selection can be done in many ways. One of the most common is greedy feature selection (forward or backward). In greedy forward selection, we pick one feature, train a model, and evaluate its performance on a fixed evaluation metric. We keep adding or removing features one by one, recording the model's performance at each step, and then select the features with the highest scores. It must be said that this method is not perfect and needs to be changed or modified as required.
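The greedy forward variant can be sketched in a few lines (the helper function, data, and AUC scoring here are my own illustrative choices, not the author's implementation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data with a few informative features among noise.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=42)

def greedy_forward_selection(X, y, model, max_features):
    """Add one feature at a time, keeping whichever helps AUC the most."""
    selected, best_scores = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(max_features):
        scores = []
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(model, X[:, cols], y, cv=3,
                                    scoring="roc_auc").mean()
            scores.append((score, f))
        score, best_f = max(scores)   # candidate with the best CV AUC
        selected.append(best_f)
        remaining.remove(best_f)
        best_scores.append(score)
    return selected, best_scores

selected, scores = greedy_forward_selection(
    X, y, LogisticRegression(max_iter=1000), max_features=3)
print(len(selected))  # 3
```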
Other, faster feature selection methods include selecting the best features from a model: we can inspect the coefficients of a logistic regression model, or train a random forest to select the best features and then use them in other machine learning models.
Remember to keep the number of estimators small and do minimal hyperparameter tuning, so that you don't overfit.
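A sketch of model-based selection with a small random forest, using scikit-learn's SelectFromModel (the data and the median threshold are my own illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=42)

# A small forest (few estimators) ranks features without much overfitting.
rf = RandomForestClassifier(n_estimators=30, random_state=42)
selector = SelectFromModel(rf, threshold="median").fit(X, y)

# Keep only features whose importance is above the median importance.
X_reduced = selector.transform(X)
print(X_reduced.shape[1])
```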
Feature selection can also be done with gradient boosting machines. It works well if we use XGBoost instead of the GBM implementation in scikit-learn, because XGBoost is faster and more scalable.
For sparse datasets, we can also use RandomForestClassifier, RandomForestRegressor, or XGBoost for feature selection.
Another popular approach is chi-squared (chi2)-based feature selection.
Here, we use chi2 together with SelectKBest to select 20 features from the data. The number of features to keep is itself a hyperparameter we can optimize to improve the model's results.
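A minimal sketch of that selection step (chi2 requires non-negative features, so the toy data is made non-negative first):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Toy data; chi2 requires non-negative features, hence the absolute value.
X, y = make_classification(n_samples=200, n_features=40, random_state=42)
X = np.abs(X)

# Keep the 20 best features by the chi-squared statistic;
# k itself is a hyperparameter worth tuning.
skb = SelectKBest(chi2, k=20)
X_new = skb.fit_transform(X, y)
print(X_new.shape)  # (200, 20)
```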
Throughout this process, don't forget to save every transformer you fit; you will need them again on the validation set.
The next major step is the selection of the model, and the optimization of the hyper-parameters.
The following algorithms are used primarily:
· Random Forest
· Logistic Regression
· Naive Bayes
· Support Vector Machines
· K-nearest Neighbors
· Random Forest
· Linear Regression
Which parameters should I optimize? How do I choose the best-matching parameters? These are the two questions people think about most. It is impossible to answer them without experience with different models and parameter combinations on a large number of datasets. Another problem is that many people are unwilling to share this experience. Fortunately, I have quite a bit of it, and I'd like to share:
RS* means that a proper value cannot be specified in advance; use random search for these hyperparameters.
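Random search over such hyperparameters can be done with scikit-learn's RandomizedSearchCV. A sketch; the search space below is illustrative, not the values from the author's table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Hypothetical search space; replace with ranges suited to your data.
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, None],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,            # sample 5 random combinations
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)
print(sorted(search.best_params_.keys()))
```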
In my opinion, the models above already offer top performance, and we do not need to evaluate any others. One more reminder: remember to save your transformers.
Final validation is performed on the validation set.
Popular on Kaggle | Solving all machine learning challenges with a single framework