Google ran ~450,000 text classification experiments and summed up a general "model selection algorithm"
New Wisdom Meta Report
Source: developers.google.com
Compilation: Shaochen, Daming
"Guide" Google's official launch of the "text classification" tutorial. To minimize the process of selecting a text classification model, Google summed up a generic "model selection algorithm" After about 450K of text classification experiments, and attached a complete flowchart, very practical.
Text classification algorithms are at the core of many software systems that process text data at scale. For example, email software uses text classification to decide whether an incoming message goes to the inbox or gets filtered to the spam folder, and discussion forums use text classification to decide whether user comments should be flagged as inappropriate.
The following are two examples of topic classification, a task that assigns a text document to one of a predefined set of topics. Most topic classification problems are based on keywords in the text.
Topic classification is used to flag incoming spam messages, which are filtered into the junk email folder.
Another common type of text classification is sentiment analysis, whose goal is to identify the polarity of text content: the type of opinion it expresses. This can be a binary like/dislike rating, or a more granular set of options, such as a star rating from 1 to 5. Examples of sentiment analysis include analyzing Twitter posts to determine whether people liked the movie Black Panther, or inferring the general public's opinion of a new Nike brand from Walmart reviews.
This guide will teach you some key machine learning best practices for solving text classification problems. You will learn:
The high-level, end-to-end workflow for solving text classification problems using machine learning
How to choose the right model for your text classification problem
How to implement the model of your choice using TensorFlow
The text classification workflow
Here is the workflow for solving machine learning problems:
Step 1: Collect Data
Step 2: Explore your data
Step 2.5: Select a model *
Step 3: Prepare the data
Step 4: Build, train, and evaluate your model
Step 5: Tune the Hyperparameters
Step 6: Deploy the Model
Workflow for solving machine learning problems
"Note" "Select Model" is not a formal step for traditional machine learning workflow; However, choosing the right model for your problem is a key task that will clarify and streamline your work in the next steps.
The "Text classification" guide in Google's crash course on machine learning explains each step in detail and how to implement these steps with textual data. Due to space limitations, this article focuses on step 2.5: How to choose the right model based on the statistical structure of the dataset and provide a complete flowchart based on the important best practices and rules of thumb.
Step 1: Collect Data
Collecting data is the most important step in solving any supervised machine learning problem. Your text classifier can only be as good as the dataset it is built from.
If you don't have a specific problem you want to solve and are simply interested in exploring text classification, there are plenty of open source datasets available. The following GitHub repo is sufficient to meet your needs:
https://github.com/google/eng-edu/blob/master/ml/guides/text_classification/load_data.py
On the other hand, if you are tackling a specific problem, you will need to collect the necessary data. Many organizations provide public APIs for accessing their data, such as the Twitter API or the NY Times API, which you can use to find the data you want.
Here are some important things to keep in mind when collecting data:
If you are using a public API, understand its limitations before using it. For example, some APIs set a limit on the rate at which you can make queries.
The more training samples (referred to as examples in the rest of this guide), the better. This will help your model generalize better.
Make sure the number of samples for every class or topic is not overly imbalanced. In other words, each class should have a comparable number of samples.
Make sure that your samples cover the space of possible inputs, not only the common cases.
In this guide, we will use the IMDB movie review dataset to illustrate this workflow. This dataset contains movie reviews that people posted on the IMDB site, along with corresponding labels ("positive" or "negative") indicating whether the reviewer liked the movie. This is a classic example of a sentiment analysis problem.
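To make this concrete, below is a minimal sketch of loading the raw IMDB data in the spirit of the guide's load_data.py, assuming the dataset has been downloaded and extracted to an aclImdb/ directory with the standard train/{pos,neg} and test/{pos,neg} layout. The function name and seed here are illustrative, not the guide's actual code.

```python
import os
import random

def load_imdb_dataset(data_path, seed=123):
    """Load IMDB reviews from an extracted aclImdb/ directory (illustrative).

    Returns ((train_texts, train_labels), (test_texts, test_labels)),
    where label 0 = negative and 1 = positive.
    """
    texts = {'train': [], 'test': []}
    labels = {'train': [], 'test': []}
    for split in ('train', 'test'):
        for label, category in enumerate(('neg', 'pos')):
            folder = os.path.join(data_path, split, category)
            for fname in sorted(os.listdir(folder)):
                if fname.endswith('.txt'):
                    with open(os.path.join(folder, fname), encoding='utf-8') as f:
                        texts[split].append(f.read())
                    labels[split].append(label)
    # Shuffle the training set so positive and negative reviews are interleaved;
    # reseeding before each shuffle keeps texts and labels aligned.
    random.seed(seed)
    random.shuffle(texts['train'])
    random.seed(seed)
    random.shuffle(labels['train'])
    return ((texts['train'], labels['train']),
            (texts['test'], labels['test']))
```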
Step 2: Explore your data
Load the dataset
Check the data
Collect key metrics
Building and training a model is only part of the workflow. Understanding the characteristics of your data beforehand can help you build a better model. This can mean not only higher accuracy, but also needing less training data or fewer computational resources.
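As an illustration of the metrics this step collects (and that step 2.5 depends on), a few lines of Python suffice. The function name and exact metric set below are our own sketch rather than the guide's code.

```python
from collections import Counter

def get_key_metrics(texts, labels):
    """Collect the dataset statistics that drive model selection in step 2.5."""
    num_samples = len(texts)
    # Number of words per sample, with word count via a whitespace split.
    words_per_sample = sorted(len(t.split()) for t in texts)
    median_words = words_per_sample[num_samples // 2]
    return {
        'num_samples': num_samples,
        'num_classes': len(set(labels)),
        'class_distribution': Counter(labels),  # check for class imbalance
        'median_words_per_sample': median_words,
        'samples_per_word_ratio': num_samples / median_words,  # the S/W ratio
    }
```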
Step 2.5: Select a model
At this point, we have collected our dataset and gained a deep understanding of the key characteristics of our data. Next, based on the metrics we collected in step 2, we should consider which classification model to use. This means asking questions such as "How do we present text data to an algorithm that expects numeric input?" (this is called data preprocessing and vectorization), "What type of model should we use?", and "What configuration parameters should our model use?".
Thanks to decades of research, we have access to a large array of data preprocessing and model configuration options. However, this abundance of viable options greatly increases the complexity and scope of any specific problem at hand. Given that the best options may not be obvious, a naive solution would be to exhaustively try every possible option, pruning some choices through intuition. But doing so would be tremendously expensive.
In this guide, we attempt to significantly simplify the process of selecting a text classification model. For a given dataset, our goal is to find an algorithm that achieves close to the maximum possible accuracy while minimizing the computation time required for training. We ran a large number (~450K) of experiments on 12 datasets across different types of problems, especially sentiment analysis and topic classification, alternating different data preprocessing techniques and different model architectures for each dataset. This helped us identify the dataset parameters that influence the best choices.
The following model selection algorithm and flowchart are a summary of our extensive experiments.
Algorithm for data preparation and model building
1. Compute the ratio of the number of samples to the number of words per sample.
2. If this ratio is less than 1500, tokenize the text as n-grams and classify it with a simple multilayer perceptron (MLP) model (the left branch of the flowchart below; a code sketch follows this list):
A. Split the samples into word n-grams and convert the n-grams into vectors.
B. Score the importance of the vectors and select the top 20K using the scores.
C. Build an MLP model.
3. If the ratio is greater than 1500, tokenize the text as sequences and classify it with a sepCNN model (the right branch of the flowchart; a sequence-branch sketch appears later in this section):
A. Split the samples into words and select the top 20K words based on their frequency.
B. Convert the samples into word sequence vectors.
C. If the original ratio of the number of samples to the number of words per sample is less than 15K, using a fine-tuned pre-trained embedding with the sepCNN model will likely provide the best results.
4. Measure model performance with different hyperparameter values to find the best model configuration for the dataset.
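To make the left branch concrete, here is a minimal sketch of steps 2a-2c, assuming scikit-learn for vectorization and feature selection and TensorFlow's Keras API for the MLP. The settings (uni+bigrams, TF-IDF, f_classif scoring, top-20K selection) mirror the flowchart's recommended options, but the exact layer sizes and hyperparameters are illustrative.

```python
import numpy as np
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

TOP_K = 20000  # keep the 20K most informative n-grams (step 2b)

def ngram_vectorize(train_texts, train_labels, val_texts):
    """Steps 2a-2b: TF-IDF uni+bigram vectors, then top-20K feature selection."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), analyzer='word',
                                 min_df=2, dtype=np.float32)
    x_train = vectorizer.fit_transform(train_texts)
    x_val = vectorizer.transform(val_texts)
    # Score every feature with an ANOVA F-test and keep the best TOP_K.
    selector = SelectKBest(f_classif, k=min(TOP_K, x_train.shape[1]))
    selector.fit(x_train, train_labels)
    return (selector.transform(x_train).astype(np.float32).toarray(),
            selector.transform(x_val).astype(np.float32).toarray())

def build_mlp(input_dim, dropout_rate=0.2):
    """Step 2c: a small two-layer MLP with a sigmoid output for binary labels."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dropout(dropout_rate, input_shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```

For the IMDB dataset used in this guide, this is the branch the algorithm selects (see the ratio computed in step 2).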
In the flowchart below, yellow boxes represent data and model preparation processes. Gray and green boxes represent the options we considered for each process, with green boxes indicating our recommended option for each process.
You can use this flowchart as a starting point for your first experiment because it allows you to get good accuracy at low computational costs. You can continue to refine the initial model in later iterations.
Text classification flowchart
This flowchart answers two key questions:
What kind of learning algorithm or model should we use?
How should we prepare the data to effectively learn the relationship between text and tags?
The answer to the second question depends on the answer to the first: the way we preprocess the data will depend on the model we choose. Models can be broadly divided into two categories: models that use word-ordering information (sequence models), and models that treat text simply as a "bag" (an unordered set) of words (n-gram models).
Sequence models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants. N-gram models include logistic regression, simple multilayer perceptrons (MLPs, i.e., fully connected neural networks), gradient boosted trees, and support vector machines (SVMs).
In our experiments, we observed that the ratio of the "number of samples" (S) to the "number of words per sample" (W) correlates with which model performs well.
When this ratio is small (<1500), a small multilayer perceptron that takes n-grams as input (option A) performs better, or at least as well as a sequence model. MLPs are easy to define and understand, and they take far less computation time than sequence models.
When this ratio is large (>= 1500), we use a sequence model (option B). In the steps that follow, you can skip ahead to the sections relevant to your chosen model, based on the value of this ratio.
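For the sequence branch (option B), here is a matching minimal sketch, again assuming TensorFlow's Keras API: tokenize with a 20K-word vocabulary, pad every sample to a fixed length, and classify with a small depthwise-separable CNN (sepCNN). The sequence length, filter counts, and embedding size are illustrative.

```python
import tensorflow as tf

TOP_K = 20000               # vocabulary size (step 3a)
MAX_SEQUENCE_LENGTH = 500   # pad/truncate every sample to this many tokens

def sequence_vectorize(train_texts, val_texts):
    """Steps 3a-3b: convert texts to fixed-length sequences of word indices."""
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=TOP_K)
    tokenizer.fit_on_texts(train_texts)
    x_train = tokenizer.texts_to_sequences(train_texts)
    x_val = tokenizer.texts_to_sequences(val_texts)
    x_train = tf.keras.preprocessing.sequence.pad_sequences(
        x_train, maxlen=MAX_SEQUENCE_LENGTH)
    x_val = tf.keras.preprocessing.sequence.pad_sequences(
        x_val, maxlen=MAX_SEQUENCE_LENGTH)
    return x_train, x_val

def build_sepcnn(num_features=TOP_K, embedding_dim=128):
    """A small sepCNN: embedding, two separable-conv blocks, sigmoid output."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(num_features, embedding_dim,
                                  input_length=MAX_SEQUENCE_LENGTH),
        tf.keras.layers.SeparableConv1D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.SeparableConv1D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```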
For our IMDB review dataset, the ratio of the number of samples to the number of words per sample is ~144. This means that we will build an MLP model.
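Putting the rule of thumb into code: assuming IMDB's 25,000 training samples and a median of roughly 174 words per sample (which is where the ~144 ratio above comes from), the branch choice reduces to a one-line check.

```python
def choose_branch(num_samples, median_words_per_sample):
    """Apply the S/W rule of thumb from the flowchart."""
    ratio = num_samples / median_words_per_sample
    return 'MLP on n-grams' if ratio < 1500 else 'sepCNN on sequences'

# IMDB: 25,000 samples / ~174 words per sample -> ratio ~144 -> MLP branch.
print(choose_branch(25000, 174))  # MLP on n-grams
```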
Step 3: Prepare the data
Step 4: Build, train, and evaluate your model
Step 5: Tune the Hyperparameters
Step 6: Deploy the Model
Conclusion
Text classification is a fundamental problem in machine learning and appears in a wide variety of product applications. In this guide, we broke the text classification workflow down into several steps. For each step, we recommended a customized approach based on the characteristics of your specific dataset. In particular, based on the ratio of the number of samples to the number of words per sample, we recommended the type of model that will get you closest to optimal performance. The other steps follow from this model choice. Following the recommendations in this guide, and referring to the code and the flowchart in the appendix, will help you learn and understand text classification, and quickly arrive at solutions to these problems.
"Text Classification" guide address:
https://developers.google.com/machine-learning/guides/text-classification/