This article is part of a series:
1. Language Model
2. Attention Is All You Need (Transformer) Principle Summary
3. Elmo Parsing
4. openai GPT Parsing
5. Bert Parsing

1. Preface
Before this article, we have already introduced two successful models, Elmo and GPT. Today we introduce the new Bert model released by Google. It outperforms many systems that rely on task-specific architectures and sets new state-of-the-art results on 11 NLP tasks.
2. Bert Principle
The full name of the Bert model is Bidirectional Encoder Representations from Transformers, and it is a new language model: it jointly conditions on both left and right context in all layers of a bidirectional Transformer in order to pre-train deep bidirectional representations.
To understand the Bert model, you must first understand language models. Pre-trained language models play an important role in many natural language processing problems, such as the SQuAD question answering task, named entity recognition, and sentiment analysis. Currently, there are two main strategies for applying pre-trained language models to NLP tasks: one is the feature-based approach, such as the Elmo model; the other is the fine-tuning approach, such as openai GPT. These two types of language models have their own advantages and disadvantages, and Bert appears to integrate the advantages of both, which is why it achieves the best results on so many downstream tasks.
2.1 Overall structure of the Bert Model
Bert is a multi-layer bidirectional Transformer encoder that is fine-tuned on downstream tasks. The Transformer block is the same as in the original Transformer. Two versions of the Bert model are released, and in both versions the feed-forward size is set to 4H:
Bert_base: L = 12, H = 768, A = 12, total parameters = 110M
Bert_large: L = 24, H = 1024, A = 16, total parameters = 340M
Here the number of layers (that is, Transformer blocks) is L, the hidden size is H, and the number of self-attention heads is A.
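To make the two configurations concrete, here is a minimal sketch in Python; the dataclass and field names are assumptions for illustration only, not an official API, and simply encode L, H, A and the 4H feed-forward size mentioned above:

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int    # L: number of Transformer blocks
    hidden_size: int   # H: hidden size
    num_heads: int     # A: number of self-attention heads

    @property
    def feed_forward_size(self) -> int:
        # Both released versions set the feed-forward (intermediate) size to 4H.
        return 4 * self.hidden_size

bert_base = BertConfig(num_layers=12, hidden_size=768, num_heads=12)    # ~110M parameters
bert_large = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)  # ~340M parameters
```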
2.2 Bert model input
The input representation can encode either a single text sentence or a pair of sentences (for example, [question, answer]) in one token sequence. For a given token, the input representation is the sum of three embeddings (a minimal sketch of this sum follows the list below):
Token embeddings are the word vectors. The first token is always the special CLS mark, whose final hidden state can be used for subsequent classification tasks; for non-classification tasks, this vector can be ignored;
Segment embeddings are used to distinguish the two sentences, because pre-training includes not only language modeling but also a classification task that takes a pair of sentences as input;
Position embeddings are learned by the model rather than fixed.
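The following is a minimal sketch of this embedding sum in PyTorch; the class name, vocabulary size, and maximum sequence length are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings (sizes are illustrative)."""
    def __init__(self, vocab_size=30000, hidden_size=768, max_len=512, type_vocab_size=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)       # word vectors
        self.segment = nn.Embedding(type_vocab_size, hidden_size)  # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, hidden_size)       # learned positions, not sinusoidal
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)
```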
2.3 Bert model pre-training task
The Bert model uses two new unsupervised prediction tasks for pre-training: Masked LM and Next Sentence Prediction.
2.3.1 masked LM
In order to train a deep bidirectional Transformer representation, a simple method is adopted: part of the input tokens are randomly masked, and the model is asked to predict those masked tokens. This method is called masked LM (MLM). The objective of pre-training is to build a language model, and the Bert model uses a bidirectional Transformer. Why bidirectional? Because when a pre-trained language model is used for downstream tasks, we need not only the context to the left of a word but also the context to its right.
During training, 15% of the tokens in each sequence are randomly masked; unlike CBOW in word2vec, the model does not try to predict every word. MLM randomly masks some tokens of the input, and its goal is to predict the original tokens at the masked positions from their context. Unlike left-to-right language model pre-training, the MLM objective lets the representation fuse context from both the left and the right, which makes it possible to pre-train a deep bidirectional Transformer. The Transformer encoder does not know which tokens it will be asked to predict or which have been replaced by random words, so it must maintain a contextual representation for every input token. In addition, because random replacement happens for only about 1.5% of all tokens, it does not harm the model's understanding of the language.
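Below is a minimal sketch of this masking procedure in Python, assuming the 80% [MASK] / 10% random / 10% unchanged split described in the BERT paper (which also explains the ~1.5% figure: 10% of the 15% selected tokens); `vocab` is a hypothetical list of replacement tokens:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """Select ~15% of positions for prediction; of those, 80% become [MASK],
    10% become a random token, 10% are kept unchanged."""
    labels = [None] * len(tokens)      # None = position is not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok            # the model must recover the original token
            r = random.random()
            if r < 0.8:
                tokens[i] = mask_token
            elif r < 0.9:
                tokens[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return tokens, labels
```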
2.3.2 next sentence Prediction
Many sentence-level tasks, such as automatic question answering (QA) and natural language inference (NLI), require understanding the relationship between two sentences. After the corpus has been processed by the masked LM step above (15% of tokens masked), the data is randomly split into two equal-sized parts: in one part, the two sentences of each pair are consecutive in the original text; in the other part, they are not. The Transformer model is then trained to judge, for each pair, whether the second sentence actually follows the first.
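Here is a minimal sketch of how such sentence pairs could be constructed; the function and argument names are hypothetical:

```python
import random

def make_nsp_example(doc, corpus):
    """Build one next-sentence-prediction training pair (sketch).
    doc: list of consecutive sentences from one document.
    corpus: list of sentences drawn from other documents (for negatives)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"              # genuinely consecutive pair
    else:
        sent_b, label = random.choice(corpus), "NotNext"  # random, non-consecutive sentence
    return sent_a, sent_b, label
```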
2.4 model comparison
Elmo, GPT, and Bert are all models proposed in recent years; all of them achieved good results, and they complement one another.
The three models can be compared as follows:
Elmo uses a bidirectional LSTM and is applied in a feature-based way;
openai GPT uses a left-to-right (unidirectional) Transformer and is applied by fine-tuning;
Bert uses a bidirectional Transformer and is applied by fine-tuning, combining the strengths of both.
Looking back, there were also earlier important models and ideas in NLP, such as word2vec and LSTM.
Word2vec, as a milestone, had a huge impact on the development of NLP, but word2vec itself is a shallow structure, and the semantic information captured by the word vectors it trains is limited by the context window size. Some scholars therefore proposed using an LSTM language model to obtain pre-trained word vectors with long-distance dependencies. However, this language model also has its own shortcoming: it predicts a word based only on the preceding text (or only on the following text). Intuitively, we need to consider the context on both sides of a word, but the traditional LSTM model only learns information in one direction.
3. Summary
Every advance in language models drives NLP forward, from word2vec to Elmo, and from openai GPT to Bert. Through these developments we can also see that deep learning as representation learning will be applied to more and more NLP-related tasks in the future. These models can make full use of today's massive data, and we can then fine-tune more advanced models for various task scenarios to promote the implementation of AI projects.