Minimum Error Rate Training

I recently read Franz Josef Och's paper, "Minimum Error Rate Training in Statistical Machine Translation". On a first reading the paper felt foggy, like I was in well over my head. After looking at some blog posts about it online, things immediately became much clearer.

An expert's article on the web covers this paper in great detail, so I won't repeat all of it here; instead I'll talk about my own understanding.

The first question is why minimum error rate training is needed at all. From the article above, we know that learning the model's internal parameters alone does little for translation quality; minimum error rate training is used to adjust the model's weight parameters on a held-out optimization set (the tuning set). The root of the problem is that the scoring mechanism we use when generating translations (the model score) and the scoring mechanism used to evaluate translations (e.g. BLEU, NIST, ...) are different. A candidate that scores best at decoding time may turn out not to be optimal under the evaluation metric: a mismatch caused by the two different mechanisms. So we want to bring the evaluation metric into training as a supervisor, so that the translations we generate score well both under the decoder's scoring mechanism and under the final evaluation metric. That is why minimum error rate training is introduced.
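To make the mismatch concrete, here is a minimal sketch (my own illustration, not from the paper) of the decoder-side score: a log-linear model scores a candidate as a weighted sum of feature values, while BLEU/NIST judge the same candidate against reference translations. All feature values and weights below are made-up numbers.

```python
def model_score(h, lam):
    """Decoder-side score: the weighted feature sum sum_m lam[m] * h[m]."""
    return sum(w * f for w, f in zip(lam, h))

# Two hypothetical candidates with made-up feature values
# (say: language model, translation model, word penalty).
h_e1 = [-4.2, -6.1, -3.0]
h_e2 = [-5.0, -5.5, -2.0]
lam = [1.0, 1.0, 1.0]

# The decoder prefers the candidate with the higher model score ...
best = max([("e1", h_e1), ("e2", h_e2)], key=lambda c: model_score(c[1], lam))
print(best[0])  # ... but BLEU may prefer the other; MERT tunes lam to close that gap.
```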

The second question is how to introduce this evaluation metric to supervise the training and help us optimize.

Wikipedia says:

Minimum Error Rate Training

Minimum error rate training optimizes the feature weights with respect to a given optimization criterion on a second portion of the prepared data, the optimization set (tuning set). Common criteria include information entropy, BLEU, TER, and so on. This phase requires the decoder to decode the optimization set multiple times; each decoding pass produces the n highest-scoring results, and then the feature weights are adjusted. When the weights change, the ranking of the n results also changes, and the highest-scoring one, i.e. the decoding result, is used to compute the BLEU or TER score. Once a new set of weights that improves the score over the whole optimization set is obtained, the next round of decoding is performed. This is repeated until no further improvement is observed.
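The loop described above might be sketched like this; `decode_nbest` and `optimize_weights` are hypothetical stand-ins for a real decoder and for the weight-tuning step explained in the rest of this post.

```python
def mert(tuning_set, lam, n=100, max_rounds=20):
    """Outer MERT loop: decode, grow n-best pools, re-tune weights, repeat."""
    pools = {sent: set() for sent in tuning_set}    # accumulated n-best lists
    for _ in range(max_rounds):
        grew = False
        for sent in tuning_set:
            for hyp in decode_nbest(sent, lam, n):  # n highest-scoring candidates
                if hyp not in pools[sent]:
                    pools[sent].add(hyp)
                    grew = True
        if not grew:                       # no new candidates appeared: converged
            break
        lam = optimize_weights(pools, lam)          # adjust the feature weights
    return lam
```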

Before talking about how it is done, we know from the expert's article that it is the weights in the scoring formula that need adjusting. But we have to adjust a whole set of parameters, not just one; suppose there are N of them. What to do? Here is how the paper settles it. First, pick a set of initial values for the parameters to be tuned. Fix the other N-1 parameters, leaving them alone and treating them as constants, and focus on a single parameter r. The score assigned when translating a sentence then becomes a function of r alone (because everything else has become a constant), and in fact a linear one, so each candidate's score is a straight line in r. For a given source sentence, as r changes, the ranking of the candidate translations changes, and so does the selected translation. As shown in the figure:

**Note:** We always select the highest-scoring candidate. On the interval (0, r1) the chosen translation is e3, on (r1, r2) it is e1, and on the interval beyond r2 it is e2. From this graph we can see that different intervals yield different choices, and different sentences may have different intersection points.
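Since each candidate's score is a straight line a + b*r in the free parameter r (b is the candidate's feature value for the tuned parameter; a is the fixed contribution of the other weights), the intervals in the figure can be computed exactly by sweeping the upper envelope of those lines. A minimal sketch, with made-up candidates echoing the figure:

```python
def upper_envelope(cands):
    """cands: list of (name, a, b) with score(r) = a + b * r.

    Returns [(r_start, name)]: the best-scoring candidate from r_start on.
    """
    # Sort by slope; among parallel lines, the one with the higher intercept
    # comes later and will discard the others.
    cands = sorted(cands, key=lambda c: (c[2], c[1]))
    hull = []  # stack of (r_start, name, a, b)
    for name, a, b in cands:
        while hull:
            r0, _, a0, b0 = hull[-1]
            if b == b0:                  # parallel with lower/equal intercept
                hull.pop()
                continue
            r = (a0 - a) / (b - b0)      # where the new line overtakes the top
            if r <= r0:
                hull.pop()               # the top line is never optimal
            else:
                break
        hull.append((r if hull else float("-inf"), name, a, b))
    return [(r_start, name) for r_start, name, _, _ in hull]

# Three made-up candidates matching the figure's pattern (e3, then e1, then e2):
cands = [("e1", 1.0, 0.5), ("e2", -2.0, 2.0), ("e3", 3.0, -1.0)]
print(upper_envelope(cands))  # [(-inf, 'e3'), (1.33.., 'e1'), (2.0, 'e2')]
```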

Now put the lines for all the sentences into one coordinate chart, and we get a figure like the one below.


**Note:** s1 in the figure is just one sentence to be translated; what is plotted for it is its highest-scoring candidate in each region.

With this graph we now know, for each value of r, the optimal translation of every sentence.

At this stage, the evaluation metric for scoring translations has not yet been introduced. Now we start to bring it in.

At each value of r (r1, r2, ...) we compute the error statistics (for example the BLEU or NIST value) of each sentence separately, then accumulate them over the sentences to obtain the total error statistic of the whole optimization set at that r. This gives the following figure:

**Note:** Different values of r give the optimization set different total error statistics. We select the r with the minimum error statistic (minimum error rate) and take that value as approximately optimal.
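Following the post's simplified description (per-sentence errors summed over the tuning set; real MERT accumulates BLEU sufficient statistics corpus-wide instead), the minimum of this piecewise-constant error curve can be found by testing one point inside each interval between threshold values. Here `error` is a hypothetical per-candidate error function, and the envelopes come from the `upper_envelope` sketch above.

```python
def pick(env, r):
    """Best candidate at parameter value r, given [(r_start, name)] intervals."""
    name = env[0][1]
    for r_start, n in env:
        if r >= r_start:
            name = n
    return name

def best_parameter(envelopes, error):
    """Return the r minimizing the summed error over all sentence envelopes."""
    thresholds = sorted({r for env in envelopes for r, _ in env
                         if r != float("-inf")})
    # One test point left of all thresholds, then one inside each interval.
    points = [thresholds[0] - 1.0] if thresholds else [0.0]
    for i, t in enumerate(thresholds):
        nxt = thresholds[i + 1] if i + 1 < len(thresholds) else t + 2.0
        points.append((t + nxt) / 2.0)
    return min(points,
               key=lambda r: sum(error(pick(env, r)) for env in envelopes))
```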

With this step, the estimate of r is complete. The same method is then used to estimate the remaining N-1 parameters, one at a time.

Then this step is repeated, iterating until the error statistic no longer changes much. The result is a good set of parameter values.
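Putting it together, the coordinate-wise iteration might look like the sketch below; `line_search` (the single-parameter optimization above) and `corpus_error` are hypothetical stand-ins.

```python
def optimize_all(lam, line_search, corpus_error, eps=1e-4):
    """Tune one weight at a time; stop when the total error stops improving."""
    prev_err = corpus_error(lam)
    while True:
        for m in range(len(lam)):        # hold every weight except lam[m] fixed
            lam[m] = line_search(lam, m) # best value of the single free weight
        err = corpus_error(lam)
        if prev_err - err < eps:         # no significant improvement: done
            return lam
        prev_err = err
```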


At this point, we are done. Using this set of parameters, we can get good results under the translation evaluation metric.
