Classical-to-Vernacular Chinese Translation Based on Statistical Machine Translation

Source: Internet
Author: User

This document describes how to use the NiuTrans toolkit to translate between classical Chinese (wenyan) and vernacular Chinese (baihua). It assumes NiuTrans is already installed in the niutrans/ directory; all paths below are relative to that directory.

Training is divided into five stages: corpus preprocessing, word alignment, translation model training, language model training, and parameter adjustment.

I. Corpus preprocessing

The raw data we obtain is messy and must be preprocessed into a regular parallel corpus.

Preprocessing the corpus includes unifying punctuation marks, deleting irrelevant symbols, stripping whitespace before and after paragraphs and sentences, word segmentation, and other steps. The final result is two parallel files, here called src.txt and tgt.txt (when translating classical into vernacular Chinese, src is the classical text and tgt is the vernacular text; for the opposite direction the roles are swapped). Each line of either file is one natural sentence that has already been segmented into words, with the words separated by spaces. The lines of the two files correspond one to one, so both files have the same number of lines.
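The cleanup steps above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the preprocessing code actually used here: the function names and the punctuation table PUNCT_MAP are assumptions made for the example, and a real corpus will need a larger table.

```python
import re

# Hypothetical mapping: unify ASCII punctuation to full-width Chinese forms.
PUNCT_MAP = str.maketrans({",": "，", "?": "？", "!": "！", ";": "；", ":": "："})


def clean_sentence(text):
    """Unify punctuation, then collapse whitespace runs and trim both ends."""
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs):
    """Clean both sides of each pair and drop pairs with an empty side.

    Dropping the whole pair keeps the two output files line-aligned.
    """
    cleaned = []
    for src, tgt in pairs:
        src, tgt = clean_sentence(src), clean_sentence(tgt)
        if src and tgt:
            cleaned.append((src, tgt))
    return cleaned
```

Writing the first element of every cleaned pair to src.txt and the second to tgt.txt, one per line, then gives two files with the same number of lines.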

Considering the differences between classical and vernacular Chinese, the word segmentation strategy applied after sentence alignment is: vernacular text uses ordinary Chinese word segmentation, and classical text is segmented character by character (one character per token).
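Character-by-character segmentation of the classical side needs no external tool. A minimal Python 3 sketch, with segment_chars as a hypothetical helper name (the segmenter used for the vernacular side is not named in this article, so it is not shown):

```python
def segment_chars(sentence):
    """Split a sentence into single characters separated by spaces.

    Existing whitespace is discarded so that tokens are never empty.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


if __name__ == "__main__":
    print(segment_chars("学而 时习之"))
```

Punctuation marks become tokens of their own, which matches the one-token-per-character convention described above.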

II. Alignment

NiuTrans uses the third-party tool GIZA++ for word alignment. Because GIZA++ can only align in one direction, NiuTrans aligns in both directions and merges the two results to improve accuracy; the merged result is the final alignment used in later steps.

Specifically, when aligning src→tgt, GIZA++ generates the result file src2tgt.A3.final; when aligning tgt→src, it generates tgt2src.A3.final. NiuTrans then merges the A3 files of the two directions to generate the final alignment file, alignment.txt.

For the alignment steps above, see the script scripts/NiuTrans-running-GIZA++.pl.
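The idea of merging two directional alignments can be illustrated with the two simplest strategies, intersection and union. This Python 3 sketch is only an illustration: the heuristic that NiuTrans.SymAlignment actually applies is not described in this article, and the function name and the link representation are assumptions.

```python
def symmetrize(src2tgt, tgt2src):
    """Return (intersection, union) of two directional word alignments.

    src2tgt holds (source_index, target_index) links.
    tgt2src holds (target_index, source_index) links and is flipped
    so that both sets use the same orientation before comparison.
    """
    flipped = {(s, t) for (t, s) in tgt2src}
    return src2tgt & flipped, src2tgt | flipped


if __name__ == "__main__":
    both, either = symmetrize({(1, 1), (2, 3)}, {(1, 1), (2, 2)})
    print(sorted(both), sorted(either))
```

The intersection keeps only links that both directions agree on (high precision), while the union keeps every link proposed by either direction (high recall).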

Note that when GIZA++ aligns in one direction, if the source/target length ratio of a sentence pair (counting words after segmentation) is too large, GIZA++ truncates the longer side; when the same pair is aligned in the other direction (target/source), it is not necessarily truncated. This makes the two result files, src2tgt.A3.final and tgt2src.A3.final, inconsistent, and the inconsistency causes NiuTrans to exit with an error while merging them into alignment.txt. The error depends only on the source/target length ratio, not on the specific languages. So although this article is about translating between classical and vernacular Chinese, the steps actually apply to any language pair. As a consequence, if you run scripts/NiuTrans-running-GIZA++.pl as is, an error occurs once it reaches step 8 and training fails.
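To see in advance which sentence pairs are at risk, one option is to list the pairs whose length ratio is extreme. The Python 3 sketch below is only a diagnostic under assumptions: the function name and the max_ratio threshold are hypothetical, and the actual limit applied by GIZA++ is not stated in this article.

```python
def extreme_ratio_pairs(src_lines, tgt_lines, max_ratio=5.0):
    """Return 1-based numbers of sentence pairs with an extreme length ratio.

    Lengths are counted in space-separated tokens. max_ratio is an
    arbitrary illustrative threshold, not a value taken from GIZA++.
    """
    flagged = []
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        shorter, longer = sorted((len(src.split()), len(tgt.split())))
        if shorter == 0 or longer / shorter > max_ratio:
            flagged.append(number)
    return flagged
```

A pair flagged here is not guaranteed to be truncated; the list only narrows down where to look.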

Solution: after step 7 and before step 8, traverse src2tgt.A3.final and tgt2src.A3.final and remove the inconsistent sentence pairs from both A3 files and from the parallel corpus files.

The traversal algorithm is simple. First, look at the A3 file format:

# Sentence pair (20) source length 11 target length 13 alignment score : 1.14927e-38
Zhiguo requests the taishi family name to be detached from the zhizu family, and the other is supplemented by the
NULL ({ 2 }) chi ({ }) fruit ({ 1 5 6 7 8 }) do not ({ 10 }) Family ({ }) at ({ 3 }) Too ({ 4 }) shi ({ }) , ({ 9 }) is ({ }) Fu ({ 11 12 }) 's ({ 13 })

The excerpt above shows the alignment of one sentence pair. The first line describes the sentence pair, the second line is the target sentence, and the third line is the alignment result. "Sentence pair (20)" means this is the 20th sentence pair, "source length 11" means the source sentence has 11 words, and "target length 13" means the target sentence has 13 words. Traverse the sentence pairs (three lines each) in the two A3 files in parallel. If the source length in the first file equals the target length in the second file, and the target length in the first file equals the source length in the second file, the sentence pair is consistent; otherwise it is inconsistent and must be removed from both A3 files and from the parallel corpus files.

The processing code is as follows:
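The header comparison at the core of this check can be sketched in isolation. This is a minimal Python 3 illustration; parse_header and is_consistent are hypothetical helper names, and the header layout is assumed to be exactly the one shown in the excerpt above.

```python
import re

# Matches the first line of each three-line A3 record.
HEADER = re.compile(
    r"# Sentence pair \((\d+)\) source length (\d+) target length (\d+)"
)


def parse_header(line):
    """Return (pair_number, source_length, target_length) from a header line."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("not an A3 header line: %r" % line)
    return tuple(int(group) for group in match.groups())


def is_consistent(src2tgt_header, tgt2src_header):
    """True if the lengths in the two directions mirror each other."""
    _, src_len, tgt_len = parse_header(src2tgt_header)
    _, rev_src_len, rev_tgt_len = parse_header(tgt2src_header)
    return src_len == rev_tgt_len and tgt_len == rev_src_len
```

For the record above, is_consistent returns True only if the matching record in the other file reports source length 13 and target length 11.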

check.py

# -*- coding: utf-8 -*-
'''
Created on August 25, 2014

@author: wuseguang
'''
import re
import sys

print("Script name: " + sys.argv[0])
if len(sys.argv) != 5:
    print("the parameter is incorrect")
    sys.exit()

src = sys.argv[1]   # src2tgt A3 file
tgt = sys.argv[2]   # tgt2src A3 file
srcs = sys.argv[3]  # source side of the parallel corpus
tgts = sys.argv[4]  # target side of the parallel corpus

errpairs = []  # 1-based numbers of the inconsistent sentence pairs

# Pass 1: copy consistent records of both A3 files to *.check, renumbering them.
with open(src, 'r') as srcfile, open(tgt, 'r') as tgtfile, \
        open(src + '.check', 'w') as srcw, open(tgt + '.check', 'w') as tgtw:
    index = -1
    pair = 0
    flag = True  # whether the current record is being kept
    while True:
        index += 1
        print(index)
        srcline = srcfile.readline()
        tgtline = tgtfile.readline()
        if not srcline or not tgtline:
            break
        # Lines 2 and 3 of a record follow the decision made for its header.
        if index % 3 != 0 or not srcline.startswith("# Sentence pair"):
            if flag:
                srcw.write(srcline)
                tgtw.write(tgtline)
            continue
        # Header fields after splitting on non-digits:
        # [1] pair number, [2] source length, [3] target length.
        srcdata = re.split(r'\D+', srcline)
        tgtdata = re.split(r'\D+', tgtline)
        # print(srcdata)
        # print(tgtdata)
        if srcdata[2] != tgtdata[3] or srcdata[3] != tgtdata[2]:
            errpairs.append(index // 3 + 1)
            flag = False
            continue
        pair += 1
        flag = True
        oldtitle = "# Sentence pair (" + srcdata[1] + ")"
        newtitle = "# Sentence pair (" + str(pair) + ")"
        print(oldtitle)
        print(newtitle)
        srcw.write(srcline.replace(oldtitle, newtitle))
        tgtw.write(tgtline.replace(oldtitle, newtitle))

# Pass 2: drop the same sentence pairs from the parallel corpus files.
with open(srcs, 'r') as srcsfile, open(tgts, 'r') as tgtsfile, \
        open(srcs + '.check', 'w') as srcw, open(tgts + '.check', 'w') as tgtw:
    pair = 0
    errset = set(errpairs)
    while True:
        pair += 1
        srcline = srcsfile.readline()
        tgtline = tgtsfile.readline()
        if not srcline or not tgtline:
            break
        if pair in errset:
            continue
        srcw.write(srcline)
        tgtw.write(tgtline)

print(errpairs)
print(len(errpairs))

 

Place this file in the scripts directory; the overall training script shown later calls it.

Then comment out the code for step 8 in the scripts/NiuTrans-running-GIZA++.pl script. After scripts/NiuTrans-running-GIZA++.pl has run, call check.py for the consistency check, and then call the NiuTrans alignment-merging command ../bin/NiuTrans.SymAlignment.

III. Translation Model Training

Using the alignment result alignment.txt from the previous step, train the translation model. The training command is:

perl NiuTrans-phrase-train-model.pl -tmdir $workDir/model.phrase/ -s $srcFile -t $tgtFile -a $aligFile

-s specifies the source-side file of the parallel corpus, -t specifies the target-side file, and -a specifies the alignment file alignment.txt.
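Before starting training, it can help to confirm that the three input files still line up. The Python 3 sketch below assumes that alignment.txt contains one line per sentence pair, which this article does not state explicitly; line_counts and all_equal are hypothetical helper names.

```python
def line_counts(paths):
    """Return a dict mapping each path to its number of lines."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            counts[path] = sum(1 for _ in handle)
    return counts


def all_equal(counts):
    """True if every file has the same number of lines."""
    return len(set(counts.values())) <= 1
```

If all_equal returns False for the source file, the target file, and alignment.txt, some sentence pair was dropped from one file but not from the others.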

 

IV. Language Model Training

The language model scores how well-formed the output is in the target language, so it only needs to be trained on target-language text. The corpus format is the same as for the parallel corpus: one sentence per line, with words separated by spaces. The training command is as follows:

perl NiuTrans-training-ngram-LM.pl -corpus $lmodelFile -ngram 3 -vocab $workDir/lm/lm.vocab -lmbin $workDir/lm/lm.trie.data
$lmodelFile is the training corpus file for the language model; here it is named lm.txt.
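The -ngram 3 option selects a trigram model, that is, a model built from statistics over sequences of three consecutive words. The Python 3 sketch below only counts such sequences to show what is being collected; the sentence-boundary markers and the function name are assumptions, and the smoothing and storage format used by NiuTrans are not covered here.

```python
from collections import Counter


def count_ngrams(lines, n=3):
    """Count n-grams over space-segmented sentences.

    Each sentence is padded with n-1 start markers and one end marker
    (an assumption made for this illustration).
    """
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] * (n - 1) + line.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

A language model turns counts like these into probabilities of the next word given the two preceding words.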

 

V. Parameter Adjustment

In the parameter adjustment stage, the weights of the two models trained above (the translation model and the language model) are adjusted. In essence, the two models are treated as two features, and a weight is set for each feature.

The command is as follows:
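How feature weights influence the chosen translation can be shown with a weighted sum of log scores. This Python 3 sketch is a simplified illustration under assumptions: the feature names "tm" and "lm", the function names, and the numbers are hypothetical, and a real NiuTrans configuration contains more features than these two.

```python
def model_score(log_features, weights):
    """Weighted sum of log feature values for one candidate translation."""
    return sum(weights[name] * value for name, value in log_features.items())


def best_candidate(candidates, weights):
    """Return the text of the candidate with the highest weighted score.

    candidates is a list of (text, log_features) tuples.
    """
    return max(candidates, key=lambda c: model_score(c[1], weights))[0]


if __name__ == "__main__":
    candidates = [
        ("A", {"tm": -1.0, "lm": -4.0}),
        ("B", {"tm": -2.0, "lm": -2.0}),
    ]
    print(best_candidate(candidates, {"tm": 1.0, "lm": 1.0}))
    print(best_candidate(candidates, {"tm": 3.0, "lm": 0.5}))
```

With equal weights the candidate favored by the language model wins; raising the translation model weight switches the choice, which is why the weights need adjusting.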

perl NiuTrans-phrase-generate-mert-config.pl -tmdir $workDir/model.phrase/ -lmdir $workDir/lm/ -ngram 3 -o $workDir/NiuTrans.phrase.user.config

 

VI. Summary of the overall process

For convenience, I put all of the steps above into one script named train.sh and placed it in the scripts directory. Its content is as follows:

#!/bin/sh
# Usage: ./train.sh <workDir> <srcFile> <tgtFile> <lmFile>
scriptDir=$(realpath $PWD)
workDir=$(realpath $1)
srcFile=${workDir}/preprocessing/$2
tgtFile=${workDir}/preprocessing/$3
lmodelFile=${workDir}/preprocessing/$4
aligFile=$workDir/wordalignment/alignment.txt
src2tgtA3File=$workDir/wordalignment/src2tgt.A3.final
tgt2srcA3File=$workDir/wordalignment/tgt2src.A3.final
echo "script_dir is ${scriptDir}"
echo "work_dir is $workDir"
echo "src_file is ${srcFile}"
echo "tgt_file is ${tgtFile}"
echo "alignFile is $aligFile"
#exit
mkdir -p $workDir/wordalignment
mkdir -p $workDir/lm
mkdir -p $workDir/model.phrase
#exit
# Word alignment in both directions (step 8 of the script is commented out)
cd $scriptDir
perl NiuTrans-running-GIZA++.pl -src $srcFile -tgt $tgtFile -out $aligFile -tmpdir $workDir/wordalignment/
# Consistency check; writes *.check copies of the A3 and corpus files
cd $scriptDir
python $scriptDir/check.py $src2tgtA3File $tgt2srcA3File $srcFile $tgtFile
src2tgtA3File=${src2tgtA3File}.check
tgt2srcA3File=${tgt2srcA3File}.check
srcFile=${srcFile}.check
tgtFile=${tgtFile}.check
# Merge the two directional alignments
cd $scriptDir
../bin/NiuTrans.SymAlignment $tgt2srcA3File $src2tgtA3File $aligFile
# Translation model training
cd $scriptDir
perl NiuTrans-phrase-train-model.pl -tmdir $workDir/model.phrase/ -s $srcFile -t $tgtFile -a $aligFile
# Language model training
cd $scriptDir
perl NiuTrans-training-ngram-LM.pl -corpus $lmodelFile -ngram 3 -vocab $workDir/lm/lm.vocab -lmbin $workDir/lm/lm.trie.data
# Parameter adjustment: generate the configuration file
cd $scriptDir
perl NiuTrans-phrase-generate-mert-config.pl -tmdir $workDir/model.phrase/ -lmdir $workDir/lm/ -ngram 3 -o $workDir/NiuTrans.phrase.user.config

The script is straightforward, so it is not explained in detail.

Prerequisites for running this script: 1. step 8 of the NiuTrans-running-GIZA++.pl script has been commented out; 2. check.py has been placed in the scripts directory.

Running example (wenyan.txt and baihua.txt are located in the ./work/preprocessing/ directory):

./train.sh ../work/ wenyan.txt baihua.txt lm2.txt

VII. Test

Once model training is complete, you can test the model. First prepare the test file test.txt. Its format is the same as that of the parallel corpus files, and its content must not overlap with the training corpus. The test command is as follows:

perl NiuTrans-phrase-decoder-model.pl -test $workDir/test/test.txt -c $workDir/NiuTrans.phrase.user.config -output $workDir/test/Xbest.out

-test specifies the location of the test file, -c specifies the location of the configuration file generated in the previous step, and -output specifies the location of the translation result file.

Note: to output multiple translation results per sentence, modify the -nbest parameter on line 56 of the script NiuTrans-phrase-decoder-model.pl. The default value is 1.
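Whether the test file overlaps with the training corpus can be checked with a set lookup. This Python 3 sketch is a minimal illustration; overlapping_sentences is a hypothetical helper name, and the comparison is an exact match on whole lines after trimming.

```python
def overlapping_sentences(train_lines, test_lines):
    """Return the test sentences that also occur in the training corpus."""
    train = {line.strip() for line in train_lines}
    return [line.strip() for line in test_lines if line.strip() in train]
```

An empty result means that no test sentence appears verbatim in the training data; near-duplicates are not detected by this check.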
