GIZA++ run record and results comparison


This article records a complete run of GIZA++. The language model toolkit is CMU-Cam_Toolkit_v2, the decoder is isi-rewrite-decoder-r1.0.0a, the operating system is Ubuntu 9.10, and the compiler is g++ 4.4.1. This is a relatively recent environment, so the notes should be useful to anyone setting up the same tools. If anything is unclear, leave a comment. The procedure follows Liu Yang's "Using Existing Software to Build a Statistical Machine Translation System".

-----------------------------------------------

Tian Liang, University of Macau

2009-12-15 to 2009-12-22

-----------------------------------------------

This summary records each step of the run and the problems encountered along the way, and explains what each executable does. The goal is a detailed report that lets other researchers get the software running without trouble. The body of the article gives the sequence of steps that ran successfully; the appendix covers the errors encountered and describes the individual executables, followed by the related papers.

The overall structure of the article is shown below (omitted here):

1. Preparing the tools

(1) A Linux system. Here I use Ubuntu 9.10 with GCC 4.4.1.

(2) A bilingual corpus. Here we use 1000 English-Chinese sentence pairs.

(3) CMU-Cambridge Statistical Language Modeling Toolkit v2. This is the language model tool; it builds the language model that the decoder calls.

(4) GIZA++. This run uses the latest version, "giza-pp-v1.0.3.tar", which includes the auxiliary tool mkcls for generating word classes.

(5) Chinese and English tokenizers. Chinese uses ICTCLAS; English uses the tokenizer that ships with EGYPT: tokenizeE.perl.tmpl

2. Processing the corpus

1. Download the Chinese and English corpus

The 1500 sentence pairs downloaded from the Internet (http://www.nlp.org.cn/, the Chinese Natural Language Processing Open Platform) have problems: some Chinese sentences span two lines, which degrades the alignment, so I corrected them beforehand. To keep the alignment data distinct from the language model data, the first 1000 sentence pairs are used as the alignment corpus and all 1500 English sentences are used for the language model.

The source file with the 1500 sentence pairs, 1500.txt, has the following form, with each English sentence followed by its Chinese counterpart (the Chinese lines are rendered in English here):

For this reason it was often convenient to overlay the geochemical map with a geological map transparency.

For this reason, it is best to use a transparent geological map on the geochemical map.

Trains and tunnels is overlaid with the multicoloured names and slogans of youths.

Young people painted names and slogans of various colors on vehicles and tunnels.

....
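One quick way to find sentences that were split across two lines is to flag any two consecutive lines of the same kind. The sketch below is not part of the original procedure. It assumes that a line containing a lowercase ASCII vowel is English and anything else is Chinese (the same heuristic as the grep split in the next step), and the file name sample.txt and its contents are made up for illustration.

```shell
#!/bin/sh
# Sketch: report consecutive lines of the same language in an interleaved corpus.
# Assumption: a line with a lowercase ASCII vowel is English, otherwise Chinese.
workdir=$(mktemp -d)

# Hypothetical sample: the second Chinese sentence is split across lines 4 and 5.
cat > "$workdir/sample.txt" <<'EOF'
This is a book.
这是一本书。
He likes tea.
他喜欢
喝茶。
EOF

# Remember the kind of the previous line; print the line number when it repeats.
awk '{
    kind = ($0 ~ /[aeiou]/) ? "en" : "zh"
    if (NR > 1 && kind == prev) print NR ": " kind " line follows another " kind " line"
    prev = kind
}' "$workdir/sample.txt" > "$workdir/report.txt"

cat "$workdir/report.txt"
# prints: 5: zh line follows another zh line
```

Run against the real 1500.txt, an empty report would suggest that the lines alternate cleanly; any reported line number is a place to inspect by hand.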

2. Separate the corpus

Next, split the file into its Chinese and English parts, named chinese and english. On Linux, the following command extracts the English lines, that is, every line containing a lowercase vowel:

tianliang@ubuntu:~$ grep '[aeiou]' 1500.txt > english

After this step, english still contains four sentences with Chinese text; delete them manually.

tianliang@ubuntu:~$ grep -v '[aeiou]' 1500.txt > chinese

Then manually delete any English sentences with full-width punctuation that end up in chinese.

Take the first 1000 sentences of chinese and english as the input files for the following steps, and save the full 1500 separated English sentences as english1500.

The separated files have the following formats.

english

For this reason it was often convenient to overlay the geochemical map with a geological map transparency.

Trains and tunnels is overlaid with the multicolored names and slogans of youths.

...

chinese

For this reason, it is best to use a transparent geological map on the geochemical map.

Young people painted names and slogans of various colors on vehicles and tunnels.

...
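Taking the first 1000 sentences of each file can be done with head, and wc -l gives a quick check that both sides still have the same number of lines. This is only a sketch under assumptions: the output names english1000 and chinese1000 are hypothetical (the text does not name these files), and the generated numbered lines stand in for the real sentences.

```shell
#!/bin/sh
# Sketch: keep the first N lines of each side and confirm the counts match.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in data: 1500 numbered lines per side instead of the real sentences.
seq 1 1500 | sed 's/^/english sentence /' > english
seq 1 1500 | sed 's/^/chinese sentence /' > chinese

n=1000
head -n "$n" english > english1000   # hypothetical output name
head -n "$n" chinese > chinese1000   # hypothetical output name

# wc -l may pad its output with spaces on some systems, so strip them.
en_lines=$(wc -l < english1000 | tr -d ' ')
zh_lines=$(wc -l < chinese1000 | tr -d ' ')

if [ "$en_lines" -eq "$zh_lines" ]; then
    echo "aligned: $en_lines lines on each side"
else
    echo "mismatch: english=$en_lines chinese=$zh_lines" >&2
fi
```

A mismatch after the manual deletions usually means a sentence was removed from one side only, which would shift every later pair.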

3. Add sentence markers for the language model

The decoder used here, isi-rewrite-decoder, takes XML-format input and uses <s> and </s> to delimit sentences. For the decoder to recognize sentence boundaries, the language model text must be processed beforehand by adding <s> and </s> tags to every sentence in "english1500". The steps are:

(1) Run the command:

tianliang@ubuntu:~$ cp english1500 english.tag

This creates english.tag as a copy of english1500. You can of course copy the file in a graphical file manager instead. The point of this step is to keep the source file intact in case something goes wrong during editing.

(2) Open english.tag in vi and run the following two commands:

:%s#^#<s> #

:%s#$# </s>#

Note the space after <s> and the space before </s>. These commands add a marker at the head and tail of every sentence. Save and exit vi (:wq).

The resulting tagged file has the form:

<s> For this reason it was often convenient to overlay the geochemical map with a geological map transparency. </s>

<s> Trains and tunnels is overlaid with the multicolored names and slogans of youths. </s>

....
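The same two substitutions can be applied without opening an editor. The sketch below uses sed with the patterns from the vi commands in step (2). The two-line input is a stand-in for the real english1500, and because the result is written to a new file, the separate cp step is not needed in this variant.

```shell
#!/bin/sh
# Sketch: add <s> and </s> markers to every line with sed instead of vi.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for english1500 (two sentences from the corpus sample shown earlier).
cat > english1500 <<'EOF'
For this reason it was often convenient to overlay the geochemical map with a geological map transparency.
Trains and tunnels is overlaid with the multicolored names and slogans of youths.
EOF

# Same substitutions as the vi commands; the source file is left untouched.
sed -e 's#^#<s> #' -e 's#$# </s>#' english1500 > english.tag

cat english.tag
```

Using "#" as the sed delimiter, as in the vi commands, avoids having to escape the "/" in </s>.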

3. Installing the Software

1. Download the Software

Download each of the required packages:

Language model tool: CMU-Cambridge Statistical Language Modeling Toolkit v2

Download Address: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

Decoder: isi-rewrite-decoder-r1.0.0a.tar.gz

Download Address: http://www.isi.edu/natural-language/software/decoder/

http://download.csdn.net/detail/tianliang0123/3712116

Translation model tool: giza-pp-v1.0.3.tar (contains GIZA++-v2 and mkcls-v2)

Download Address: http://code.google.com/p/giza-pp/

Tokenizer toolkit: EGYPT (only its English tokenizer is used here)

Download Address: http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/

There is also a website dedicated to statistical machine translation, http://www.statmt.org/, where you can learn about the field and download the corpora and tools you need.

2. Compiling and installing the software

First, create a folder to hold the packages for this run: /home/tianliang/research

Then copy the unpacked packages into that directory. You can unpack them graphically (right-click and extract) or on the command line, for example for the EGYPT package: # tar -xzvf EGYPT.tar.gz

Finally, enter each package directory in a terminal and run "make" to compile.

The detailed procedure is as follows:

(1) Put the three unpacked packages in the research directory. The following commands create the research directory, change into it, and list the packages:

tianliang@ubuntu:~$ mkdir

tianliang@ubuntu:~$ cd

tianliang@ubuntu:~/research$ ls

CMU-Cam_Toolkit_v2 giza-pp isi-rewrite-decoder-r1.0.0a

(2) Install CMU-Cam_Toolkit_v2

Before building this package, one change is needed: go to the package's src directory, open the Makefile, and remove the "#" from the line "#BYTESWAP_FLAG = -DSLM_SWAP_BYTES".
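Removing the "#" can also be scripted. The sketch below runs on a made-up two-line Makefile fragment, not the real file. The exact spacing around "=" in the toolkit's Makefile is an assumption, so the pattern anchors only on the leading "#BYTESWAP_FLAG".

```shell
#!/bin/sh
# Sketch: uncomment the BYTESWAP_FLAG line in a Makefile.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical fragment standing in for src/Makefile.
cat > Makefile <<'EOF'
# Makefile fragment (illustrative only)
#BYTESWAP_FLAG = -DSLM_SWAP_BYTES
EOF

# Keep a backup, then drop only the leading "#" of the BYTESWAP_FLAG line.
cp Makefile Makefile.orig
sed 's/^#BYTESWAP_FLAG/BYTESWAP_FLAG/' Makefile.orig > Makefile

grep '^BYTESWAP_FLAG' Makefile
# prints: BYTESWAP_FLAG = -DSLM_SWAP_BYTES
```

Writing from the backup to a new file avoids sed's in-place option, whose syntax differs between implementations.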

tianliang@ubuntu:~/research$ cd CMU-Cam_Toolkit_v2

tianliang@ubuntu:~/research/CMU-Cam_Toolkit_v2$ cd src

tianliang@ubuntu:~/research/CMU-Cam_Toolkit_v2/src$ make install

gcc -O -DSLM_SWAP_BYTES -c -o bo_ng_prob.o bo_ng_prob.c

gcc -O -DSLM_SWAP_BYTES -c -o calc_mem_req.o calc_mem_req.c

gcc -O -DSLM_SWAP_BYTES -c -o compute_back_
