Open-Source Tools in Statistical Machine Translation Systems

Source: Internet
Author: User

(Adapted from Computer World, October 22, 2007, page B15)

 

The significance of open-source tools for statistical machine translation deserves some explanation. Brown et al. proposed the IBM models in the early 1990s, yet the models only came into widespread use and study after 1999! The reason is the emergence of open-source toolkits. They lowered the barrier to entry for research, so that we can truly stand on the shoulders of giants and explore further. So let us thank these open-source tools, remember them, and use them to take our research to the next level!

I. Open-Source Tools

 

1. Egypt, the first open-source machine translation toolkit (including the famous GIZA++)

The Egypt statistical machine translation toolkit consists of four modules, of which GIZA, the module used to train word alignments, is still widely used. GIZA++ is an extended version of GIZA that implements the five models proposed by IBM. Its main idea is to run the EM algorithm iteratively over a bilingual corpus, deriving word alignments from sentence-aligned text. GIZA++ is language-independent and can be trained on any language pair, which is also one of the advantages of statistical machine translation. Almost all statistical machine translation systems now use this tool for word alignment training. (It is worth mentioning that GIZA++ was written by Franz Josef Och, a leading figure in statistical machine translation. The translation teams he led won first place in the NIST evaluation many times, and once he joined Google he became even more formidable.)
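To make the idea of EM training on sentence-aligned text concrete, here is a minimal sketch of the simplest of the IBM models (Model 1) in Python. It is not GIZA++'s implementation: the function names `train_ibm_model1` and `align`, the toy corpus, and the iteration count are all assumptions made for illustration, and the sketch leaves out the empty (NULL) word and the higher models.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate t[(f, e)] = P(f | e) from sentence-aligned word lists.

    pairs: list of (foreign_words, english_words). Hypothetical helper,
    a sketch of IBM Model 1 only (no NULL word, no distortion model).
    """
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(foreign_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: spread each foreign word over the English words
        # of its sentence, in proportion to the current t values.
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def align(fs, es, t):
    """Link each foreign word to its most probable English word."""
    return [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
print(align(["das", "Haus"], ["the", "house"], table))
```

Nothing in the sketch depends on which two languages are used; it only needs sentence pairs, which is the language independence mentioned above.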

 

2. SRILM, a language model training tool

SRILM is an open-source toolkit for building and applying statistical language models. The SRI Speech Technology and Research Laboratory has been developing it since 1995 and still releases new versions. It is widely used in speech recognition, machine translation, and other fields. The toolkit contains a set of C++ class libraries and a set of executable programs for training and applying language models, which makes both tasks easy. Given a sequence of consecutive words, you can call the interface provided by SRILM to obtain the probability of that sequence.
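As a rough illustration of what "obtaining the probability of a word sequence" means, the following Python sketch trains a tiny bigram model with add-one smoothing and scores sentences with it. It does not use SRILM or mirror its API: the class `BigramLM`, its methods, and the toy sentences are hypothetical, and real toolkits use far better smoothing methods.

```python
import math
from collections import Counter

class BigramLM:
    """Toy bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences):
        self.history = Counter()  # how often each word occurs as a history
        self.bigrams = Counter()  # how often each (previous, word) pair occurs
        for words in sentences:
            padded = ["<s>"] + words + ["</s>"]
            self.history.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        # Words the model can predict: every training word plus end-of-sentence.
        self.vocab = {w for s in sentences for w in s} | {"</s>"}

    def prob(self, prev, word):
        """Smoothed P(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (self.history[prev] + len(self.vocab))

    def logprob(self, words):
        """Base-10 log probability of a whole sentence."""
        padded = ["<s>"] + words + ["</s>"]
        return sum(math.log10(self.prob(p, w)) for p, w in zip(padded, padded[1:]))

lm = BigramLM([
    ["the", "house", "is", "small"],
    ["the", "book", "is", "small"],
    ["a", "book", "is", "here"],
])
print(lm.logprob(["the", "book", "is", "small"]))
print(lm.logprob(["small", "is", "book", "the"]))
```

The fluent word order receives a higher score than the scrambled one, which is the property a translation decoder relies on when it uses a language model to choose between candidate outputs.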

 

3. mteval, an automatic evaluation tool for machine translation

Well-known international evaluations of statistical machine translation generally use both automatic and manual evaluation. One of them is the evaluation organized by the National Institute of Standards and Technology (NIST). mteval is the automatic evaluation tool NIST developed; the latest version, mteval-11b.pl, is written in Perl.

 

4. YASMET, a maximum entropy model training tool

YASMET, a tool for training maximum entropy models, was also developed by Franz Josef Och. Strictly speaking, it is a machine learning toolkit.

 

II. Open-Source Systems

1. Pharaoh, the first phrase-based statistical machine translation system

Pharaoh is an early, publicly available statistical machine translation system. It was developed by Philipp Koehn of the Information Sciences Institute at the University of Southern California in connection with his doctoral thesis and released in 2004. Pharaoh consists of two parts: training and decoding. Training extracts statistical knowledge from the corpus and relies on the existing open-source software GIZA++ and SRILM: GIZA++ trains the word alignments and SRILM trains the language model. The decoder, however, is not open source. Pharaoh's principle is simple and it is easy to use, and its appearance played a very important role in promoting machine translation research.

 

2. Silk Road (silkroad), China's first open-source statistical machine translation system

The appearance of Pharaoh lifted the veil of mystery from statistical machine translation. However, the source code of its core component, the decoder, was still not publicly available. For this reason, Chinese researchers jointly developed a fully open-source statistical machine translation system, "Silk Road". The system was developed by five Chinese research institutions and universities (the Institute of Computing, the Institute of Automation, the Institute of Software, Xiamen University, and Harbin Institute of Technology) and was released at the second China workshop on statistical machine translation in 2006. Silk Road includes the following modules: the corpus pre-processing and post-processing module "Cactus", the word alignment module "Loulan", the phrase extraction module "Huyang", and three decoders ("Camel", "Oasis", and "Caravan"). This was the first time that a complete statistical machine translation system had been released in full, and it greatly accelerated the development of statistical machine translation in China.

 

3. Moses

Moses is an upgraded version of Pharaoh with many added features. It is a phrase-based statistical machine translation system jointly developed by eight organizations, including the University of Edinburgh and RWTH Aachen University in Germany. Researchers from these organizations held a workshop at Johns Hopkins University in 2006 and developed the system together over six weeks. The entire system is written in C++, is fully open source from training to decoding, and runs on both Windows and Linux. Incidentally, Moses is the translation system most of us now use, and it is essentially the baseline in the translation field.

4. NiuTrans

NiuTrans is developed by the Natural Language Processing Laboratory of Northeastern University in China. The system is written entirely in C++, runs fast, and uses little memory. However, it currently supports only (hierarchical) phrase-based and syntax-based models. NiuTrans is a rising star among Chinese translation systems, and I hope it keeps getting better and better!

 
