Thread-Safe SRILM Language Model C++ Interface

Source: Internet
Author: User

Blog Address: http://blog.csdn.net/wangxinginnlp/article/details/46963659


Older versions are not thread-safe

Over the past few days I have been working on a multi-threaded translation decoder. The decoder ran fine single-threaded, but under multiple threads it often crashed with an unexplained segmentation fault (core dumped). After a day of troubleshooting, the problem was traced to the language model.

Older versions of SRILM do not support multi-threading and fail when used from multiple threads. The failures show up as follows:

    1. With the language model as a shared resource read by multiple threads, the program reports segmentation fault (core dumped).
    2. With the language model as a per-thread resource, where each thread loads the model itself and then uses it, only the first thread loads the language model successfully and the other threads fail to load it. The program reports no error, but every language model score in the translation results is 0.
    3. With the language model as a per-thread resource, where the process prepares several language model objects in advance (that is, it calls new several times) and hands one to each thread, the program again reports segmentation fault (core dumped).

In short, the old version of SRILM cannot be used successfully with multiple threads.


To determine whether your SRILM is an old version, look at your own SRILM interface. If the functions for loading resources and for scoring the conditional probability of a word are, respectively,

    1. void *sriLoadLM(const char *fn, int arpa = 0, int order = 3, int unk = 0, int tolow = 0);
    2. double sriWordProb(void *plm, const char *word, const char *context);

then congratulations, your SRILM is the old version.


The new version is thread-safe

Now the question is: how do we confirm that the new version is thread-safe?

Download the new version from the SRILM official website (http://www.speech.sri.com/projects/srilm/). After extracting the archive, the doc directory under the root directory contains a readme-threads file. Its first paragraph reads:

As of November, SRILM supports multi-threaded applications. This enhancement applies to the five libraries that comprise SRILM: libmisc, libdstruct, liboolm, libflm and liblattice. Please note that this does not imply that all API methods are thread-safe, but rather that it is possible to perform independent SRILM tasks on multiple threads without interference or instability. Some APIs that perform read-only calculations are safe to call on objects shared by multiple threads, but in general this isn't safe, particularly on APIs that mutate data structures not solely owned by the current thread. We will attempt to document specific allowances and limitations within this README and inline in the code.

The key point, put simply: in the new version of SRILM, reads are safe, while writes are not necessarily safe.
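The read/write distinction can be illustrated with a minimal sketch that does not use SRILM at all. A table is filled before any thread starts and is afterwards only read, through a const reference, by several threads. The names ToyModel, toyLogProb and scoreInParallel are hypothetical stand-ins, not SRILM API, and the value -99.0 for unknown words is an arbitrary choice.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a loaded language model: word -> log probability.
using ToyModel = std::unordered_map<std::string, double>;

// Read-only lookup: find() on a const reference never inserts anything.
double toyLogProb(const ToyModel& model, const std::string& word) {
    auto it = model.find(word);
    return it == model.end() ? -99.0 : it->second;
}

// Score the same words on numThreads threads that all share one model.
// Each thread writes only to its own slot of the result vector.
std::vector<double> scoreInParallel(const ToyModel& model,
                                    const std::vector<std::string>& words,
                                    unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<double> results(numThreads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&model, &words, &results, t] {
            double sum = 0.0;
            for (const auto& w : words) sum += toyLogProb(model, w);
            results[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Had the lookup used model[word] instead of find(), a missing key would be inserted into the shared table, which is exactly the kind of hidden write that makes sharing unsafe.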


The annoying part is that searching Google for "SRILM interface", "SRILM API" and so on does not turn up an official interface (what you find is the old-version interface). The only exception is http://blog.csdn.net/mouxiaofeng/article/details/5144750.

Later I found the lm-intro file in the doc directory under the root directory. It says:

API for LANGUAGE MODELS
These programs are just examples of how to use the object-oriented language model library currently under construction. To use the API one would have to read the various .h files and how the interfaces are used in the example programs. No comprehensive documentation is available as yet. Sorry.

The key point, put simply: no official interface documentation is provided.


Fortunately, a Python and Perl interface can be found on GitHub: https://github.com/desilinguist/swig-srilm (called the Python interface below). However, the probability functions this version provides compute either the probability of an n-gram (getUnigramProb, getBigramProb, getTrigramProb, getNgramProb) or the probability of a whole sentence (getSentenceProb). Our decoder needs the probability of a word given its context under a given language model. Below, we turn the Python interface into a C++ interface.


Looking at the srilm.c file of the Python interface, it is easy to see that its language model is of type Ngram. Open Ngram.h in the include directory of the Python interface: Ngram publicly inherits from LM, and it has a wordProb(VocabIndex word, const VocabIndex *context) interface. Great, this is exactly the interface we need.



If you are curious what LM is, look at LM.h in the include directory of the Python interface.

Its opening comment is the closest thing to official documentation:

LM.h --
* Generic LM interface
* The LM class defines an abstract language model interface which all other classes refine and inherit from.

Keep reading and you will find that it declares LogP wordProb(VocabIndex word, const VocabIndex *context) = 0, and that it also has a LogP wordProb(VocabString word, const VocabString *context) interface.



Intuitively, VocabIndex is the numeric id of a word and VocabString is the word as a string class. That guess is not entirely right, as the definitions in Vocab.h show.
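For reference, here is a sketch of the two type definitions as I recall them from SRILM's Vocab.h (default build, without the short-index option). Treat the exact spelling as an assumption and check your own copy of the header. The point is that VocabString is a plain C string, not a string class.

```cpp
#include <type_traits>

// Assumed to mirror the typedefs in SRILM's Vocab.h (default build).
typedef unsigned int VocabIndex;   // numeric id of a word
typedef const char  *VocabString;  // plain C string, not a string class

// A parameter of type `const VocabString *` is therefore an array of
// C strings, which is where the char** in wordProb comes from.
static_assert(std::is_same<const VocabString *, const char *const *>::value,
              "a context is an array of C strings");
```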



What we are after is LM's wordProb(VocabString word, const VocabString *context). If you cannot resist trying it right away and wrap the Python interface's srilm code around the LM class directly, you will find that LM's read() function reports an error, because it is not implemented at all. See lm/src/LM.cc in the extracted SRILM directory.



It does, however, implement wordProb(VocabString word, const VocabString *context).


That implementation contains a possible write operation, but the addUnkWords function defaults to false.


In fact, there is nothing wrong with this interface. I am simply not comfortable handling the second parameter of wordProb(VocabString word, const VocabString *context), which is a char** array.
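For readers who do want to use the string version, here is a minimal sketch of building that second parameter. It assumes, as I understand SRILM's convention, that the context lists the history with the most recent word first and ends with a null pointer. The helper names reversedHistory, asContext and contextLength are hypothetical and not part of SRILM.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a history on whitespace and reverse it: most recent word first.
std::vector<std::string> reversedHistory(const std::string& history) {
    std::istringstream in(history);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return std::vector<std::string>(words.rbegin(), words.rend());
}

// Build a null-terminated array of C strings pointing into `words`.
// `words` must stay alive and unmodified while the array is in use.
std::vector<const char*> asContext(const std::vector<std::string>& words) {
    std::vector<const char*> ctx;
    ctx.reserve(words.size() + 1);
    for (const auto& w : words) ctx.push_back(w.c_str());
    ctx.push_back(nullptr);
    return ctx;
}

// Count entries up to the terminator, the way a callee finds the length.
std::size_t contextLength(const char* const* ctx) {
    assert(ctx != nullptr);
    std::size_t n = 0;
    while (ctx[n] != nullptr) ++n;
    return n;
}
```

With these helpers, ctx.data() has type const char**, the shape the second parameter expects. Keep the word vector in a named variable so that the pointers do not dangle.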


My own solution is to find a sensible way to use Ngram's wordProb. Look at how srilm.c computes an n-gram probability: it simply splits the n-gram into words, then looks up the index of each word in the vocab, and finally passes the indices on for scoring.

    // Get generic n-gram probability (up to n=7)
    float getNgramProb(Ngram* ngram, const char* ngramstr, unsigned order) {
        const char* words[7];
        unsigned int indices[order];
        int numparsed, histsize, i, j;
        char* scp;
        float ans;

        // Duplicate string so that we don't mess up the original
        scp = strdupa(ngramstr);

        // Parse the given string into words
        numparsed = Vocab::parseWords(scp, (VocabString *)words, 7);  // split into words
        if (numparsed != order) {
            fprintf(stderr, "Error: Given order (%d) does not match number of words (%d).\n", order, numparsed);
            return 0;
        }

        // Get indices for the words obtained above, if you don't find them, then add them
        // to the vocabulary and then get the indices.
        swig_srilm_vocab->addWords((VocabString *)words, (VocabIndex *)indices, order);  // look up word indices (writes here, thread-unsafe)

        // Create a history array of size "order" and populate it
        // compute probability
        unsigned hist[order];
        for (i = order; i > 1; i--) {
            hist[order-i] = indices[i-2];
        }
        hist[order-1] = Vocab_None;

        // Compute the ngram probability
        ans = getWordProb(ngram, indices[order-1], hist);

        // Return the representation of log(0) if needed
        if (ans == LogP_Zero)
            return BIGNEG;

        return ans;
    }
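The history-building loop in getNgramProb is easy to misread, so here is a small worked sketch of just that step. Index and kNone are stand-ins for SRILM's VocabIndex and Vocab_None (the value of kNone is an assumption), and Query and makeQuery are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for SRILM's VocabIndex and Vocab_None (assumed values).
using Index = unsigned int;
const Index kNone = static_cast<Index>(-1);

struct Query {
    Index word;                  // the word being predicted (last of the n-gram)
    std::vector<Index> history;  // preceding words, most recent first, kNone-terminated
};

// Turn the word indices of an n-gram into a (word, reversed history) pair.
Query makeQuery(const std::vector<Index>& ngram) {
    assert(!ngram.empty());
    Query q;
    q.word = ngram.back();
    // Walk backwards over the words before the last one.
    for (std::size_t i = ngram.size() - 1; i > 0; --i) {
        q.history.push_back(ngram[i - 1]);
    }
    q.history.push_back(kNone);
    return q;
}
```

For the index sequence 10, 20, 30 this yields the word 30 and the history 20, 10, kNone, which matches what the loop over hist produces for order = 3.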


As the comments above indicate, the Vocab class's addWords performs a write and is not thread-safe. I recommend using getIndices instead, which only reads and is thread-safe.

In Vocab.h


In vocab.cc



When addWords cannot find a word, it writes the word into the vocab. getIndices apparently does not, although its implementation does contain an addWord call under some condition; I will confirm this after reading the code more closely. Experiments suggest that it is thread-safe.
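The difference between the two lookup styles can be sketched with a toy vocabulary. This is not SRILM's Vocab class: ToyVocab, addWord, getIndex and lookupAll are hypothetical stand-ins that only mimic the add-on-miss versus read-only behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Reserved "no such word" index, standing in for SRILM's Vocab_None.
const unsigned kNone = static_cast<unsigned>(-1);

// Hypothetical stand-in for a vocabulary; not SRILM's Vocab class.
class ToyVocab {
public:
    // Write path: inserts unknown words, so it mutates shared state.
    unsigned addWord(const std::string& w) {
        auto it = ids_.find(w);
        if (it != ids_.end()) return it->second;
        unsigned id = static_cast<unsigned>(ids_.size());
        assert(id != kNone);  // kNone is reserved
        ids_.emplace(w, id);
        return id;
    }

    // Read path: const member, unknown words map to `unk`, nothing is stored.
    unsigned getIndex(const std::string& w, unsigned unk = kNone) const {
        auto it = ids_.find(w);
        return it == ids_.end() ? unk : it->second;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, unsigned> ids_;
};

// Look up every word without modifying the vocabulary.
std::vector<unsigned> lookupAll(const ToyVocab& vocab,
                                const std::vector<std::string>& words,
                                unsigned unk) {
    std::vector<unsigned> ids;
    ids.reserve(words.size());
    for (const auto& w : words) ids.push_back(vocab.getIndex(w, unk));
    return ids;
}
```

Because lookupAll takes the vocabulary by const reference, the compiler rejects any accidental call to the write path. That is a cheap way to keep a scoring routine read-only.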




Compiling the interface

    1. Create a directory for the SRILM C++ interface, srilm_interface.
    2. Prepare the include and lib resources. As with the Python interface, compiling the interface requires the relevant static libraries and header files. Copy the include directory and the lib directory from the root of the compiled SRILM toolkit directly into srilm_interface, or specify their paths at compile time.
    3. Put our rewritten srilm.h and srilm.cc, together with main.cc, in srilm_interface.
    4. Compile: g++ srilm.h srilm.cc main.cc -I./include ./lib/liboolm.a ./lib/libdstruct.a ./lib/libmisc.a ./lib/liblattice.a ./lib/libflm.a -lpthread

If compilation succeeds, run the resulting binary as a test. If the test results are correct, the C++ interface is OK.











Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
