Recurrent Neural Network Language Modeling Toolkit Source Analysis (Three)


Series Preface. Reference documents:
  1. RNNLM - Recurrent Neural Network Language Modeling Toolkit
  2. Recurrent Neural Network Based Language Model
  3. Extensions of Recurrent Neural Network Language Model
  4. Strategies for Training Large Scale Neural Network Language Models
  5. Statistical Language Models Based on Neural Networks
  6. A Guide to Recurrent Neural Networks and Backpropagation
  7. A Neural Probabilistic Language Model
  8. Learning Long-Term Dependencies with Gradient Descent Is Difficult
  9. Can Artificial Neural Networks Learn Language Models?

This article begins the analysis of the function implementations in the .cpp file. I have not reorganized the functions; they are covered in the order in which they appear in the file. Where a function is easier to explain with a diagram, I will add one. Because my knowledge is limited, mistakes are inevitable, and readers who spot them are welcome to point them out ~
OK, let's take a look at the beginning of this section, which reads as follows:
#ifdef USE_BLAS
extern "C" {
#include <cblas.h>
}
#endif

Here a header file cblas.h appears. BLAS stands for Basic Linear Algebra Subprograms, a high-performance math library for vector and matrix computation. BLAS itself is written in Fortran, and CBLAS is the C-language interface library for BLAS. The rnnlmlib.cpp file itself is written in C++ but needs to call the C-language CBLAS, so extern "C" is used to indicate that the contents of the braces must be compiled and linked according to the C-language specification. This is necessary because C++ and C compilers follow different name-mangling rules when generating object code; extern "C" is what makes mixed programming of C and C++ possible. For more details you can refer to two posts: Deep Exploration of the Meaning of extern "C" in C++, and Installation and Use of CBLAS.
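As a minimal, self-contained illustration of this pattern (my own sketch, not code from the toolkit), the following assumes CBLAS is installed and linked with -lcblas; it wraps the header in extern "C" and calls the standard CBLAS routine cblas_ddot from C++:

// Minimal sketch: calling the C-language CBLAS from C++.
// Assumes CBLAS is installed; compile with e.g.: g++ demo.cpp -lcblas
#include <cstdio>

extern "C" {            // compile/link the C header with C naming rules
#include <cblas.h>
}

int main() {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    // cblas_ddot(n, x, incx, y, incy): dot product of two double vectors
    double d = cblas_ddot(3, x, 1, y, 1);
    printf("dot = %f\n", d);   // 1*4 + 2*5 + 3*6 = 32
    return 0;
}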
Now let's look at the first function, which generates a random decimal in a given range, as follows:
real CRnnLM::random(real min, real max)
{
    return rand()/(real)RAND_MAX*(max-min)+min;
}
Here RAND_MAX is a constant defined by a macro in VC's stdlib.h: #define RAND_MAX 0x7FFF, so its value is 32767; it is commonly used when generating random decimals. The return value of rand() lies in [0, RAND_MAX], where [] denotes a closed interval, meaning the boundary values can be taken. So the value returned here lies in [min, max]. If we want to return a number in [min, max) instead, we can use the following statement: return rand()/(real)(RAND_MAX+1)*(max-min)+min; and to get a random integer, rand() % n can be used.
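Here is a small sketch of my own that contrasts the two variants; the helper names random_closed and random_half_open are hypothetical:

#include <cstdio>
#include <cstdlib>
#include <ctime>

typedef double real;

// closed interval [min, max], as in CRnnLM::random
real random_closed(real min, real max) {
    return rand()/(real)RAND_MAX*(max-min)+min;
}

// half-open interval [min, max): max itself can never be returned
real random_half_open(real min, real max) {
    return rand()/((real)RAND_MAX+1.0)*(max-min)+min;
}

int main() {
    srand((unsigned)time(NULL));
    printf("[0,1]: %f\n", random_closed(0.0, 1.0));
    printf("[0,1): %f\n", random_half_open(0.0, 1.0));
    printf("integer in [0,10): %d\n", rand()%10);
    return 0;
}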
Here are a few functions that set file names; they are easy to understand. For completeness, they are included as follows:
// Set the file name of the training data
void CRnnLM::setTrainFile(char *str)
{
    strcpy(train_file, str);
}

// Set the file name of the validation dataset
void CRnnLM::setValidFile(char *str)
{
    strcpy(valid_file, str);
}

// Set the file name of the test set
void CRnnLM::setTestFile(char *str)
{
    strcpy(test_file, str);
}

// Set the model save file, i.e. the file used to store the model and its parameters
void CRnnLM::setRnnLMFile(char *str)
{
    strcpy(rnnlm_file, str);
}

The following function is a basic function that is used repeatedly by other functions. Its job is to read one word from a file into word, but note two points: 1. A word can be at most 99 characters long (the last character is reserved for the terminator); longer words are truncated. 2. At the end of each sentence in the training set, </s> is automatically generated as a separate word and copied into word for return; it also serves as the flag for determining the end of a sentence.
void CRnnLM::readWord(char *word, FILE *fin)
{
    int a=0, ch;

    // feof(FILE *stream) returns a non-zero value when the end of the file is reached
    while (!feof(fin)) {
        // read one character from the stream into ch
        ch=fgetc(fin);

        // ASCII 13 is carriage return, '\r', i.e. go back to the beginning of the line;
        // note that '\r' differs from '\n', which is a newline; "\r\n" mainly appears
        // in text files
        if (ch==13) continue;

        if ((ch==' ') || (ch=='\t') || (ch=='\n')) {
            if (a>0) {
                // a word has been read; if the delimiter is '\n', push it back into
                // the stream so the next call can turn it into </s>
                if (ch=='\n') ungetc(ch, fin);
                break;
            }

            // if a==0 and a newline is met, i.e. the end of the previous sentence,
            // mark the end of the sentence with </s> alone as a word
            if (ch=='\n') {
                strcpy(word, (char *)"</s>");
                return;
            }
            else continue;
        }

        word[a]=ch;
        a++;

        // words that are too long are truncated; word[99] will hold the terminator
        if (a>=MAX_STRING) {
            //printf("Too long word found!\n");   // truncate too long words
            a--;
        }
    }
    // terminate the string with '\0', whose ASCII code is 0
    word[a]=0;
}
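To see what this tokenization produces, here is a small self-contained test harness of my own (not part of the toolkit) that copies the same logic into a standalone function and runs it over a two-line buffer created with tmpfile():

#include <cstdio>
#include <cstring>

#define MAX_STRING 100

// standalone copy of the readWord logic, for demonstration only
void readWordDemo(char *word, FILE *fin) {
    int a=0, ch;
    while (!feof(fin)) {
        ch=fgetc(fin);
        if (ch==13) continue;
        if ((ch==' ') || (ch=='\t') || (ch=='\n')) {
            if (a>0) { if (ch=='\n') ungetc(ch, fin); break; }
            if (ch=='\n') { strcpy(word, (char *)"</s>"); return; }
            else continue;
        }
        word[a]=ch; a++;
        if (a>=MAX_STRING) a--;
    }
    word[a]=0;
}

int main() {
    FILE *f = tmpfile();
    fputs("the cat\nsat\n", f);
    rewind(f);

    char word[MAX_STRING];
    // expected token stream: "the" "cat" "</s>" "sat" "</s>"
    while (1) {
        readWordDemo(word, f);
        if (feof(f)) break;
        printf("token: %s\n", word);
    }
    fclose(f);
    return 0;
}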

The following function looks up word: if found, it returns word's index in vocab; if not, it returns -1. The variables involved were only briefly explained before, so here is a quick look at the relationship between word, getWordHash(word), vocab_hash[], and vocab[] (see the figure). As the figure shows, given word, its index in vocab can be obtained in O(1) time: vocab[vocab_hash[getWordHash(word)]]. The hash mapping is the typical way of trading space for time, but hash mappings have the problem of collisions, so the lookup has three levels: if there is a collision, vocab is searched sequentially, with time complexity O(vocab_size).

// Returns the hash value of the word
int CRnnLM::getWordHash(char *word)
{
    unsigned int hash, a;

    hash=0;
    // how the word's hash is computed
    for (a=0; a<strlen(word); a++) hash=hash*237+word[a];
    // vocab_hash_size is initialized in the CRnnLM constructor to 100 million,
    // i.e. 100000000
    hash=hash%vocab_hash_size;

    return hash;
}
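As a quick worked example of this rolling hash (my own arithmetic, on the hypothetical word "ab"):

// hash = 0
// hash = 0*237  + 'a' (97) = 97
// hash = 97*237 + 'b' (98) = 22989 + 98 = 23087
// 23087 % 100000000 = 23087, which is the subscript used into vocab_hash[]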

int CRnnLM::searchVocab(char *word)
{
    int a;
    unsigned int hash;

    hash=getWordHash(word);

    // first-level lookup: vocab_hash[hash]==-1 means the current word is not in vocab
    if (vocab_hash[hash]==-1) return -1;
    // second-level lookup: confirm that the current word's slot has not been taken
    // over by some other word
    if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash];

    // third-level lookup: reaching here means the current word's hash collides with
    // another word's, so fall back to a direct linear search
    for (a=0; a<vocab_size; a++) {
        if (!strcmp(word, vocab[a].word)) {
            // overwrite the slot with the word found now, so that vocab_hash always
            // keeps the most recently looked-up word for this hash value; this way,
            // the more frequently a word is looked up, the more likely it is found
            // in O(1) next time even after a collision!
            vocab_hash[hash]=a;
            return a;
        }
    }

    // not found, i.e. the word is not in vocab: out-of-vocabulary
    return -1;
}
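To make the collision handling concrete, here is a simplified standalone sketch of my own: it shrinks the hash table to 7 slots on purpose so that collisions actually happen, and reproduces the three-level lookup:

#include <cstdio>
#include <cstring>

#define HASH_SIZE 7          // tiny on purpose, to force collisions
#define MAX_WORDS 16

char vocab[MAX_WORDS][32];
int  vocab_size = 0;
int  vocab_hash[HASH_SIZE];

unsigned int getHash(const char *w) {
    unsigned int h=0;
    for (size_t a=0; a<strlen(w); a++) h=h*237+w[a];
    return h%HASH_SIZE;
}

int search(const char *w) {
    unsigned int h=getHash(w);
    if (vocab_hash[h]==-1) return -1;                            // level 1: slot empty
    if (!strcmp(w, vocab[vocab_hash[h]])) return vocab_hash[h];  // level 2: slot matches
    for (int a=0; a<vocab_size; a++)                             // level 3: linear scan
        if (!strcmp(w, vocab[a])) { vocab_hash[h]=a; return a; }
    return -1;
}

void add(const char *w) {
    strcpy(vocab[vocab_size], w);
    vocab_hash[getHash(w)]=vocab_size;
    vocab_size++;
}

int main() {
    for (int a=0; a<HASH_SIZE; a++) vocab_hash[a]=-1;
    add("the"); add("cat"); add("sat");
    // with only 7 slots some of these words may share a hash; search() still
    // finds each of them, falling back to the linear scan on a collision
    printf("the -> %d\n", search("the"));
    printf("cat -> %d\n", search("cat"));
    printf("dog -> %d (OOV)\n", search("dog"));
    return 0;
}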

The following function reads the word currently pointed to by the file pointer and returns that word's index in vocab. The training, validation, and test data files all end with a blank line, so reading the file contents in order, the last word found before the end of the file must be </s>, after which fin is at the end of the file.
int CRnnLM::readWordIndex(FILE *fin)
{
    char word[MAX_STRING];

    readWord(word, fin);
    if (feof(fin)) return -1;

    return searchVocab(word);
}

Next, this function adds word to vocab and returns the index of the newly added word in vocab; it links word to vocab through vocab_hash by way of the word's hash. You can also see that the memory is managed dynamically. The code with comments is as follows:
int CRnnLM::addWordToVocab(char *word)
{
    unsigned int hash;

    strcpy(vocab[vocab_size].word, word);
    vocab[vocab_size].cn=0;
    vocab_size++;

    // vocab is managed dynamically: when the array memory is about to run out,
    // the array is grown by 100 units at a time, each unit being a vocab_word
    if (vocab_size+2>=vocab_max_size) {
        vocab_max_size+=100;
        // realloc is used to grow or shrink the memory without changing the
        // original contents; the system looks for free memory directly after the
        // block, and if none is found it moves the old data to a place that is
        // large enough. That is, realloc may move the data -- reading source code
        // is also a way to review some C knowledge
        vocab=(struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word));
    }

    // the word's hash value is used as the subscript into vocab_hash, and the
    // integer stored there is the word's index in vocab
    hash=getWordHash(word);
    vocab_hash[hash]=vocab_size-1;

    return vocab_size-1;
}
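As a small aside on the realloc pattern, the following standalone sketch (mine; the struct name and sizes are hypothetical) grows an array by 100 units once it is nearly full and shows that existing entries survive the reallocation:

#include <cstdio>
#include <cstdlib>
#include <cstring>

struct vocab_word_demo { char word[32]; int cn; };

int main() {
    int max_size = 2;
    int size = 0;
    struct vocab_word_demo *v =
        (struct vocab_word_demo *)calloc(max_size, sizeof(struct vocab_word_demo));

    const char *words[] = {"the", "cat", "sat", "on", "the", "mat"};
    for (int i = 0; i < 6; i++) {
        strcpy(v[size].word, words[i]);
        v[size].cn = 1;
        size++;
        // grow before the array runs out, as addWordToVocab does;
        // realloc preserves the existing entries even if the block moves
        if (size + 2 >= max_size) {
            max_size += 100;
            v = (struct vocab_word_demo *)realloc(v, max_size * sizeof(struct vocab_word_demo));
        }
    }
    printf("first word still intact after realloc: %s\n", v[0].word);
    free(v);
    return 0;
}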

Next is a selection sort that sorts vocab[1] through vocab[vocab_size-1] by word frequency, from largest to smallest.
void CRnnLM::sortVocab()
{
    int a, b, max;
    vocab_word swap;

    // note that the subscript starts from 1, leaving vocab[0] out of the sort;
    // in fact vocab[0] stores </s>, as can be seen later in
    // learnVocabFromTrainFile()
    for (a=1; a<vocab_size; a++) {
        max=a;
        for (b=a+1; b<vocab_size; b++) if (vocab[max].cn<vocab[b].cn) max=b;

        swap=vocab[max];
        vocab[max]=vocab[a];
        vocab[a]=swap;
    }
}
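Selection sort is O(vocab_size^2). For comparison only, here is a hedged sketch of my own that produces the same descending-by-count ordering with the standard library's qsort, again leaving element 0 in place (this is an alternative I am showing, not what the toolkit does):

#include <cstdio>
#include <cstdlib>

struct wc { const char *word; int cn; };

// descending by count
int cmp_desc(const void *x, const void *y) {
    const struct wc *a = (const struct wc *)x;
    const struct wc *b = (const struct wc *)y;
    return b->cn - a->cn;
}

int main() {
    struct wc v[] = { {"</s>", 2}, {"cat", 1}, {"the", 3}, {"sat", 1} };
    // sort v[1..3] only, leaving v[0] (</s>) untouched, as sortVocab does
    qsort(v + 1, 3, sizeof(struct wc), cmp_desc);
    for (int i = 0; i < 4; i++) printf("%s %d\n", v[i].word, v[i].cn);
    return 0;
}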

The next function reads the data from train_file and loads the relevant information into vocab and vocab_hash; it assumes that vocab is empty when called.
void CRnnLM::learnVocabFromTrainFile()
{
    char word[MAX_STRING];
    FILE *fin;
    int a, i, train_wcn;

    // initializing vocab_hash this way means that for a word not in vocab,
    // vocab_hash[getWordHash(word)] is -1
    for (a=0; a<vocab_hash_size; a++) vocab_hash[a]=-1;

    // open the file in binary mode; for the difference between binary and text
    // files, see http://www.cnblogs.com/flying-roc/articles/1798817.html
    // if train_file is stored as a text file, sentences end with \r\n, and the
    // readWord() function above has a conditional statement that discards the \r;
    // if train_file is stored as binary, sentences end with \n only, so for files
    // made up of characters there is not much difference
    fin=fopen(train_file, "rb");

    vocab_size=0;

    // this is why vocab[0] stores </s>
    addWordToVocab((char *)"</s>");

    // records the number of tokens in train_file
    train_wcn=0;
    while (1) {
        readWord(word, fin);
        if (feof(fin)) break;

        train_wcn++;

        // words stored in vocab are not repeated; for a repeated word,
        // its frequency is increased by 1
        i=searchVocab(word);
        if (i==-1) {
            a=addWordToVocab(word);
            vocab[a].cn=1;
        } else vocab[i].cn++;
    }

    // note that vocab is sorted after train_file has been read; later you will
    // see that this helps with assigning words to classes
    sortVocab();

    // select vocabulary size
    /*a=0;
    while (a<vocab_size) {
        a++;
        if (vocab[a].cn==0) break;
    }
    vocab_size=a;*/

    if (debug_mode>0) {
        printf("Vocab size: %d\n", vocab_size);
        printf("Words in train file: %d\n", train_wcn);
    }

    // train_words indicates the number of words in the training file
    train_words=train_wcn;

    fclose(fin);
}
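To tie these functions together, here is a hedged usage sketch; the driver and the file name train.txt are hypothetical, and it assumes that rnnlmlib.h declares the CRnnLM members discussed above:

// Hypothetical driver, assuming the CRnnLM class as analyzed in this article.
#include "rnnlmlib.h"
#include <cstdio>

int main() {
    CRnnLM model;
    model.setTrainFile((char *)"train.txt");   // hypothetical corpus file
    model.learnVocabFromTrainFile();           // fills vocab and vocab_hash

    // after learning: vocab[0] is </s>, and vocab[1..] is sorted by frequency;
    // searchVocab returns an index into vocab, or -1 for an OOV word
    int idx = model.searchVocab((char *)"the");
    printf("index of 'the': %d\n", idx);
    return 0;
}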
Since this article has reached about the same length as the previous ones, the analysis of the remaining functions will continue in the next installment.

