Recurrent neural network language modeling toolkit source code (8)

References:

  1. RNNLM - Recurrent Neural Network Language Modeling Toolkit
  2. Recurrent neural network based language model
  3. Extensions of recurrent neural network language model
  4. Strategies for Training Large Scale Neural Network Language Models
  5. Statistical language models based on neural networks
  6. A guide to recurrent neural networks and backpropagation
  7. A Neural Probabilistic Language Model
  8. Learning Long-Term Dependencies with Gradient Descent is Difficult
  9. Can Artificial Neural Networks Learn Language Models?

Because testNbest() and testGen() are not covered here, two main functions remain: the training function trainNet() and the test function testNet(). Both call the functions described in earlier parts. During training, after each pass over the training file the model is immediately evaluated on the validation file to see how well it performs. If the result improves enough, training continues with the same learning rate; if the improvement is small, the learning rate is halved, and this continues until there is no longer a meaningful improvement, at which point training stops. The measure of "how well it performs" is the perplexity of the model on the validation data. The test function reads the test file, accumulates the log probabilities assigned by the trained model, and converts them to a PPL value; it also implements the idea of a dynamic model, meaning that the network parameters can keep being updated while testing, so the model adapts to the test data. The most important computation in both functions is PPL. The formula is given below so that it can be compared with the code:

PPL(w_1 w_2 \dots w_K) = c^{-\frac{1}{K} \sum_{i=1}^{K} \log_c P(w_i \mid w_1 \dots w_{i-1})}

This is the formula for the perplexity of a word sequence w_1 w_2 ... w_K. The base c in the formula is 10 in this program, as will be seen later in the code.
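Before the full listing, it may help to see how this formula maps onto the logp and wordcn variables that appear in the code. The sketch below is a minimal standalone illustration, not part of the toolkit; compute_ppl() and word_probs are hypothetical names used only here, and std::pow(10, x) plays the role of the exp10() call used by the toolkit.

// Minimal standalone sketch (not part of the toolkit): compute_ppl() and
// word_probs are hypothetical names used only for this illustration.
// It shows how per-word probabilities turn into logp and then into PPL
// with base c = 10, mirroring exp10(-logp/(real)wordcn) in the code below.
#include <cmath>
#include <cstdio>
#include <vector>

double compute_ppl(const std::vector<double>& word_probs) {
    double logp = 0.0;                  // cumulative log10 probability, like logp in the toolkit
    for (double p : word_probs)
        logp += std::log10(p);          // logp = log10 P(w1) + log10 P(w2) + ...
    int wordcn = (int)word_probs.size();
    return std::pow(10.0, -logp / wordcn);   // 10^(-(1/K) * sum log10 P(w_i))
}

int main() {
    // hypothetical conditional probabilities for a 4-word sequence
    std::vector<double> probs = {0.2, 0.05, 0.1, 0.25};
    std::printf("PPL = %f\n", compute_ppl(probs));   // prints roughly 7.95
    return 0;
}

With that correspondence in mind, the toolkit's trainNet() and testNet() code, together with comments, follows: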
// train the network
void CRnnLM::trainNet()
{
    int a, b, word, last_word, wordcn;
    char log_name[200];
    FILE *fi, *flog;
    clock_t start, now;     // clock_t is typedef'd as long in time.h

    sprintf(log_name, "%s.output.txt", rnnlm_file);

    printf("Starting training using file %s\n", train_file);
    starting_alpha = alpha;

    // try to open rnnlm_file
    fi = fopen(rnnlm_file, "rb");
    if (fi != NULL) {
        // the file opened successfully, i.e. a trained model already exists
        fclose(fi);
        printf("Restoring network from file to continue training...\n");
        // restore the model information stored in rnnlm_file
        restoreNet();
    } else {
        // rnnlm_file could not be opened: read the data from train_file;
        // the vocabulary is loaded into vocab and vocab_hash
        learnVocabFromTrainFile();
        // allocate memory and initialize the network
        initNet();
        // iter counts how many passes have been made over the training file
        iter = 0;
    }

    if (class_size > vocab_size) {
        printf("WARNING: number of classes exceeds vocabulary size!\n");
    }

    // counter: the word currently being trained is the counter-th word of train_file
    counter = train_cur_pos;
    //saveNet();

    // outermost loop: one pass of this loop trains the whole training file once
    while (iter < maxIter) {
        printf("Iter: %3d\tAlpha: %f\t  ", iter, alpha);
        // fflush(stdout) flushes the standard output buffer so that the line
        // above is printed immediately
        fflush(stdout);

        // initialize bptt_history and history
        if (bptt > 0) for (a = 0; a < bptt + bptt_block; a++) bptt_history[a] = 0;
        for (a = 0; a < MAX_NGRAM_ORDER; a++) history[a] = 0;

        //TRAINING PHASE
        // clear the ac and er values of the neurons
        netFlush();

        // open the training file
        fi = fopen(train_file, "rb");
        // index 0 in vocab is the end-of-sentence token </s>
        last_word = 0;

        if (counter > 0) for (a = 0; a < counter; a++) word = readWordIndex(fi);    //this will skip words that were already learned if the training was interrupted

        // record the start time of this training pass
        start = clock();

        while (1) {
            counter++;

            // progress information is printed only every 10000 words
            if ((counter % 10000) == 0) if (debug_mode > 1) {
                now = clock();
                // train_words is the number of words in the training file
                if (train_words > 0)
                    // the leading %c with argument 13 prints a carriage return (ASCII 13),
                    // not a line feed (ASCII 10), so the progress line is rewritten in place;
                    // TRAIN entropy is the average per-word cross-entropy in bits,
                    // Progress is the position of the current word within the training file,
                    // Words/sec is the training speed
                    printf("%cIter: %3d\tAlpha: %f\t   TRAIN entropy: %.4f    Progress: %.2f%%   Words/sec: %.1f ", 13, iter, alpha, -logp/log10(2)/counter, counter/(real)train_words*100, counter/((double)(now-start)/1000000.0));
                else
                    printf("%cIter: %3d\tAlpha: %f\t   TRAIN entropy: %.4f    Progress: %dK", 13, iter, alpha, -logp/log10(2)/counter, counter/1000);
                fflush(stdout);
            }

            // every anti_k trained words, save the network information to rnnlm_file
            if ((anti_k > 0) && ((counter % anti_k) == 0)) {
                train_cur_pos = counter;
                // save all network information to rnnlm_file
                saveNet();
            }

            // read the next word; readWordIndex() returns the index of the next word in vocab
            word = readWordIndex(fi);     //read next word

            // for the first word of the training file (counter == 1), last_word is end of sentence
            computeNet(last_word, word);      //compute probability distribution

            if (feof(fi)) break;        //end of file: test on validation data, iterate till convergence is reached

            // logp accumulates the log probability: logp = log10 P(w1) + log10 P(w2) + ...
            if (word != -1) logp += log10(neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac);

            // (logp != logp) is true when logp is NaN; isinf() is the C99 macro that
            // returns non-zero when its argument is infinite; both indicate a numerical error
            if ((logp != logp) || (isinf(logp))) {
                printf("\nNumerical error %d %f %f\n", word, neu2[word].ac, neu2[vocab[word].class_index + vocab_size].ac);
                exit(1);
            }

            if (bptt > 0) {     //shift memory needed for bptt to next time step
                // after this shift, bptt_history stores w(t), w(t-1), w(t-2), ... starting at index 0
                for (a = bptt + bptt_block - 1; a > 0; a--) bptt_history[a] = bptt_history[a-1];
                bptt_history[0] = last_word;

                // after this shift, bptt_hidden stores s(t), s(t-1), s(t-2), ... starting at index 0
                for (a = bptt + bptt_block - 1; a > 0; a--) for (b = 0; b < layer1_size; b++) {
                    bptt_hidden[a*layer1_size + b].ac = bptt_hidden[(a-1)*layer1_size + b].ac;
                    bptt_hidden[a*layer1_size + b].er = bptt_hidden[(a-1)*layer1_size + b].er;
                }
            }

            // backward pass: adjust the parameters
            learnNet(last_word, word);

            // copy the ac values of the hidden layer neurons into the last layer1_size
            // part of the input layer, i.e. s(t-1)
            copyHiddenLayerToInput();

            // prepare the input-layer encoding of the next word
            if (last_word != -1) neu0[last_word].ac = 0;  //delete previous activation

            last_word = word;

            // after this shift, history stores w(t), w(t-1), w(t-2), ... starting at index 0
            for (a = MAX_NGRAM_ORDER - 1; a > 0; a--) history[a] = history[a-1];
            history[0] = last_word;

            // word == 0 marks the end of the current sentence; if independent is non-zero,
            // every sentence is trained independently, i.e. the previous sentence is not
            // used as history for the next one; whether that is desirable depends on how
            // strongly consecutive sentences are related
            if (independent && (word == 0)) netReset();
        }
        // close train_file
        fclose(fi);

        now = clock();
        // print the training statistics for the whole file
        printf("%cIter: %3d\tAlpha: %f\t   TRAIN entropy: %.4f    Words/sec: %.1f   ", 13, iter, alpha, -logp/log10(2)/counter, counter/((double)(now-start)/1000000.0));

        // with one_iter the training file is processed only once and the network is saved immediately
        if (one_iter == 1) {        //no validation data are needed and network is always saved with modified weights
            printf("\n");
            logp = 0;
            // save all network information to rnnlm_file
            saveNet();
            break;
        }

        //VALIDATION PHASE
        // run the model over the validation data so that early stopping can be used;
        // unlike the training phase above, only the probability distribution is computed
        // and evaluated over the whole validation file; learnNet() is not called here
        // (that would only apply to dynamic models)

        // clear the ac and er values of the neurons
        netFlush();

        // open the validation file
        fi = fopen(valid_file, "rb");
        if (fi == NULL) {
            printf("Valid file not found\n");
            exit(1);
        }

        // open the log file in "ab" mode: b means binary, a means append
        // (the file is created if it does not exist, otherwise data is appended to its end);
        // log_name holds the string rnnlm_file.output.txt
        flog = fopen(log_name, "ab");
        if (flog == NULL) {
            printf("Cannot open log file\n");
            exit(1);
        }

        //fprintf(flog, "Index   P(NET)          Word\n");
        //fprintf(flog, "----------------------------------\n");

        last_word = 0;
        logp = 0;
        // wordcn has the same meaning as counter, except that OOV words are not counted
        wordcn = 0;
        while (1) {
            // read the next word; readWordIndex() returns the index of the next word in vocab
            word = readWordIndex(fi);
            // compute the probability distribution of the next word
            computeNet(last_word, word);
            if (feof(fi)) break;        //end of file: report LOGP, PPL

            if (word != -1) {
                // logp accumulates the log probability: logp = log10 P(w1) + log10 P(w2) + ...
                logp += log10(neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac);
                wordcn++;
            }

            /*if (word != -1)
                fprintf(flog, "%d\t%f\t%s\n", word, neu2[word].ac, vocab[word].word);
            else
                fprintf(flog, "-1\t0\t\tOOV\n");*/

            //learnNet(last_word, word);    //*** this will be in implemented for dynamic models

            // copy the ac values of the hidden layer neurons into the last layer1_size
            // part of the input layer, i.e. s(t-1)
            copyHiddenLayerToInput();

            // prepare the input-layer encoding of the next word
            if (last_word != -1) neu0[last_word].ac = 0;  //delete previous activation

            last_word = word;

            // after this shift, history stores w(t), w(t-1), w(t-2), ... starting at index 0
            for (a = MAX_NGRAM_ORDER - 1; a > 0; a--) history[a] = history[a-1];
            history[0] = last_word;

            // same as in the training phase: reset the hidden state at sentence
            // boundaries when independent is set
            if (independent && (word == 0)) netReset();
        }
        fclose(fi);

        fprintf(flog, "\niter: %d\n", iter);
        fprintf(flog, "valid log probability: %f\n", logp);
        // exp10(x) computes 10^x, which matches the PPL formula above with c = 10
        fprintf(flog, "PPL net: %f\n", exp10(-logp/(real)wordcn));
        fclose(flog);

        // VALID entropy is the average per-word cross-entropy in bits on the validation data
        printf("VALID entropy: %.4f\n", -logp/log10(2)/wordcn);

        counter = 0;
        train_cur_pos = 0;

        // the l in front of llogp stands for "last": if the current result is worse than
        // the previous iteration, restore the previous weights, otherwise save the current
        // ones (the larger logp is, the better the model fits the validation data)
        if (logp < llogp)
            restoreWeights();
        else
            saveWeights();

        // initially min_improvement = 1.003 and alpha_divide = 0: if the improvement over
        // the previous iteration is smaller than a factor of min_improvement, the learning
        // rate starts being halved; if there is still no significant improvement after that,
        // training stops (see page 30 of the original paper for details)
        if (logp*min_improvement < llogp) {
            // no significant improvement: turn on the alpha_divide switch
            if (alpha_divide == 0) alpha_divide = 1;
            else {
                // no significant improvement and alpha_divide is already on:
                // save the network and stop training
                saveNet();
                break;
            }
        }

        // once alpha_divide is on, the learning rate is halved every iteration
        if (alpha_divide) alpha /= 2;

        llogp = logp;
        logp = 0;
        iter++;
        saveNet();
    }
}

// test the network
void CRnnLM::testNet()
{
    int a, b, word, last_word, wordcn;
    FILE *fi, *flog, *lmprob = NULL;
    real prob_other, log_other, log_combine;
    double d;

    // restore the model information stored in rnnlm_file
    restoreNet();

    // use_lmprob == 1 means probabilities from another, already trained language model are used
    if (use_lmprob) {
        // open the probability file of the other language model
        lmprob = fopen(lmprob_file, "rb");
    }

    //TEST PHASE
    //netFlush();

    // open the test file
    fi = fopen(test_file, "rb");
    //sprintf(str, "%s.%s.output.txt", rnnlm_file, test_file);
    //flog = fopen(str, "wb");

    // stdout is a FILE pointer that the C library predefines in its headers, so it can
    // be assigned to another FILE pointer; printf effectively writes to stdout
    flog = stdout;

    if (debug_mode > 1) {
        if (use_lmprob) {
            fprintf(flog, "Index   P(NET)          P(LM)           Word\n");
            fprintf(flog, "--------------------------------------------------\n");
        } else {
            fprintf(flog, "Index   P(NET)          Word\n");
            fprintf(flog, "----------------------------------\n");
        }
    }

    // the end-of-sentence token </s> has index 0 in vocab, so last_word starts
    // out as end of sentence
    last_word = 0;
    // cumulative log probability of the RNN on the test file
    logp = 0;
    // cumulative log probability of the other language model on the test file
    log_other = 0;
    // cumulative log probability of the interpolation of the RNN and the other model
    log_combine = 0;
    // probability of the current word under the other language model
    prob_other = 0;
    // wordcn has the same meaning as counter in trainNet(), except that OOV words are not counted
    wordcn = 0;

    // copy the ac values of the hidden layer neurons into the last layer1_size
    // part of the input layer, i.e. s(t-1)
    copyHiddenLayerToInput();

    // clear the history
    if (bptt > 0) for (a = 0; a < bptt + bptt_block; a++) bptt_history[a] = 0;
    for (a = 0; a < MAX_NGRAM_ORDER; a++) history[a] = 0;
    if (independent) netReset();

    while (1) {
        // read the next word; readWordIndex() returns the index of the next word in vocab
        word = readWordIndex(fi);
        // compute the probability distribution of the next word
        computeNet(last_word, word);
        if (feof(fi)) break;        //end of file: report LOGP, PPL

        if (use_lmprob) {
            fscanf(lmprob, "%lf", &d);
            prob_other = d;
            goToDelimiter('\n', lmprob);
        }

        // accumulate the log probabilities; log_combine interpolates the RNN with the other model
        if ((word != -1) || (prob_other > 0)) {
            if (word == -1) {
                // OOV word: apply a fixed penalty to the RNN score
                logp += -8;        //some ad hoc penalty - when mixing different vocabularies, single model score is not real PPL
                // interpolation
                log_combine += log10(0*lambda + prob_other*(1-lambda));
            } else {
                // accumulate the RNN log probability
                logp += log10(neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac);
                // interpolation
                log_combine += log10(neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac*lambda + prob_other*(1-lambda));
            }
            log_other += log10(prob_other);
            wordcn++;
        }

        if (debug_mode > 1) {
            if (use_lmprob) {
                if (word != -1) fprintf(flog, "%d\t%.10f\t%.10f\t%s", word, neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac, prob_other, vocab[word].word);
                else fprintf(flog, "-1\t0\t\t0\t\tOOV");
            } else {
                if (word != -1) fprintf(flog, "%d\t%.10f\t%s", word, neu2[vocab[word].class_index + vocab_size].ac * neu2[word].ac, vocab[word].word);
                else fprintf(flog, "-1\t0\t\tOOV");
            }
            fprintf(flog, "\n");
        }

        // a dynamic model keeps learning, i.e. the RNN parameters are updated while testing
        if (dynamic > 0) {
            if (bptt > 0) {
                // shift bptt_history back by one position and put the most recent
                // word at the first position
                for (a = bptt + bptt_block - 1; a > 0; a--) bptt_history[a] = bptt_history[a-1];
                bptt_history[0] = last_word;

                // shift bptt_hidden back by one position; the first position is filled in learnNet()
                for (a = bptt + bptt_block - 1; a > 0; a--) for (b = 0; b < layer1_size; b++) {
                    bptt_hidden[a*layer1_size + b].ac = bptt_hidden[(a-1)*layer1_size + b].ac;
                    bptt_hidden[a*layer1_size + b].er = bptt_hidden[(a-1)*layer1_size + b].er;
                }
            }
            // dynamic is used as the learning rate for the update
            alpha = dynamic;
            learnNet(last_word, word);    //dynamic update
        }

        // copy the ac values of the hidden layer neurons into the last layer1_size
        // part of the input layer, i.e. s(t-1)
        copyHiddenLayerToInput();

        // prepare the input-layer encoding of the next word
        if (last_word != -1) neu0[last_word].ac = 0;  //delete previous activation

        last_word = word;

        // shift the maximum-entropy history back by one position and put the most
        // recent word at the first position
        for (a = MAX_NGRAM_ORDER - 1; a > 0; a--) history[a] = history[a-1];
        history[0] = last_word;

        // same as in trainNet(): reset the hidden state at sentence boundaries when independent is set
        if (independent && (word == 0)) netReset();
    }
    fclose(fi);
    if (use_lmprob) fclose(lmprob);

    // print the statistics for the test file
    //write to log file
    if (debug_mode > 0) {
        fprintf(flog, "\ntest log probability: %f\n", logp);
        if (use_lmprob) {
            fprintf(flog, "test log probability given by other lm: %f\n", log_other);
            fprintf(flog, "test log probability %f*rnn + %f*other_lm: %f\n", lambda, 1-lambda, log_combine);
        }

        fprintf(flog, "\nPPL net: %f\n", exp10(-logp/(real)wordcn));
        if (use_lmprob) {
            fprintf(flog, "PPL other: %f\n", exp10(-log_other/(real)wordcn));
            fprintf(flog, "PPL combine: %f\n", exp10(-log_combine/(real)wordcn));
        }
    }

    fclose(flog);
}
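The part of trainNet() that is easy to lose in the long listing is the validation-driven schedule: compare the validation log probability with the previous iteration, roll back the weights if it got worse, start halving alpha once the relative improvement drops below min_improvement, and stop after the next stall. The sketch below isolates just that logic; it is a simplified illustration, not the toolkit code, and simulate_validation_logp() is a hypothetical stand-in for the validation pass (in the toolkit the value is the accumulated log10 word probability over the valid file).

#include <cstdio>

// Hypothetical stand-in for the validation pass: returns the accumulated log10
// probability of the validation data for a given iteration. The numbers are made
// up so that the improvement flattens out after a few iterations.
static double simulate_validation_logp(int iter) {
    static const double logp_by_iter[] = {-1200.0, -1100.0, -1050.0, -1048.0, -1047.5, -1047.4};
    int last = (int)(sizeof(logp_by_iter) / sizeof(logp_by_iter[0])) - 1;
    return logp_by_iter[iter < last ? iter : last];
}

int main() {
    double alpha = 0.1;                   // starting learning rate
    const double min_improvement = 1.003; // same default as the toolkit
    int alpha_divide = 0;                 // switched to 1 once the improvement stalls
    double llogp = -1e30;                 // validation log probability of the previous iteration

    for (int iter = 0; ; iter++) {
        double logp = simulate_validation_logp(iter);  // larger (less negative) is better
        std::printf("iter %d: alpha = %f, valid logp = %f\n", iter, alpha, logp);

        // in the toolkit this is also where restoreWeights()/saveWeights() is decided
        if (logp * min_improvement < llogp) {          // improvement below the 1.003 factor?
            if (alpha_divide == 0) alpha_divide = 1;   // first stall: start halving alpha
            else { std::printf("stopping\n"); break; } // second stall: stop training
        }
        if (alpha_divide) alpha /= 2;                  // halve alpha every following iteration

        llogp = logp;
    }
    return 0;
}

Running it shows the learning rate staying fixed while the validation score improves quickly, then being halved once the gain per iteration falls below the 1.003 factor, and training stopping at the second stall.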

This brings the walkthrough of the rnnlm toolkit source code to an end. There are certainly points here that I have not understood correctly, and you are welcome to point them out so we can discuss them together. Because the diagrams are scattered across the individual articles, I will post the internal data structures of the rnnlm toolkit as a separate article.
