Data format and data loading options in PDNN

Source: Internet
Author: User

As the saying goes, whoever eats another's food or takes another's gifts is beholden to them: when you use other people's tools, do not complain, just follow their rules.

The training and validation data are specified on the command line as arguments, as follows:

--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m" --valid-data "valid.pfile,stream=false,random=true"

The part before the first comma (if any) specifies the file name.

A glob-style wildcard can be used to specify multiple files (this is currently not supported for Kaldi data files).

Data files can also be compressed with gzip or bz2, in which case the original extension is followed by an additional extension, ".gz" or ".bz2".
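
The file-name rules above can be sketched in Python as follows. This is only an illustration, not PDNN's own loading code: the helper names `expand_files` and `open_data_file` are hypothetical, and the sketch assumes Python 3.

```python
import bz2
import glob
import gzip


def expand_files(pattern):
    """Return the sorted list of files matching a glob-style pattern."""
    return sorted(glob.glob(pattern))


def open_data_file(path, mode="rb"):
    """Open a plain, gzip- or bz2-compressed file based on its extension.

    Hypothetical helper: the compression is guessed from the last extension.
    """
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)
```

For example, `expand_files("train*.pkl.gz")` would list every matching file, and each of them could then be opened with `open_data_file`.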

After the file name, you can specify any number of data loading options in the format "key=value". The functions of these options are described in the following sections.
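
To make the layout of such an argument concrete, here is a minimal parsing sketch. The function `parse_data_spec` is a hypothetical helper written for this illustration; PDNN's real parser may behave differently, for instance in how it validates options.

```python
def parse_data_spec(spec):
    """Split 'file,key=value,...' into a file pattern and an options dict.

    Hypothetical helper for illustration; not taken from PDNN.
    """
    parts = [p.strip() for p in spec.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty data specification")
    pattern, options = parts[0], {}
    for item in parts[1:]:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % item)
        options[key.strip()] = value.strip()
    return pattern, options


pattern, options = parse_data_spec(
    "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
)
print(pattern)   # train.pfile
print(options)
```

All option values stay strings at this stage; each option would be interpreted separately afterwards.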

Supported Data Formats

PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.

PFiles

The PFile is the ICSI feature file archive format. PFiles have the extension ".pfile". A PFile can store multiple sentences, each of which is a sequence of frames.

Each frame is associated with a feature vector and one or more class labels. Below is an example of the content of a PFile:

Sentence ID   Frame ID   Feature Vector                   Class Label
0             0          [0.2, 0.3, 0.5, 1.4, 1.8, 2.5]   10
0             1          [1.3, 2.1, 0.3, 0.1, 1.4, 0.9]   179
1             0          [0.3, 0.5, 0.5, 1.4, 0.8, 1.4]   32

For speech processing, sentences and frames correspond to utterances and frames, respectively. Frames are indexed within each sentence.

For other applications, you can use fake sentence indices and frame indices.

For example, if you have N instances, you can set all the sentence indices to 0 and let the frame indices run from 0 to N-1.
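
The indexing scheme for non-speech data can be sketched with NumPy. The feature values and labels below are made up for illustration, and the sketch only builds the index columns; it does not write an actual PFile.

```python
import numpy as np

# Four hypothetical instances with 3-dimensional features (illustrative data).
features = np.array([[0.2, 0.3, 0.5],
                     [1.3, 2.1, 0.3],
                     [0.3, 0.5, 0.5],
                     [0.9, 0.1, 0.7]], dtype="float32")
labels = np.array([2, 0, 1, 2])

n = len(features)
sentence_ids = np.zeros(n, dtype=int)  # every instance goes into sentence 0
frame_ids = np.arange(n)               # frames are numbered 0 .. N-1

# Print one row per instance in the same column order as the table above.
for s, f, x, y in zip(sentence_ids, frame_ids, features, labels):
    print(s, f, x.tolist(), y)
```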

A standard toolkit for PFiles is pfile_utils-v0_51. An installation script is available that sets it up automatically on Linux. HTK users can use a Python script to convert HTK features and labels into PFiles; for more information, refer to the comments at the top of that script.

Python Pickle Files

Python pickle files may have the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file stores exactly one sentence. feature is a 2-D numpy array, where each row is the feature vector of one instance; label is a 1-D numpy array, where each element is the class label of one instance.

To read a (gzip-compressed) pickle file in Python:

    # cPickle exists only in Python 2; on Python 3, use the built-in pickle module.
    import cPickle, numpy, gzip
    with gzip.open('filename.pkl.gz', 'rb') as f:
        feature, label = cPickle.load(f)

To create a (gzip-compressed) pickle file in Python:

    # cPickle exists only in Python 2; on Python 3, use the built-in pickle module.
    import cPickle, numpy, gzip
    feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype='float32')
    label = numpy.array([2, 0, 1])
    with gzip.open('filename.pkl.gz', 'wb') as f:
        cPickle.dump((feature, label), f)
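
The two snippets above rely on cPickle, which exists only in Python 2. As a sketch of the same round trip on Python 3, the built-in `pickle` module can be used instead. The temporary directory is an assumption made so that the example does not write into the working directory, and the final checks simply restate the (feature, label) layout described earlier.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np

feature = np.array([[0.2, 0.3, 0.5, 1.4],
                    [1.3, 2.1, 0.3, 0.1],
                    [0.3, 0.5, 0.5, 1.4]], dtype="float32")
label = np.array([2, 0, 1])

# Illustrative location; any writable path ending in .pkl.gz would do.
path = os.path.join(tempfile.mkdtemp(), "filename.pkl.gz")

with gzip.open(path, "wb") as f:
    pickle.dump((feature, label), f)

with gzip.open(path, "rb") as f:
    loaded_feature, loaded_label = pickle.load(f)

# One row of features per instance, and one label per instance.
assert loaded_feature.ndim == 2 and loaded_label.ndim == 1
assert loaded_feature.shape[0] == loaded_label.shape[0]
```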

Kaldi Files

The Kaldi data files accepted by PDNN are "Kaldi script files" with the extension ".scp". These files contain "pointers" to the actual feature data stored in "Kaldi archive files" ending in ".ark". Each line of a Kaldi script file specifies the name of an utterance (equivalent to a sentence in PFiles) and its offset in a Kaldi archive file, as follows:

utt01 train.ark:15213

Labels corresponding to the features are provided by "alignment files" ending in ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, where each line specifies the name of an utterance, followed by the label of each frame in this utterance. Below is an example:

utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
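
The two line formats shown in this section can be read with a few lines of Python. This is a minimal sketch for the text layout only: the function names are hypothetical, and it does not read the binary feature data inside the ".ark" archive.

```python
def parse_scp_line(line):
    """Split 'utt01 train.ark:15213' into (utterance, archive, offset)."""
    utt, pointer = line.split()
    archive, offset = pointer.rsplit(":", 1)
    return utt, archive, int(offset)


def parse_ali_line(line):
    """Split 'utt01 0 51 51 ...' into (utterance, list of frame labels)."""
    fields = line.split()
    return fields[0], [int(x) for x in fields[1:]]


print(parse_scp_line("utt01 train.ark:15213"))
# ('utt01', 'train.ark', 15213)

utt, labels = parse_ali_line(
    "utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48"
)
print(utt, len(labels))
# utt01 18
```

The utterance name is what ties a line of the script file to its line in the alignment file.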

On-the-Fly Context Padding and Label Manipulation

Oftentimes, we want to include the features of neighboring frames in the feature vector of the current frame. Of course, you can do this when you prepare the data files, but this would bloat their size. A more clever way is to perform this "context padding" on the fly. PDNN provides the option "context" to do this. Specifying "context=5" will pad each frame with 5 frames on either side, so that the feature vector becomes 11 times the original dimensionality. Specifying "context=5:1" will pad each frame with 5 frames on the left and 1 frame on the right. Alternatively, you can also specify "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries. At the beginning and end of each sentence, the first and last frames are repeated when the context reaches beyond the sentence boundary.
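
The padding rule described above can be sketched with NumPy for a single sentence. This is an illustration rather than PDNN's implementation; the function name `pad_context` is hypothetical, and the order in which neighboring frames are stacked (leftmost neighbor first) is an assumption.

```python
import numpy as np


def pad_context(feats, left, right):
    """Stack each frame with `left` preceding and `right` following frames.

    `feats` holds one sentence with shape (num_frames, dim). Positions that
    fall outside the sentence are filled with the first or last frame.
    """
    num_frames = feats.shape[0]
    columns = []
    for shift in range(-left, right + 1):
        # Clipping the index repeats the boundary frames at both ends.
        idx = np.clip(np.arange(num_frames) + shift, 0, num_frames - 1)
        columns.append(feats[idx])
    return np.concatenate(columns, axis=1)


sentence = np.arange(8, dtype="float32").reshape(4, 2)  # 4 frames, 2 dims
padded = pad_context(sentence, left=5, right=5)         # like context=5
print(padded.shape)  # (4, 22): 11 times the original dimensionality
```

Because the function is applied to one sentence at a time, the padding never crosses a sentence boundary.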

Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded, but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
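
The "ignore-label" syntax and the filtering step can be sketched as follows. Both function names are hypothetical and the data is made up; the one point taken from the text is that frames are dropped only after context padding has been applied.

```python
import numpy as np


def parse_label_set(text):
    """Turn '0:2:7-9' into the set {0, 2, 7, 8, 9}."""
    labels = set()
    for item in text.split(":"):
        if "-" in item:
            first, last = item.split("-")
            labels.update(range(int(first), int(last) + 1))
        else:
            labels.add(int(item))
    return labels


def drop_ignored(feats, labels, ignored):
    """Remove frames whose label is in `ignored` (call after padding)."""
    keep = ~np.isin(labels, sorted(ignored))
    return feats[keep], labels[keep]


ignored = parse_label_set("0:2:7-9")
feats = np.arange(10, dtype="float32").reshape(5, 2)  # illustrative frames
labels = np.array([0, 1, 7, 3, 2])
kept_feats, kept_labels = drop_ignored(feats, labels, ignored)
print(sorted(ignored))       # [0, 2, 7, 8, 9]
print(kept_labels.tolist())  # [1, 3]
```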

When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels is separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
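
A dictionary lookup is one simple way to apply all mappings simultaneously, as this sketch shows. The function names are hypothetical, and leaving unmapped labels unchanged is an assumption of this example rather than documented PDNN behavior.

```python
import numpy as np


def parse_label_map(text):
    """Turn '1:0/3:1/4-6:2' into {1: 0, 3: 1, 4: 2, 5: 2, 6: 2}."""
    mapping = {}
    for pair in text.split("/"):
        source, target = pair.split(":")
        if "-" in source:
            first, last = source.split("-")
            sources = range(int(first), int(last) + 1)
        else:
            sources = [int(source)]
        for s in sources:
            mapping[s] = int(target)
    return mapping


def map_labels(labels, mapping):
    """Apply all mappings at once; unmapped labels are left unchanged."""
    return np.array([mapping.get(int(y), int(y)) for y in labels])


mapping = parse_label_map("1:0/3:1/4:2/5:3/6:4")
print(map_labels(np.array([1, 3, 4, 5, 6]), mapping).tolist())
# [0, 1, 2, 3, 4]; label 3 becomes 1 and is not mapped again to 0
```

Each original label is looked up exactly once, which is what prevents chained mappings such as 3 to 1 to 0.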

Partitions, Streaming and Shuffling

The training or validation corpus may be too large to fit in the CPU or GPU memory. Therefore, it is broken down into several levels of units: files, partitions, and minibatches. This division happens after context padding and label manipulation, so the concept of "sentences" is no longer relevant. As a result, a sentence may be broken into multiple partitions or minibatches.

Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any time, at most one file is held in the CPU memory. This means that if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if it can fit in the CPU memory.

A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
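
To get a feel for what a partition size means in terms of instances, the arithmetic can be sketched as below. The helper names are hypothetical, and the calculation assumes float32 features (4 bytes per value) and ignores the space taken by labels, so the result is only an estimate.

```python
def parse_partition_mb(text):
    """Parse '1000m' or '1000' into a size in megabytes."""
    return int(text[:-1] if text.endswith("m") else text)


def frames_per_partition(partition_mb, feature_dim, bytes_per_value=4):
    """Estimate how many feature vectors fit in one partition.

    Assumes float32 features and ignores label storage.
    """
    partition_bytes = partition_mb * 2 ** 20
    return partition_bytes // (feature_dim * bytes_per_value)


size_mb = parse_partition_mb("1000m")
# Hypothetical 40-dimensional features padded with context=5 (11 frames).
print(frames_per_partition(size_mb, 40 * 11))
```

Note that context padding multiplies the size of each instance, so the same partition holds fewer instances when a larger context is used.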

Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=true" or "stream=false". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training or validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.

It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, specify the option "random=true". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
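
The difference between the two shuffling scopes can be illustrated with index permutations. This is a sketch under the assumption that shuffling is a uniform random permutation within the stated scope; the function names are hypothetical and the sketch assumes NumPy 1.17 or later for `default_rng`.

```python
import numpy as np


def shuffle_whole_file(num_instances, rng):
    """Non-stream mode: one permutation over every instance in the file."""
    return rng.permutation(num_instances)


def shuffle_by_partition(num_instances, partition_size, rng):
    """Stream mode: instances only move within their own partition."""
    order = []
    for start in range(0, num_instances, partition_size):
        stop = min(start + partition_size, num_instances)
        order.extend(start + rng.permutation(stop - start))
    return np.array(order)


rng = np.random.default_rng(0)
print(shuffle_whole_file(8, rng))
print(shuffle_by_partition(8, 4, rng))  # first four indices stay in 0..3
```

In the second case, an instance can never leave the partition it was loaded with, which is why the mixing is weaker than in the non-stream mode.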

A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make up a minibatch are discarded.
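
The effect of discarding the incomplete last minibatch can be sketched with integer division. The function name and the sizes below are hypothetical, chosen only to make the arithmetic visible.

```python
def split_into_minibatches(num_instances, batch_size):
    """Return (start, stop) index pairs of the full minibatches in a partition.

    Trailing instances that cannot fill a whole minibatch are dropped.
    """
    num_batches = num_instances // batch_size
    return [(i * batch_size, (i + 1) * batch_size) for i in range(num_batches)]


batches = split_into_minibatches(num_instances=1000, batch_size=256)
print(len(batches))           # 3
print(1000 - batches[-1][1])  # 232 instances discarded
```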
