Post under construction
Introduction
Kaldi “nnet3” is a robust framework for DNN acoustic modelling. Almost all the recipes contain examples of different configurations that can be adapted to your own task. However, understanding how to adapt the xconfig file to implement more sophisticated (and sometimes not so sophisticated) ideas is not a straightforward process.
nnet3 structure
An nnet3 neural network is constructed using a general graph structure consisting of:
- A list of Components
- A graph structure that specifies how the Components are connected
The network construction is based on a config file where the Components, nodes, inputs and outputs are defined.
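To make this concrete, here is a minimal sketch of what such a config looks like: one trainable Component plus the graph nodes that wire it between input and output. The names and dimensions are illustrative, not taken from any recipe.

```
# A minimal nnet3 config (names and dims are hypothetical):
input-node name=input dim=40
component name=affine1 type=NaturalGradientAffineComponent input-dim=40 output-dim=100
component-node name=affine1_node component=affine1 input=input
output-node name=output input=affine1_node
```

Note the separation between the component (the parameters) and the component-node (its place in the graph); several nodes may share one component.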
TODO: add Index and Cindex description
xconfig to config
xconfig files are simplified configuration files that define the structure of the network. These files are parsed with the script steps/nnet3/xconfig_to_configs.py, passing the xconfig file and the output path, e.g.:
config_dir=etc/chain/tdnn/configs
steps/nnet3/xconfig_to_configs.py --xconfig-file $config_dir/network.xconfig \
--config-dir $config_dir/configs/
Layers
This section explains how a line in an xconfig file is parsed into the final configuration of the network. Kaldi groups the layers into several kinds. I will list all of the layers, but I will only detail some of them.
basic_layers
basic_layers.py:
- input
- output (not real outputs; they just map directly to an output-node in nnet3)
- output_layer (real output layer)
- relu-layer
- relu-renorm-layer
- relu-batchnorm-dropout-layer
- relu-dropout-layer
- relu-batchnorm-layer
- relu-batchnorm-so-layer
- batchnorm-so-relu-layer
- batchnorm-layer
- sigmoid-layer
- tanh-layer
- fixed-affine-layer (is an affine transform that is supplied at network initialization time and is not trainable)
- affine-layer (fully connected layer)
- idct-layer (to convert input MFCC-features to Filterbank features)
- spec-augment-layer
convolution
convolution.py:
- conv-batchnorm-layer
- conv-renorm-layer
- res-block (residual block as in ResNets)
- res2-block (residual block with post-activations; no support for downsampling)
- SumBlockComponent (For channel averaging)
attention
attention.py:
- attention-renorm-layer
- attention-relu-renorm-layer
- attention-relu-batchnorm-layer
- relu-renorm-attention-layer
- or any combination of relu, attention, sigmoid, tanh, renorm, batchnorm, dropout
lstm
lstm.py:
- lstm-layer
- lstmp-layer
- lstmp-batchnorm-layer (followed by batchnorm)
- fast-lstm-layer
- fast-lstm-batchnorm-layer (followed by batchnorm)
- lstmb-layer
- fast-lstmp-layer
- fast-lstmp-batchnorm-layer
gru
gru.py:
- gru-layer (Gated Recurrent Unit)
- pgru-layer (Projected Gated Recurrent Unit)
- norm-pgru-layer (batchnorm in the forward direction, renorm in the recurrence)
- opgru-layer (Output-gate Projected Gated Recurrent Unit; see the paper)
- norm-opgru-layer (batchnorm in the forward direction, renorm in the recurrence)
- fast-gru-layer
- fast-pgru-layer
- fast-norm-pgru-layer (batchnorm in the forward direction, renorm in the recurrence)
- fast-opgru-layer
- fast-norm-opgru-layer
stats_layer
stats_layer.py:
- stats-layer (adds statistics-extraction and statistics-pooling components)
trivial_layers
trivial_layers.py:
- renorm-component
- batchnorm-component
- no-op-component
- delta-layer
- linear-component
- combine-feature-maps-layer
- affine-component
- scale-component
- offset-component
- dim-range-component
How to..
To construct a network definition in Kaldi: first, define the network.xconfig file; second, parse the xconfig into configs with the steps/nnet3/xconfig_to_configs.py script; then run steps/nnet3/chain/train.py. This pipeline assumes that all the features and egs files already exist.
The network.xconfig file construction
One way to construct the xconfig file is to write its lines into network.xconfig directly from the run_net.sh script. This way you can set the different parameters using variables, keeping the script organized and easy to modify.
dir=exp/chain/tdnn_sp
mkdir -p $dir/configs
# Definition of the xconfig
cat <<EOF > $dir/configs/network.xconfig
input ...
...
output-layer ...
EOF
# Parse xconfig to final, init and ref configs
steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig \
--config-dir $dir/configs/
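The heredoc pattern above can be exercised end to end without Kaldi. The following self-contained sketch fills in placeholder layers and dimensions (tdnn1, tdnn2, relu_dim=512 and the output dim are all illustrative, not from a real recipe) just to show how the shell variables end up expanded inside network.xconfig:

```shell
# Sketch of building network.xconfig from a script (layer names/dims are illustrative).
dir=exp/chain/tdnn_sp
relu_dim=512
mkdir -p $dir/configs

cat <<EOF > $dir/configs/network.xconfig
input dim=40 name=input
relu-batchnorm-layer name=tdnn1 dim=$relu_dim
relu-batchnorm-layer name=tdnn2 dim=$relu_dim input=Append(-1,0,1)
output-layer name=output dim=3456
EOF

# The shell variables are expanded inside the unquoted heredoc:
grep "dim=$relu_dim" $dir/configs/network.xconfig
```

Because the heredoc delimiter (EOF) is unquoted, `$relu_dim` is substituted when the file is written; quoting it (`<<'EOF'`) would keep the literal text instead.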
Input layer
In the Kaldi recipes, the dimension of the input layer is commonly 40 (MFCC features). In some cases this value is hard-coded in the dim parameter of the input layer definition. But sometimes you may want to experiment with vectors of a different size, so it is more convenient to have a dynamic value that automatically takes the vector size. You can get the vector size by calling the feat-to-dim program as:
# Getting features vector dimension
feat_path=data/train_clean_sp_hires
feat_dim=`feat-to-dim scp:${feat_path}/feats.scp -`
# Using the feat_dim in xconfig
input dim=$feat_dim name=input
MFCC features
The most basic input layer in xconfig is defined with the layer type input. It is important to set the name of this layer to input.
input dim=40 name=input
MFCC + iVectors features
If you want to concatenate iVectors with the MFCCs, you need to define another input layer called ivector plus a fixed-affine-layer. In the following example, the notation inside the Append function assumes that an input layer named input exists, and the -1,0,1 notation is expanded to input[-1], input[0], input[1].
input dim=100 name=ivector
input dim=40 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=foo/lda.mat
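For reference, the xconfig parser turns this shorthand into a full nnet3 descriptor over the preceding layer's output. A rough sketch of what appears in the generated config (the exact form in final.config may differ slightly):

```
# Shorthand in network.xconfig:
#   input=Append(-1,0,1,ReplaceIndex(ivector, t, 0))
# Roughly equivalent full descriptor:
input=Append(Offset(input, -1), input, Offset(input, 1), ReplaceIndex(ivector, t, 0))
```

ReplaceIndex(ivector, t, 0) pins the iVector to time index 0, so the same (per-utterance or online) iVector is appended at every frame.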
Using Filterbanks
Kaldi prefers to store MFCC features because they are more compact than filterbank features. So if you want to train, for example, a CNN-TDNN network, you need to transform the MFCCs into filterbanks to train the CNN part. To avoid storing both kinds of features, Kaldi provides the idct-layer, which converts the MFCCs into filterbank features.
input dim=40 name=input
idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
In the case of the CNN-TDNN example, the order of the layers is important; you can think of it as a convention.
input dim=100 name=ivector
input dim=40 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
Multiview features
In some scenarios, you may want to add features at different levels, e.g. frame, utterance, speaker, recording party, and so on. To do this, you can concatenate the features as:
TODO add the example
Multi-task
I will not explain here how to construct a multi-task learning setup, but Josh Meyer has a nice template you can follow: https://github.com/JRMeyer/multi-task-kaldi
TDNN layers
The following is an example of a common TDNN definition from the Librispeech recipe.
relu_dim=725
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}')
learning_rate_factor=$(echo "print (0.5/$xent_regularize)" | python)
cat <<EOF > $dir/configs/network.xconfig
input dim=100 name=ivector
input dim=40 name=input
# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=$relu_dim
relu-batchnorm-layer name=tdnn2 dim=$relu_dim input=Append(-1,0,1,2)
relu-batchnorm-layer name=tdnn3 dim=$relu_dim input=Append(-3,0,3)
relu-batchnorm-layer name=tdnn4 dim=$relu_dim input=Append(-3,0,3)
relu-batchnorm-layer name=tdnn5 dim=$relu_dim input=Append(-3,0,3)
relu-batchnorm-layer name=tdnn6 dim=$relu_dim input=Append(-6,-3,0)
## adding the layers for chain branch
relu-batchnorm-layer name=prefinal-chain dim=$relu_dim target-rms=0.5
output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5
# adding the layers for xent branch
# This block prints the configs for a separate output that will be
# trained with a cross-entropy objective in the 'chain' models... this
# has the effect of regularizing the hidden parts of the model. we use
# 0.5 / args.xent_regularize as the learning rate factor- the factor of
# 0.5 / args.xent_regularize is suitable as it means the xent
# final-layer learns at a rate independent of the regularization
# constant; and the 0.5 was tuned so as to make the relative progress
# similar in the xent and regular final layers.
relu-batchnorm-layer name=prefinal-xent input=tdnn6 dim=$relu_dim target-rms=0.5
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
EOF
steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/
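The two helper variables at the top of that script are worth a closer look. tree-info prints one key-value pair per line, and the pipeline just picks out the num-pdfs value. The sketch below mocks the tree-info output with a hard-coded string (the real command is `tree-info $tree_dir/tree`; the numbers are illustrative) so that only the text processing is demonstrated:

```shell
# Mocked output of `tree-info $tree_dir/tree` -- it prints lines like
# "num-pdfs 3456"; the values here are hypothetical.
tree_info_output="context-width 2
central-position 1
num-pdfs 3456"

# Same extraction as in the recipe: grep the line, take the second field.
num_targets=$(echo "$tree_info_output" | grep num-pdfs | awk '{print $2}')
echo "num_targets=$num_targets"
```

The learning_rate_factor line works the same way: it pipes a tiny print expression into python so that 0.5/$xent_regularize (e.g. 5.0 for xent_regularize=0.1) is computed at script time and substituted into the heredoc.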
Terminology
Some of the terms have a link to the definition on the deepai.org website.
- so: Scale and offset
- batchnorm: Batch normalization
- affine: Affine layer
- sigmoid: Sigmoid function
- relu: Rectified linear units
- gru: Gated recurrent unit
- lstm: Long short-term memory
- attention: Attention models
- convolutional: Convolutional neural network
- lda: Linear Discriminant Analysis
Documentation of Components
If you check the final.config file after parsing the xconfig, you will see that several components have been inserted. Many of them are implicit in the definition of the network.
- The NaturalGradientAffineComponent implements the natural gradient for stochastic gradient descent described in the paper.
- The LinearComponent represents a linear (matrix) transformation of its input, with a matrix as its trainable parameters. It is the same as NaturalGradientAffineComponent, but without the bias term.
- The TdnnComponent is a more memory-efficient alternative to manually splicing several frames of input.
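As a rough illustration of what such a component line looks like in final.config, here is a hypothetical TdnnComponent entry; the name, dimensions and options are placeholders, not copied from a real recipe:

```
# Illustrative final.config line (name/dims/options are hypothetical):
component name=tdnnf1.linear type=TdnnComponent input-dim=512 output-dim=256 time-offsets=-1,0 use-bias=false
```

The time-offsets option is what replaces explicit Append(Offset(...)) splicing: the component itself consumes the listed frame offsets.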