CTC is the core concept that makes it possible to transcribe unsegmented sequence data. RNNLIB implements it in a single layer called the Transcription Layer. We go into this particular layer in this post; the main reference is Graves' original paper.
The key point of CTC is to use a simple map that transforms the RNN output into an unsegmented labelling, and to construct a new objective function based on that map. The map does not need a precise alignment, which greatly simplifies the task and reduces the involvement of human experts.
“Connectionist” is the adjective form of “connectionism”, a term from cognitive science for modelling mental or behavioural phenomena as the emergent processes of interconnected networks of simple units. The most common forms use neural network models.
In the traditional neural network recipe, we model the input sequence independently at each time step, or frame. This can be referred to as framewise classification. Kadous extended the classification paradigm to multivariate time series and named it temporal classification. Mathematically, framewise classification models the distribution over output sequences of the same length as the input sequence, whereas temporal classification models the distribution over output sequences of all lengths. With the latter, we do not have to label every time step in the training data set.
Combining RNNs and temporal classification, Graves proposed connectionist temporal classification. To distinguish it from classification, RNNLIB implements CTC as the Transcription Layer, indicating that with CTC we can directly transcribe an input sequence (e.g. an acoustic signal) into an output sequence (e.g. words).
Following the notation in the paper, we first list the symbols.
Symbol | Meaning |
---|---|
$L$ | (finite) alphabet of labels |
$\mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$ | underlying distribution of data |
$S$ | set of training examples supposed to be drawn from $\mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$ |
$(\mathbf{x}, \mathbf{z})$ | example in $S$: $\mathbf{x} = (x_1, \dots, x_T)$ is the input sequence and $\mathbf{z} = (z_1, \dots, z_U)$ the target sequence, with $U \le T$ |
$h$ | temporal classifier to be trained |
$\mathcal{N}_w$ | RNN, with weight vector $w$ |
$\mathbf{y} = \mathcal{N}_w(\mathbf{x})$ | sequence of RNN output |
$y_k^t$ | the activation of output unit $k$ at time $t$, interpreted as the probability of observing label $k$ at time $t$ |
$\pi$ | path, element of $L'^T$, where $L' = L \cup \{b\}$ and $b$ denotes the blank |
$\mathbf{l}$ | label sequence or labelling, element of $L^{\le T}$ |
$\mathcal{B}$ | map from path to labelling |
$\mathbf{l}_{p:q}$ | sub-sequence of $\mathbf{l}$ from the $p$th to the $q$th element |
$\mathbf{l}'$ | modified label sequence, with blanks added to the beginning and the end and inserted between every pair of labels in $\mathbf{l}$, so that $\|\mathbf{l}'\| = 2\|\mathbf{l}\| + 1$ |
$\alpha_t(s)$ | forward variable, the total probability of $\mathbf{l}'_{1:s}$ at time $t$ |
$\beta_t(s)$ | backward variable, the total probability of $\mathbf{l}'_{s:\|\mathbf{l}'\|}$ from time $t$ on, including the output at time $t$ |
$\hat{\beta}_t(s)$ | backward variable in the dissertation's form, the total probability of $\mathbf{l}'_{s:\|\mathbf{l}'\|}$ from time $t+1$ on |
$O^{ML}$ | maximum likelihood objective function |
$\delta_{kk'}$ | Kronecker delta |
The goal is to use $S$ to train a temporal classifier $h$ to classify previously unseen input sequences in a way that minimises the label error rate. The objective function follows the principle of maximum likelihood:

$$O^{ML}(S, \mathcal{N}_w) = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x}).$$

To train the network with gradient descent, we need to differentiate this objective with respect to the network outputs, which reduces to differentiating $p(\mathbf{z}|\mathbf{x})$ with respect to $y_k^t$.
Another thing we have to consider is how to map from network outputs to labellings. Use $\mathcal{B}$ to denote the map that first removes the repeated labels and then the blanks from a path; for example, $\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$. Then we can define the conditional probability of a labelling as the total probability of all paths mapped onto it,

$$p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x}),$$

where,

$$p(\pi|\mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in L'^{T}.$$
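As a concrete illustration, here is a minimal Python sketch of the map $\mathcal{B}$ and of the brute-force path sum above. This is not RNNLIB code; the function names and the convention that output unit 0 is the blank are assumptions of the sketch, and the exponential enumeration is only useful for checking the efficient algorithm below on toy inputs.

```python
import itertools
import numpy as np

BLANK = 0  # convention for this sketch: output unit 0 is the blank

def collapse(path):
    """The map B: remove repeated labels first, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            out.append(k)
        prev = k
    return tuple(out)

def labelling_prob_bruteforce(y, l):
    """p(l|x) as the sum over all paths pi with B(pi) = l of prod_t y[t, pi_t].

    y[t, k] is the softmax output for label k at time t. Exponential in T,
    so only usable to sanity-check the forward-backward recursion below.
    """
    T, K = y.shape
    total = 0.0
    for pi in itertools.product(range(K), repeat=T):
        if collapse(pi) == tuple(l):
            total += np.prod([y[t, k] for t, k in enumerate(pi)])
    return total
```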
To calculate $p(\mathbf{l}|\mathbf{x})$, naively summing over all paths in $\mathcal{B}^{-1}(\mathbf{l})$ is intractable; instead we use the forward and backward variables defined above. Note that the product of the forward and backward variables at a given $s$ and $t$ is the probability of all paths corresponding to $\mathbf{l}$ that go through the symbol $\mathbf{l}'_s$ at time $t$:

$$\alpha_t(s)\beta_t(s) = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = \mathbf{l}'_s}} y_{\mathbf{l}'_s}^t \prod_{t'=1}^{T} y_{\pi_{t'}}^{t'}.$$

(The factor $y_{\mathbf{l}'_s}^t$ appears twice because both variables include the output at time $t$.) Rearranging and substituting in from the definition of $p(\pi|\mathbf{x})$,

$$\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t} = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = \mathbf{l}'_s}} p(\pi|\mathbf{x}).$$

For any $t$, we can then sum over all $s$ to obtain

$$p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t}.$$
On the other hand, combining the two equations above shows that the only terms of $p(\mathbf{l}|\mathbf{x})$ depending on $y_k^t$ are those for paths going through label $k$ at time $t$. Thus, to differentiate $p(\mathbf{l}|\mathbf{x})$ with respect to $y_k^t$, we need only consider the positions where $k$ occurs in $\mathbf{l}'$ (the same label may occur several times). Defining $lab(\mathbf{l}, k) = \{s : \mathbf{l}'_s = k\}$, which may be empty, we have

$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y_k^t} = \frac{1}{(y_k^t)^2} \sum_{s \in lab(\mathbf{l},k)} \alpha_t(s)\beta_t(s).$$

At this point, we can set $\mathbf{l} = \mathbf{z}$ and differentiate the objective function for a single training example,

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial y_k^t} = -\frac{1}{p(\mathbf{z}|\mathbf{x})} \frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial y_k^t},$$

where,

$$p(\mathbf{z}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{z}'|} \frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{z}'_s}^t} \quad \text{for any } t.$$

Next, we can give the gradient for the unnormalised outputs $u_k^t$, i.e. the inputs of the softmax function,

$$y_k^t = \frac{e^{u_k^t}}{\sum_{k'} e^{u_{k'}^t}}, \qquad \frac{\partial y_{k'}^t}{\partial u_k^t} = y_{k'}^t \left( \delta_{kk'} - y_k^t \right).$$

Then, applying the chain rule, we get

$$\frac{\partial O^{ML}}{\partial u_k^t} = y_k^t - \frac{1}{y_k^t\, p(\mathbf{z}|\mathbf{x})} \sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\beta_t(s);$$

we write the last step by noting that $\sum_{s} \alpha_t(s)\beta_t(s) / y_{\mathbf{z}'_s}^t = p(\mathbf{z}|\mathbf{x})$ for any $t$.
The last thing we have to do is calculate the forward and backward variables. We now show that, by defining them recursively, these variables can be calculated efficiently.

Given a labelling $\mathbf{l}$, the forward variable is defined as

$$\alpha_t(s) \equiv \sum_{\substack{\pi \in L'^T:\\ \mathcal{B}(\pi_{1:t}) = \mathbf{l}_{1:\lfloor s/2 \rfloor}}} \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'},$$

where $s$ indexes the modified sequence $\mathbf{l}'$. All correct paths must start with either a blank ($b$) or the first symbol of $\mathbf{l}$. This gives us the following rules for initialisation,

$$\alpha_1(1) = y_b^1, \qquad \alpha_1(2) = y_{\mathbf{l}_1}^1, \qquad \alpha_1(s) = 0,\ \forall s > 2,$$

and recursion,

$$\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s)\, y_{\mathbf{l}'_s}^t & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s-2} = \mathbf{l}'_s, \\ \big(\bar{\alpha}_t(s) + \alpha_{t-1}(s-2)\big)\, y_{\mathbf{l}'_s}^t & \text{otherwise,} \end{cases}$$

where

$$\bar{\alpha}_t(s) \equiv \alpha_{t-1}(s) + \alpha_{t-1}(s-1).$$

Note that $\alpha_t(s) = 0,\ \forall s < |\mathbf{l}'| - 2(T - t) - 1$, because these variables correspond to states for which there are not enough time steps left to complete the sequence. Here we also get another method to calculate $p(\mathbf{l}|\mathbf{x})$: it is the sum of the total probabilities of $\mathbf{l}'$ with and without the final blank at time $T$,

$$p(\mathbf{l}|\mathbf{x}) = \alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'| - 1).$$
Similarly, the backward variables can be initialised as

$$\beta_T(|\mathbf{l}'|) = y_b^T, \qquad \beta_T(|\mathbf{l}'| - 1) = y_{\mathbf{l}_{|\mathbf{l}|}}^T, \qquad \beta_T(s) = 0,\ \forall s < |\mathbf{l}'| - 1,$$

with recursion

$$\beta_t(s) = \begin{cases} \bar{\beta}_t(s)\, y_{\mathbf{l}'_s}^t & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s+2} = \mathbf{l}'_s, \\ \big(\bar{\beta}_t(s) + \beta_{t+1}(s+2)\big)\, y_{\mathbf{l}'_s}^t & \text{otherwise,} \end{cases}$$

where

$$\bar{\beta}_t(s) \equiv \beta_{t+1}(s) + \beta_{t+1}(s+1).$$

Note that $\beta_t(s) = 0,\ \forall s > 2t$, which corresponds to states that cannot have been reached yet.
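The recursions translate almost line by line into code. The following NumPy sketch (ours, not the RNNLIB implementation; `make_lprime` and `forward_backward` are hypothetical helper names) fills the $\alpha$ and $\beta$ lattices for an output matrix `y` of shape $T \times |L'|$, reusing the `BLANK` convention from the earlier sketch.

```python
import numpy as np

BLANK = 0

def make_lprime(l):
    """Build l': blanks at the start, the end, and between every pair of labels."""
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]
    return lp

def forward_backward(y, l):
    """Fill the alpha/beta lattices (paper form: both include y at time t)."""
    T = y.shape[0]
    lp = make_lprime(l)
    S = len(lp)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, BLANK]          # start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]      # ... or with the first label of l
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the s-2 transition is only allowed between distinct non-blanks
            if s > 1 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, lp[s]]

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, BLANK]          # end with a blank ...
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]  # ... or with the last label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s < S - 1:
                b += beta[t + 1, s + 1]
            if s < S - 2 and lp[s] != BLANK and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]

    return alpha, beta, lp
```

With these, `alpha[-1, -1] + alpha[-1, -2]` gives $p(\mathbf{l}|\mathbf{x})$, and $\sum_s \alpha_t(s)\beta_t(s)/y_{\mathbf{l}'_s}^t$ should return the same value at every $t$; both are easy to check against `labelling_prob_bruteforce` on toy inputs.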
The following figure illustrates the forward-backward algorithm applied to the labelling ‘CAT’ (from the paper).
The `TranscriptionLayer` class inherits from the `SoftmaxLayer` class (see this post). The `feed_forward()` and `feed_back()` methods implement the general softmax function, so we only need to implement the `calculate_errors()` method to compute the output errors derived above.
But the backward variables in the code are in another form, given in Graves' dissertation. Consider a backward variable whose product starts from time $t + 1$, i.e. one that does not include the output at time $t$. Noting that,

$$\hat{\beta}_t(s) = \frac{\beta_t(s)}{y_{\mathbf{l}'_s}^t},$$

we can get the recursion formula for $\hat{\beta}_t(s)$. The initialisation becomes

$$\hat{\beta}_T(|\mathbf{l}'|) = \hat{\beta}_T(|\mathbf{l}'| - 1) = 1, \qquad \hat{\beta}_T(s) = 0,\ \forall s < |\mathbf{l}'| - 1,$$

and the recursion

$$\hat{\beta}_t(s) = \hat{\beta}_{t+1}(s)\, y_{\mathbf{l}'_s}^{t+1} + \hat{\beta}_{t+1}(s+1)\, y_{\mathbf{l}'_{s+1}}^{t+1} + \hat{\beta}_{t+1}(s+2)\, y_{\mathbf{l}'_{s+2}}^{t+1}.$$

Noting that, if $\mathbf{l}'_s = b$ or $\mathbf{l}'_{s+2} = \mathbf{l}'_s$, the last term is dropped. And the gradient for the output $u_k^t$ becomes

$$\frac{\partial O^{ML}}{\partial u_k^t} = y_k^t - \frac{1}{p(\mathbf{z}|\mathbf{x})} \sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\hat{\beta}_t(s),$$

where,

$$p(\mathbf{z}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{z}'|} \alpha_t(s)\hat{\beta}_t(s) \quad \text{for any } t.$$

Actually, the RNNLIB code computes all these quantities in log scale, which avoids numerical underflow in the long products.
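Putting the pieces together, here is a sketch of the error signal in the dissertation's form, reusing `forward_backward` from above. The helper names are ours, and it works with plain probabilities rather than RNNLIB's log-scale arithmetic, so it is only numerically safe for short sequences.

```python
import numpy as np

def ctc_errors(y, z):
    """dO/du for one training pair (x, z):
    grad[t, k] = y[t, k] - (1/p) * sum over s with lp[s] == k
                 of alpha[t, s] * beta_hat[t, s]."""
    T, K = y.shape
    alpha, beta, lp = forward_backward(y, z)
    # dissertation-style backward variables: divide out the output at time t
    beta_hat = beta / y[:, lp]           # y[:, lp] is y[t, lp[s]] as (T, S)
    p_z = alpha[-1, -1] + alpha[-1, -2]  # p(z|x)

    grad = y.copy()                      # the y_k^t term of the gradient
    for t in range(T):
        for s, k in enumerate(lp):
            grad[t, k] -= alpha[t, s] * beta_hat[t, s] / p_z
    return grad
```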
To wrap up, CTC uses a forward-backward algorithm to efficiently compute the RNN output errors corresponding to a new maximum likelihood objective function. With these errors, we can use any traditional gradient method to train the network.

Once the network is trained, we would use it to transcribe some unknown input sequence $\mathbf{x}$; decoding means finding the labelling with the highest conditional probability,

$$\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x}).$$

There are two approximate algorithms: best path decoding and prefix search decoding.
The first method, best path decoding, assumes that the most probable path corresponds to the most probable labelling,

$$\mathbf{l}^* \approx \mathcal{B}(\pi^*),$$

where

$$\pi^* = \arg\max_{\pi} p(\pi|\mathbf{x}).$$

This is trivial to compute, simply by concatenating the most active outputs at every time step. But it can lead to errors, because the map $\mathcal{B}$ is many-to-one: a labelling can gather probability from many individually weak paths and thus be more probable than the labelling of the single best path.
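Best path decoding is a one-liner on top of the `collapse` sketch from earlier, and a two-step toy example (in the comments; our numbers, not from the paper) shows how the many-to-one map can make it fail:

```python
import numpy as np

def best_path_decode(y):
    """l* ~ B(pi*): concatenate the most active output at every time step."""
    return collapse(np.argmax(y, axis=1))

# Failure case: two time steps, outputs (blank, 'A') with y[t] = (0.6, 0.4).
# The best path is (blank, blank), giving the empty labelling with
# probability 0.36, but p('A'|x) = 0.4*0.4 + 0.4*0.6 + 0.6*0.4 = 0.64.
```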
The second method, prefix search decoding, modifies the forward variables to efficiently calculate the probabilities of successive extensions of labelling prefixes.

Prefix search decoding is a best-first search through the tree of labellings, where the children of a given labelling are those that share it as a prefix. At each step the search extends the labelling whose children have the largest cumulative probability (see the figure below). Each node either ends ($e$) or extends the prefix at its parent node; the number above an extending node is the total probability of all labellings beginning with that prefix, and the number above an end node is the probability of the single labelling ending at its parent. The search ends when a complete labelling is found that is more probable than any remaining prefix.
To extend the tree, we need to compute the probabilities of extended prefixes, which can be done recursively. Let $\gamma_t(p_n)$ be the probability of the network outputting prefix $p$ by time $t$ such that a non-blank label is output at time $t$, and let $\gamma_t(p_b)$ be the same, except that a blank is output at time $t$. Then, for a length $T$ input sequence $\mathbf{x}$, the probability of $p$ as a complete labelling is

$$p(p|\mathbf{x}) = \gamma_T(p_n) + \gamma_T(p_b),$$

where we also need the cumulative probability $p(p\ldots|\mathbf{x})$ of all labellings that strictly extend $p$. In fact, by definition, the relation between an extended prefix $p' = p + k$ (prefix $p$ extended with label $k \in L$) and its parent gives the recursions

$$\gamma_t(p'_n) = y_k^t \big( \mathrm{newLabelProb}(t) + \gamma_{t-1}(p'_n) \big), \qquad \gamma_t(p'_b) = y_b^t \big( \gamma_{t-1}(p'_b) + \gamma_{t-1}(p'_n) \big),$$

where

$$\mathrm{newLabelProb}(t) = \gamma_{t-1}(p_b) + (1 - \delta_{k,\, p_{\text{end}}})\, \gamma_{t-1}(p_n)$$

is the probability of starting the new label $k$ exactly at time $t$ ($p_{\text{end}}$ denotes the last label of $p$; if $k$ repeats it, the path must pass through a blank). Using these, we can calculate the path probabilities,

$$p(p'|\mathbf{x}) = \gamma_T(p'_n) + \gamma_T(p'_b), \qquad p(p'\ldots|\mathbf{x}) = \gamma_1(p'_n) + \sum_{t=2}^{T} y_k^t\, \mathrm{newLabelProb}(t) - p(p'|\mathbf{x}).$$

The extension procedure starts from the empty prefix $\emptyset$, with

$$\gamma_t(\emptyset_n) = 0, \qquad \gamma_t(\emptyset_b) = \prod_{t'=1}^{t} y_b^{t'}, \qquad \gamma_1(p'_n) = \begin{cases} y_k^1 & \text{if } p = \emptyset, \\ 0 & \text{otherwise,} \end{cases} \qquad \gamma_1(p'_b) = 0,$$

and iterates until the most probable complete labelling found so far, $\mathbf{l}^*$, satisfies $p(\mathbf{l}^*|\mathbf{x}) \ge p(p\ldots|\mathbf{x})$ for every unexpanded prefix $p$.
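The whole procedure fits in a short function. The sketch below follows the structure of the algorithm as given in Graves' dissertation; the data structures and names are ours, and like the other sketches it works with raw probabilities rather than log-scale values.

```python
import numpy as np

def prefix_search_decode(y):
    """Prefix search decoding over the output matrix y[t, k] (BLANK = 0).

    Best-first search over labelling prefixes; returns the most probable
    labelling as a tuple of label indices.
    """
    T, K = y.shape
    labels = [k for k in range(K) if k != BLANK]

    # gamma arrays for the empty prefix: only the all-blank paths survive
    gamma_n = {(): np.zeros(T)}              # prefix output, non-blank at t
    gamma_b = {(): np.cumprod(y[:, BLANK])}  # prefix output, blank at t

    p_empty = gamma_b[()][T - 1]
    best_l, best_p = (), p_empty             # l* and p(l*|x) found so far
    remaining = {(): 1.0 - p_empty}          # p(p...|x) of unexpanded prefixes

    while remaining:
        p_star = max(remaining, key=remaining.get)
        if remaining[p_star] <= best_p:
            break                            # no prefix can beat l* any more
        for k in labels:
            p = p_star + (k,)
            gn, gb = np.zeros(T), np.zeros(T)
            gn[0] = y[0, k] if p_star == () else 0.0
            prefix_prob = gn[0]
            for t in range(1, T):
                # probability of starting the new label k exactly at time t
                new_label = gamma_b[p_star][t - 1]
                if not (p_star and p_star[-1] == k):
                    new_label += gamma_n[p_star][t - 1]
                gn[t] = y[t, k] * (new_label + gn[t - 1])
                gb[t] = y[t, BLANK] * (gb[t - 1] + gn[t - 1])
                prefix_prob += y[t, k] * new_label
            gamma_n[p], gamma_b[p] = gn, gb
            p_l = gn[T - 1] + gb[T - 1]      # p(p|x): p as a complete labelling
            p_ext = prefix_prob - p_l        # p(p...|x): labellings extending p
            if p_l > best_p:
                best_l, best_p = p, p_l
            if p_ext > best_p:
                remaining[p] = p_ext
        del remaining[p_star]

    return best_l
```

On the toy example from the best path sketch, `prefix_search_decode` returns the labelling `('A',)` with probability 0.64 instead of the empty labelling.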
Given enough time, prefix search decoding always finds the most probable labelling. However, the maximum number of prefixes it must expand grows exponentially with the input sequence length, so we need a further heuristic.
Observing that the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks, we can divide the output sequence into sections that are very likely to begin and end with a blank. We do this by choosing boundary points where the probability of observing a blank label is above a certain threshold, then apply the above algorithm to each section individually and concatenate the results to get the final transcription.
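A minimal version of this heuristic might look as follows; the threshold value and the helper names are assumptions of the sketch, with `prefix_search_decode` taken from above:

```python
def decode_by_sections(y, threshold=0.99):
    """Cut where the blank probability exceeds the threshold, run prefix
    search on each section, and concatenate the partial transcriptions."""
    T = y.shape[0]
    cuts = [t for t in range(1, T) if y[t, BLANK] > threshold]
    result, start = (), 0
    for t in cuts + [T]:
        if t > start:
            result += prefix_search_decode(y[start:t])
        start = t
    return result
```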