CTC is the core concept that makes it possible to transcribe unsegmented sequence data. RNNLIB implements it in a single layer called the Transcription Layer. We go into this particular layer in this post; the main reference is Graves' original paper.
The key point of CTC is to use a simple map that transforms the RNN output into an unsegmented labelling, and to construct a new objective function based on that map. The map does not need a precise alignment, which greatly simplifies the task and reduces the involvement of human experts.
“Connectionist” is the adjective form of “connectionism”, a term from cognitive science for modelling mental or behavioural phenomena as the emergent processes of interconnected networks of simple units. The most common forms use neural network models.
In the traditional neural network recipe, we model the input sequence independently at each time step, or frame. This can be referred to as framewise classification. Kadous extended the classification paradigm to multivariate time series and named it temporal classification. Mathematically, framewise classification models the distribution over output sequences of the same length as the input sequence, whereas temporal classification models the distribution over output sequences of all lengths. With the latter, we do not have to label every time step in the training data set.
Combining RNNs and temporal classification, Graves proposed connectionist temporal classification. To distinguish it from classification, RNNLIB implements CTC as the Transcription Layer, indicating that with CTC we can directly transcribe an input sequence (e.g. an acoustic signal) into an output sequence (e.g. words).
Following the notation in the paper, we first list the symbols.
Symbol | Meaning |
---|---|
$L$ | (finite) alphabet of labels |
$\mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$ | underlying distribution of data |
$S$ | set of training examples supposed to be drawn from $\mathcal{D}_{\mathcal{X}\times\mathcal{Z}}$ |
$(\mathbf{x}, \mathbf{z})$ | example in $S$: $\mathbf{x} = (x_1, \dots, x_T)$ is the input sequence and $\mathbf{z} = (z_1, \dots, z_U)$ the target sequence, with $U \le T$ |
$h$ | temporal classifier to be trained |
$\mathcal{N}_w$ | RNN, with weight vector $w$ |
$\mathbf{y} = \mathcal{N}_w(\mathbf{x})$ | sequence of RNN output |
$y_k^t$ | the activation of output unit $k$ at time $t$, interpreted as the probability of observing label $k$ at time $t$ |
$\pi$ | path, element of $L'^T$, where $L' = L \cup \{b\}$ and $b$ denotes the blank |
$\mathbf{l}$ | label sequence or labelling, element of $L^{\le T}$ |
$\mathcal{B}$ | map from path to labelling |
$\mathbf{l}_{p:q}$ | sub-sequence of $\mathbf{l}$ from the $p$th to the $q$th element |
$\mathbf{l}'$ | modified label sequence, with blanks added to the beginning and the end and inserted between every pair of labels in $\mathbf{l}$, so that $\|\mathbf{l}'\| = 2\|\mathbf{l}\| + 1$ |
$\alpha_t(s)$ | forward variable, the total probability of $\mathbf{l}'_{1:s}$ at time $t$ |
$\beta_t(s)$ | backward variable, the total probability of $\mathbf{l}'_{s:\|\mathbf{l}'\|}$ from time $t$ on, including the output at time $t$ |
$\hat{\beta}_t(s)$ | backward variable in the dissertation's form, the total probability of $\mathbf{l}'_{s:\|\mathbf{l}'\|}$ from time $t+1$ on |
$O^{ML}$ | maximum likelihood objective function |
$\delta_{kk'}$ | Kronecker delta |
The goal is to use $S$ to train a temporal classifier $h$ to classify previously unseen input sequences in a way that minimises the label error rate. The objective function follows the principle of maximum likelihood:

$$O^{ML}(S, \mathcal{N}_w) = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x}).$$

To train the network with gradient descent, we need to differentiate this objective with respect to the network outputs, which reduces to differentiating $p(\mathbf{z}|\mathbf{x})$ with respect to $y_k^t$.
Another thing we have to consider is how to map from network outputs to labellings. Use $\mathcal{B}$ to denote the map that first removes the repeated labels and then the blanks from a path; for example, $\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$. Then we can define the conditional probability of a labelling as the total probability of all paths mapped onto it,

$$p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x}),$$

where,

$$p(\pi|\mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in L'^{T}.$$
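As a concrete illustration, here is a minimal Python sketch of the map $\mathcal{B}$ and of the brute-force path sum above. This is not RNNLIB code; the function names and the convention that output unit 0 is the blank are assumptions of the sketch, and the exponential enumeration is only useful for checking the efficient algorithm below on toy inputs.

```python
import itertools
import numpy as np

BLANK = 0  # convention for this sketch: output unit 0 is the blank

def collapse(path):
    """The map B: remove repeated labels first, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            out.append(k)
        prev = k
    return tuple(out)

def labelling_prob_bruteforce(y, l):
    """p(l|x) as the sum over all paths pi with B(pi) = l of prod_t y[t, pi_t].

    y[t, k] is the softmax output for label k at time t. Exponential in T,
    so only usable to sanity-check the forward-backward recursion below.
    """
    T, K = y.shape
    total = 0.0
    for pi in itertools.product(range(K), repeat=T):
        if collapse(pi) == tuple(l):
            total += np.prod([y[t, k] for t, k in enumerate(pi)])
    return total
```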
To calculate $p(\mathbf{l}|\mathbf{x})$, naively summing over all paths in $\mathcal{B}^{-1}(\mathbf{l})$ is intractable; instead we use the forward and backward variables defined above. Note that the product of the forward and backward variables at a given $s$ and $t$ is the probability of all paths corresponding to $\mathbf{l}$ that go through the symbol $\mathbf{l}'_s$ at time $t$:

$$\alpha_t(s)\beta_t(s) = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = \mathbf{l}'_s}} y_{\mathbf{l}'_s}^t \prod_{t'=1}^{T} y_{\pi_{t'}}^{t'}.$$

(The factor $y_{\mathbf{l}'_s}^t$ appears twice because both variables include the output at time $t$.) Rearranging and substituting in from the definition of $p(\pi|\mathbf{x})$,

$$\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t} = \sum_{\substack{\pi \in \mathcal{B}^{-1}(\mathbf{l}):\\ \pi_t = \mathbf{l}'_s}} p(\pi|\mathbf{x}).$$

For any $t$, we can then sum over all $s$ to obtain

$$p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t}.$$
On the other hand, combining the two equations above shows that the only terms of $p(\mathbf{l}|\mathbf{x})$ depending on $y_k^t$ are those for paths going through label $k$ at time $t$. Thus, to differentiate $p(\mathbf{l}|\mathbf{x})$ with respect to $y_k^t$, we need only consider the positions where $k$ occurs in $\mathbf{l}'$ (the same label may occur several times). Defining $lab(\mathbf{l}, k) = \{s : \mathbf{l}'_s = k\}$, which may be empty, we have

$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y_k^t} = \frac{1}{(y_k^t)^2} \sum_{s \in lab(\mathbf{l},k)} \alpha_t(s)\beta_t(s).$$

At this point, we can set $\mathbf{l} = \mathbf{z}$ and differentiate the objective function for a single training example,

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\}, \mathcal{N}_w)}{\partial y_k^t} = -\frac{1}{p(\mathbf{z}|\mathbf{x})} \frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial y_k^t},$$

where,

$$p(\mathbf{z}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{z}'|} \frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{z}'_s}^t} \quad \text{for any } t.$$

Next, we can give the gradient for the unnormalised outputs $u_k^t$, i.e. the inputs of the softmax function,

$$y_k^t = \frac{e^{u_k^t}}{\sum_{k'} e^{u_{k'}^t}}, \qquad \frac{\partial y_{k'}^t}{\partial u_k^t} = y_{k'}^t \left( \delta_{kk'} - y_k^t \right).$$

Then, applying the chain rule, we get

$$\frac{\partial O^{ML}}{\partial u_k^t} = y_k^t - \frac{1}{y_k^t\, p(\mathbf{z}|\mathbf{x})} \sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\beta_t(s);$$

we write the last step by noting that $\sum_{s} \alpha_t(s)\beta_t(s) / y_{\mathbf{z}'_s}^t = p(\mathbf{z}|\mathbf{x})$ for any $t$.
The last thing we have to do is calculate the forward and backward variables. We now show that, by defining them recursively, these variables can be calculated efficiently.

Given a labelling $\mathbf{l}$, the forward variable is defined as

$$\alpha_t(s) \equiv \sum_{\substack{\pi \in L'^T:\\ \mathcal{B}(\pi_{1:t}) = \mathbf{l}_{1:\lfloor s/2 \rfloor}}} \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'},$$

where $s$ indexes the modified sequence $\mathbf{l}'$. All correct paths must start with either a blank ($b$) or the first symbol of $\mathbf{l}$. This gives us the following rules for initialisation,

$$\alpha_1(1) = y_b^1, \qquad \alpha_1(2) = y_{\mathbf{l}_1}^1, \qquad \alpha_1(s) = 0,\ \forall s > 2,$$

and recursion,

$$\alpha_t(s) = \begin{cases} \bar{\alpha}_t(s)\, y_{\mathbf{l}'_s}^t & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s-2} = \mathbf{l}'_s, \\ \big(\bar{\alpha}_t(s) + \alpha_{t-1}(s-2)\big)\, y_{\mathbf{l}'_s}^t & \text{otherwise,} \end{cases}$$

where

$$\bar{\alpha}_t(s) \equiv \alpha_{t-1}(s) + \alpha_{t-1}(s-1).$$

Note that $\alpha_t(s) = 0,\ \forall s < |\mathbf{l}'| - 2(T - t) - 1$, because these variables correspond to states for which there are not enough time steps left to complete the sequence. Here we also get another method to calculate $p(\mathbf{l}|\mathbf{x})$: it is the sum of the total probabilities of $\mathbf{l}'$ with and without the final blank at time $T$,

$$p(\mathbf{l}|\mathbf{x}) = \alpha_T(|\mathbf{l}'|) + \alpha_T(|\mathbf{l}'| - 1).$$
Similarly, the backward variables can be initialised as

$$\beta_T(|\mathbf{l}'|) = y_b^T, \qquad \beta_T(|\mathbf{l}'| - 1) = y_{\mathbf{l}_{|\mathbf{l}|}}^T, \qquad \beta_T(s) = 0,\ \forall s < |\mathbf{l}'| - 1,$$

with recursion

$$\beta_t(s) = \begin{cases} \bar{\beta}_t(s)\, y_{\mathbf{l}'_s}^t & \text{if } \mathbf{l}'_s = b \text{ or } \mathbf{l}'_{s+2} = \mathbf{l}'_s, \\ \big(\bar{\beta}_t(s) + \beta_{t+1}(s+2)\big)\, y_{\mathbf{l}'_s}^t & \text{otherwise,} \end{cases}$$

where

$$\bar{\beta}_t(s) \equiv \beta_{t+1}(s) + \beta_{t+1}(s+1).$$

Note that $\beta_t(s) = 0,\ \forall s > 2t$, which corresponds to states that cannot have been reached yet.
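The recursions translate almost line by line into code. The following NumPy sketch (ours, not the RNNLIB implementation; `make_lprime` and `forward_backward` are hypothetical helper names) fills the $\alpha$ and $\beta$ lattices for an output matrix `y` of shape $T \times |L'|$, reusing the `BLANK` convention from the earlier sketch.

```python
import numpy as np

BLANK = 0

def make_lprime(l):
    """Build l': blanks at the start, the end, and between every pair of labels."""
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]
    return lp

def forward_backward(y, l):
    """Fill the alpha/beta lattices (paper form: both include y at time t)."""
    T = y.shape[0]
    lp = make_lprime(l)
    S = len(lp)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, BLANK]          # start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]      # ... or with the first label of l
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # the s-2 transition is only allowed between distinct non-blanks
            if s > 1 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, lp[s]]

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, BLANK]          # end with a blank ...
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]  # ... or with the last label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s < S - 1:
                b += beta[t + 1, s + 1]
            if s < S - 2 and lp[s] != BLANK and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]

    return alpha, beta, lp
```

With these, `alpha[-1, -1] + alpha[-1, -2]` gives $p(\mathbf{l}|\mathbf{x})$, and $\sum_s \alpha_t(s)\beta_t(s)/y_{\mathbf{l}'_s}^t$ should return the same value at every $t$; both are easy to check against `labelling_prob_bruteforce` on toy inputs.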
The following figure illustrates the forward-backward algorithm applied to the labelling ‘CAT’ (from the paper).
The `TranscriptionLayer` class inherits from the `SoftmaxLayer` class (see this post). The `feed_forward()` and `feed_back()` methods implement the general softmax function, so we only need to implement the `calculate_errors()` method to compute the output errors derived above.
But the backward variables in the code are in another form, given in Graves' dissertation. Consider a backward variable whose product starts from time $t + 1$, i.e. one that does not include the output at time $t$. Noting that,

$$\hat{\beta}_t(s) = \frac{\beta_t(s)}{y_{\mathbf{l}'_s}^t},$$

we can get the recursion formula for $\hat{\beta}_t(s)$. The initialisation becomes

$$\hat{\beta}_T(|\mathbf{l}'|) = \hat{\beta}_T(|\mathbf{l}'| - 1) = 1, \qquad \hat{\beta}_T(s) = 0,\ \forall s < |\mathbf{l}'| - 1,$$

and the recursion

$$\hat{\beta}_t(s) = \hat{\beta}_{t+1}(s)\, y_{\mathbf{l}'_s}^{t+1} + \hat{\beta}_{t+1}(s+1)\, y_{\mathbf{l}'_{s+1}}^{t+1} + \hat{\beta}_{t+1}(s+2)\, y_{\mathbf{l}'_{s+2}}^{t+1}.$$

Noting that, if $\mathbf{l}'_s = b$ or $\mathbf{l}'_{s+2} = \mathbf{l}'_s$, the last term is dropped. And the gradient for the output $u_k^t$ becomes

$$\frac{\partial O^{ML}}{\partial u_k^t} = y_k^t - \frac{1}{p(\mathbf{z}|\mathbf{x})} \sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\hat{\beta}_t(s),$$

where,

$$p(\mathbf{z}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{z}'|} \alpha_t(s)\hat{\beta}_t(s) \quad \text{for any } t.$$

Actually, the RNNLIB code computes all these quantities in log scale, which avoids numerical underflow in the long products.
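Putting the pieces together, here is a sketch of the error signal in the dissertation's form, reusing `forward_backward` from above. The helper names are ours, and it works with plain probabilities rather than RNNLIB's log-scale arithmetic, so it is only numerically safe for short sequences.

```python
import numpy as np

def ctc_errors(y, z):
    """dO/du for one training pair (x, z):
    grad[t, k] = y[t, k] - (1/p) * sum over s with lp[s] == k
                 of alpha[t, s] * beta_hat[t, s]."""
    T, K = y.shape
    alpha, beta, lp = forward_backward(y, z)
    # dissertation-style backward variables: divide out the output at time t
    beta_hat = beta / y[:, lp]           # y[:, lp] is y[t, lp[s]] as (T, S)
    p_z = alpha[-1, -1] + alpha[-1, -2]  # p(z|x)

    grad = y.copy()                      # the y_k^t term of the gradient
    for t in range(T):
        for s, k in enumerate(lp):
            grad[t, k] -= alpha[t, s] * beta_hat[t, s] / p_z
    return grad
```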
To wrap up, CTC uses a forward-backward algorithm to efficiently compute the RNN output errors corresponding to a new maximum likelihood objective function. With these errors, we can use any traditional gradient method to train the network.

Once the network is trained, we would use it to transcribe some unknown input sequence $\mathbf{x}$; decoding means finding the labelling with the highest conditional probability,

$$\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x}).$$

There are two approximate algorithms: best path decoding and prefix search decoding.
The first method, best path decoding, assumes that the most probable path corresponds to the most probable labelling,

$$\mathbf{l}^* \approx \mathcal{B}(\pi^*),$$

where

$$\pi^* = \arg\max_{\pi} p(\pi|\mathbf{x}).$$

This is trivial to compute, simply by concatenating the most active outputs at every time step. But it can lead to errors, because the map $\mathcal{B}$ is many-to-one: a labelling can gather probability from many individually weak paths and thus be more probable than the labelling of the single best path.
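Best path decoding is a one-liner on top of the `collapse` sketch from earlier, and a two-step toy example (in the comments; our numbers, not from the paper) shows how the many-to-one map can make it fail:

```python
import numpy as np

def best_path_decode(y):
    """l* ~ B(pi*): concatenate the most active output at every time step."""
    return collapse(np.argmax(y, axis=1))

# Failure case: two time steps, outputs (blank, 'A') with y[t] = (0.6, 0.4).
# The best path is (blank, blank), giving the empty labelling with
# probability 0.36, but p('A'|x) = 0.4*0.4 + 0.4*0.6 + 0.6*0.4 = 0.64.
```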
The second method, prefix search decoding, modifies the forward variables to efficiently calculate the probabilities of successive extensions of labelling prefixes.

Prefix search decoding is a best-first search through the tree of labellings, where the children of a given labelling are those that share it as a prefix. At each step the search extends the labelling whose children have the largest cumulative probability (see the figure below). Each node either ends ($e$) or extends the prefix at its parent node; the number above an extending node is the total probability of all labellings beginning with that prefix, and the number above an end node is the probability of the single labelling ending at its parent. The search ends when a complete labelling is found that is more probable than any remaining prefix.
To extend the tree, we need to compute the probabilities of extended prefixes, which can be done recursively. Let $\gamma_t(p_n)$ be the probability of the network outputting prefix $p$ by time $t$ such that a non-blank label is output at time $t$, and let $\gamma_t(p_b)$ be the same, except that a blank is output at time $t$. Then, for a length $T$ input sequence $\mathbf{x}$, the probability of $p$ as a complete labelling is

$$p(p|\mathbf{x}) = \gamma_T(p_n) + \gamma_T(p_b),$$

where we also need the cumulative probability $p(p\ldots|\mathbf{x})$ of all labellings that strictly extend $p$. In fact, by definition, the relation between an extended prefix $p' = p + k$ (prefix $p$ extended with label $k \in L$) and its parent gives the recursions

$$\gamma_t(p'_n) = y_k^t \big( \mathrm{newLabelProb}(t) + \gamma_{t-1}(p'_n) \big), \qquad \gamma_t(p'_b) = y_b^t \big( \gamma_{t-1}(p'_b) + \gamma_{t-1}(p'_n) \big),$$

where

$$\mathrm{newLabelProb}(t) = \gamma_{t-1}(p_b) + (1 - \delta_{k,\, p_{\text{end}}})\, \gamma_{t-1}(p_n)$$

is the probability of starting the new label $k$ exactly at time $t$ ($p_{\text{end}}$ denotes the last label of $p$; if $k$ repeats it, the path must pass through a blank). Using these, we can calculate the path probabilities,

$$p(p'|\mathbf{x}) = \gamma_T(p'_n) + \gamma_T(p'_b), \qquad p(p'\ldots|\mathbf{x}) = \gamma_1(p'_n) + \sum_{t=2}^{T} y_k^t\, \mathrm{newLabelProb}(t) - p(p'|\mathbf{x}).$$

The extension procedure starts from the empty prefix $\emptyset$, with

$$\gamma_t(\emptyset_n) = 0, \qquad \gamma_t(\emptyset_b) = \prod_{t'=1}^{t} y_b^{t'}, \qquad \gamma_1(p'_n) = \begin{cases} y_k^1 & \text{if } p = \emptyset, \\ 0 & \text{otherwise,} \end{cases} \qquad \gamma_1(p'_b) = 0,$$

and iterates until the most probable complete labelling found so far, $\mathbf{l}^*$, satisfies $p(\mathbf{l}^*|\mathbf{x}) \ge p(p\ldots|\mathbf{x})$ for every unexpanded prefix $p$.
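The whole procedure fits in a short function. The sketch below follows the structure of the algorithm as given in Graves' dissertation; the data structures and names are ours, and like the other sketches it works with raw probabilities rather than log-scale values.

```python
import numpy as np

def prefix_search_decode(y):
    """Prefix search decoding over the output matrix y[t, k] (BLANK = 0).

    Best-first search over labelling prefixes; returns the most probable
    labelling as a tuple of label indices.
    """
    T, K = y.shape
    labels = [k for k in range(K) if k != BLANK]

    # gamma arrays for the empty prefix: only the all-blank paths survive
    gamma_n = {(): np.zeros(T)}              # prefix output, non-blank at t
    gamma_b = {(): np.cumprod(y[:, BLANK])}  # prefix output, blank at t

    p_empty = gamma_b[()][T - 1]
    best_l, best_p = (), p_empty             # l* and p(l*|x) found so far
    remaining = {(): 1.0 - p_empty}          # p(p...|x) of unexpanded prefixes

    while remaining:
        p_star = max(remaining, key=remaining.get)
        if remaining[p_star] <= best_p:
            break                            # no prefix can beat l* any more
        for k in labels:
            p = p_star + (k,)
            gn, gb = np.zeros(T), np.zeros(T)
            gn[0] = y[0, k] if p_star == () else 0.0
            prefix_prob = gn[0]
            for t in range(1, T):
                # probability of starting the new label k exactly at time t
                new_label = gamma_b[p_star][t - 1]
                if not (p_star and p_star[-1] == k):
                    new_label += gamma_n[p_star][t - 1]
                gn[t] = y[t, k] * (new_label + gn[t - 1])
                gb[t] = y[t, BLANK] * (gb[t - 1] + gn[t - 1])
                prefix_prob += y[t, k] * new_label
            gamma_n[p], gamma_b[p] = gn, gb
            p_l = gn[T - 1] + gb[T - 1]      # p(p|x): p as a complete labelling
            p_ext = prefix_prob - p_l        # p(p...|x): labellings extending p
            if p_l > best_p:
                best_l, best_p = p, p_l
            if p_ext > best_p:
                remaining[p] = p_ext
        del remaining[p_star]

    return best_l
```

On the toy example from the best path sketch, `prefix_search_decode` returns the labelling `('A',)` with probability 0.64 instead of the empty labelling.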
Given enough time, prefix search decoding always finds the most probable labelling. However, the maximum number of prefixes it must expand grows exponentially with the input sequence length, so we need a further heuristic.
Observing that the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks, we can divide the output sequence into sections that are very likely to begin and end with a blank. We do this by choosing boundary points where the probability of observing a blank label is above a certain threshold, then apply the above algorithm to each section individually and concatenate the results to get the final transcription.
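A minimal version of this heuristic might look as follows; the threshold value and the helper names are assumptions of the sketch, with `prefix_search_decode` taken from above:

```python
def decode_by_sections(y, threshold=0.99):
    """Cut where the blank probability exceeds the threshold, run prefix
    search on each section, and concatenate the partial transcriptions."""
    T = y.shape[0]
    cuts = [t for t in range(1, T) if y[t, BLANK] > threshold]
    result, start = (), 0
    for t in cuts + [T]:
        if t > start:
            result += prefix_search_decode(y[start:t])
        start = t
    return result
```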