One of the earliest approaches to isolated word speech recognition was to store a prototypical version of each word (called a template) in the vocabulary and compare incoming speech with each word, taking the closest match. This presents two problems: what form do the templates take and how are they compared to incoming signals.
The simplest form for a template is a sequence of feature vectors -- that is the same form as the incoming speech. We will assume this kind of template for the remainder of this discussion. The template is a single utterance of the word selected to be typical by some process; for example, by choosing the template which best matches a cohort of training utterances.
Comparing the template with incoming speech might be achieved via a pairwise comparison of the feature vectors in each. The total distance between the sequences would be the sum or the mean of the individual distances between feature vectors. The problem with this approach is that if a constant window spacing is used, the lengths of the input and stored sequences is unlikely to be the same. Moreover, within a word, there will be variation in the length of individual phonemes: Cassidy might be uttered with a long /A/ and short final /i/ or with a short /A/ and long /i/. The matching process needs to compensate for length differences and take account of the non-linear nature of the length differences within the words.
The Dynamic Time Warping algorithm achieves this goal; it finds an optimal match between two sequences of feature vectors which allows for streached and compressed sections of the sequence. The paper by Sakoe and Chiba (Dynamic Programming Algorithm Optimisation for Spoken Word Recognition) gives a detailed description of the algorithm; I will summarise it here.
We can arrange the two sequences of observations on the sides of a grid (Figure 11.1) with the unknown sequence on the bottom (six observations in the example) and the stored template up the left hand side (eight observations). Both sequences start on the bottom left of the grid. Inside each cell we can place a distance measure comparing the corresponding elements of the two sequences.
To find the best match between these two sequences we can find a path through the grid which minimises the total distance between them. The path shown in blue in Figure 11.1 gives an example. Here, the first and second elements of each sequence match together while the third element of the input also matches best against the second element of the stored pattern. This corresponds to a section of the stored pattern being stretched in the input. Similarly, the fourth element of the input matches both the second and third elements of the stored sequence: here a section of the stored sequence has been compressed in the input sequence. Once an overall best path has been found the total distance between the two sequences can be calculated for this stored template.
The procedure for computing this overall distance measure is to find all possible routes through the grid and for each one of these compute the overall distance. The overall distance is given in Sakoe and Chiba, Equation 5, as the minimum of the sum of the distances between individual elements on the path divided by the sum of the warping function (which will be discussed later). The division is to make paths of different lengths comparable.
It should be apparent that for any reasonably sized sequences, the number of possible paths through the grid will be very large. In addition, many of the distance measures could be avoided since the first element of the input is unlikely to match the last element of the template for example. The DTW algorithm is designed to exploit some observations about the likely solution to make the comparison between sequences more efficient.
The major optimisations to the DTW algorithm arise from observations on the nature of good paths through the grid. These are outlined in Sakoe and Chiba and can be summarised as:
Monotonic condition: the path will not turn back on itself, both the i and j indexes either stay the same or increase, they never decrease.
Continuity condition: The path advances one step at a time. Both i and j can only increase by 1 on each step along the path.
Boundary condition: the path starts at the bottom left and ends at the top right.
Adjustment window condition: a good path is unlikely to wander very far from the diagonal. The distance that the path is allowed to wander is the window length r.
Slope constraint condition: The path should not be too steep or too shallow. This prevents very short sequences matching very long ones. The condition is expressed as a ratio n/m where m is the number of steps in the x direction and m is the number in the y direction. After m steps in x you must make a step in y and vice versa.
By applying these observations we can restrict the moves that can be made from any point in the path and so restrict the number of paths that need to be considered. For example, with a slope constraint of P=1, if a path has already moved one square up it must next move either diagonally or to the right.
The power of the DTW algorithm goes beyond these observations though. Instead of finding all possible routes through the grid which satisfy these constraints, the DTW algorithm works by keeping track of the cost of the best path to each point in the grid. During the match process we have no idea which path is the lowest cost path; but this can be traced back when we reach the end point.
A path through the grid is written in the paper as F=c(1),c(2)...c(K), the generalised element of the path is c(k) and this consists of a pair of coordinates in the i (input) and j (stored) directions. The i coordinate of the kth path element is i(k).
The weighting function w(k) introduced into the overall distance measure is used to normalise for the path length. Two alternate weighting functions are presented: symmetric and asymmetric. Both functions are derived from the distance travelled (in grid units) in the last step of the path. The symmetric form combines the i and j directions while the asymmetric form uses just the i direction.
If the path has just made a diagonal step then i and j both increase by 1 and the symmetric w(k) = 1+1 = 2; the asymmetric w(k) = 1; The sum of this function over the length of the path gives a measure of how long the path is. This is used in normalising the overall distance measures.