Department of Computing,
Macquarie University

Sydney

Australia

`<Steve.Cassidy@mq.edu.au>`

Copyright © 2002 Department of Computing

**Table of Contents**

You can get a single file version of this document here.

This unit is intended to introduce you to the area of speech recognition. The overall goal is that you understand enough about the underlying processes in a recogniser and about the structure of spoken language that you can make effective use of ASR systems and perhaps contribute to improve their inner workings.

These notes are intended to guide you through the unit and will be suplemented by copies of papers and book chapters as appropriate. You might notice that in places they seem cobbled together which is not surprising since they are (from SLP801 and SLP806 offered last year in the MSc Speech and Language Processing).

**Table of Contents**

The assesment for this unit will consist of a number of practical assignments and one written assignment.

*Data collection and Gaussian classification*(20%) You will collect some speech data (from yourself) and perform a small study on vowel classification using just one feature vector per vowel. You will compare some alternative feature types for this task.*Due 15 April 2002**Speech Recognition*(30%) you will work in teams of 2 to train a the Sphinx speech recognition system using Australian speech data. You will be assesed based on a report of the results of your work.*Due end of semester.**Speaker Recognition*(10%) You will develop a small speaker recogniser using data recorded from yourself and other students.*Due ??.**Written Assignment*(40%) You will write two reports on topics in speech recognition.*Due: Weeks 6 & 13.*

In order to avoid long delays processing marks, I will be unwilling to provide extensions except in genuine cases. If you wish to request an extension, please contact me before the deadline for an assignment. Penalties will be deducted for late submissions without good reason.

The intention of this assignment is to introduce you to speech data collection and to give you first hand experience with the annotation of speech data. You will record a small collection of isolated words and digit strings spoken by yourself or another volunteer -- these recordings will be made directly into your computer. Once you have this data you will view the acoustic signal and spectrogram and then annotate the data with the start and end times of each phoneme.

To make the recordings you will need a computer with a soundcard and a microphone. To produce good quality recordings you should use a good quality headset microphone and adjust the recording level so that the peak amplitude is close to the maximum level but does not exceed it -- you want to avoid `clipping' of the signal but get as much detail as possible in the recordings.

To complete this assignment you'll need the Emu speech database system which you can download from here for either Unix, Mac or Windows. You should also get a copy of the R statistical package and the associated Emu library as described on this page. I'll give you some pointers for using this software as the term progresses.

To record the data, use the Emu Capture tool which should be
installed by default on Windows and Mac and which is available from
me for Unix. This tool takes a list of words to be recorded (one
word per line) and a template skeleton and then prompts you for the
words and saves the recordings and a template file to disk in the
right place for Emu to find them. Once you've recorded your data
you should be able to double click the template file (on
Windows/Mac) to annotate the data with the Emu labeller. On Unix
type `emulabel digits` to accomplish the same
thing.

The set of words you record should include at least
*four repetitions* of all the monophthongal
vowels in your variety of English in an hVd context (eg. had, hard,
etc). The goal for these recordings is to be able to describe your
own vowel space and to compare this with those for standard accents
such as Australian English or Southern British English. As an
example, the following words can be used for Australian
English:

hod heard hid had hard hood heed whod head hudd

When making these recordings, take care to randomise the position of the words in the list for each repitition and as far as possible avoid `list intonation'. Note that repitition means a separately prompted recording, not just saying the word four times at the prompt.

You should also record *two repititions* of
the following digit strings:

9541 1249 3263 9874 6521 5169 6853 7268 5094 5320 4877 6687 5039 3300 2140 8790 1457 2121 3879 4806

Once you have recorded the data you need to construct an Emu database template for your data. This is a text file that describes the kinds of annotation you will make on the data, the location of the data and label files and various other configuration parameters. You will be labelling phoneme boundaries and grouping the phonemes into words. You will need the following two template files (one for isolated words and one for connected digits):

! template skeleton for COMP449 data collection -- isolated words ! save as words.tpx in your Emu installation directory level Speaker level Word Speaker level Phoneme Word label Speaker Sex label Speaker Accent label Speaker Set labfile Phoneme :extension phn :time-factor 1000 ! Here we define the legal label categories for various levels ! this gives us a menu in the labeller and we can use the ! category names in queries ! legal Phoneme vowel A E I O V U ai ei oi i@ u@ au @u @: @ a: e: i: o: u: legal Phoneme monophthong A E I O V U @: a: e: i: o: u: legal Phoneme shortvowel A E I O V U legal Phoneme longvowel @: a: e: i: o: u: legal Phoneme diphthong ai ei oi i@ u@ au @u legal Phoneme stop p tS dZ t k b d g legal Phoneme nasal m n N M legal Phoneme fricative f v s z S Z h D D- T legal Phoneme approximant w j l r legal Phoneme other H # legal Set - hVd digits legal Sex - M F ! possible accent labels, you could add more if you don't fit one of these legal Accent - Australian US Singapore Malaysia ! ************************************************************ ! Define the location of files, this should be set to the right ! location by the capture script. ! ************************************************************ path hlb,phn,trg %labels path wav %signals ! this tells Emu to look for sampled speech data in wav files track samples wav set PrimaryExtension wav

! template skeleton for COMP449 data collection -- digit strings ! save as digit.tpx in your Emu installation directory level Speaker level Word Speaker label Speaker Sex label Speaker Accent label Speaker Set labfile Word :extension wrd :time-factor 1000 legal Word - one two three four five six seven eight nine ten zero oh legal Set - hVd digits legal Sex - M F ! possible accent labels, you could add more if you don't fit one of these legal Accent - Australian US Singapore Malaysia ! ************************************************************ ! Define the location of files, this should be set to the right ! location by the capture script. ! ************************************************************ path hlb,wrd %labels path wav %signals ! this tells Emu to look for sampled speech data in wav files track samples wav set PrimaryExtension wav

When you have your database working you should annotate each recording. For the isolated hVd words you should mark in the boundaries of the phonemes in each word. For the connected digits you should mark in the start and end of the digit string and label it (at the Word level) with the numerical string being spoken. You will mark these on the levels shown in the signal view of the Emu labeller. For some notes on using the labeller for this task see Section 4.1 of the Interactive Worksheets on the CDROM with Harrington and Cassidy and the relevant sections of the Emu manual.

Our goal is to plot the positions of the first two formants for each vowel in your hVd database so that you can compare them with other data. Unfortunately, I can't make an automatic formant tracker available to you, we only have a commercial one which runs on Unix systems. The easiest solution is for you to take measurements of the formant frequencies by hand from the spectrograms displayed in Emulabel. I realise that this is a very backward methodology and makes little use of the Emu software that you have learned to use in SLP802 but it really is the only workable solution at the moment.

Measure F1 and F2 at the vowel target for each of the vowels in your isolated word set. This should give you two measurements per vowel (from the two repetitions).

Generate plots of your data on the F1/F2 plane.

Present your results as a web page, include links to download your speech data (pack it up as a zip file), label files and the template file used for Emu.

The aim of this exercise is to look at the use of distance measures and Gaussian models in classifying speech data according to various feature sets. We will study vowel data taken from the ANDOSL isolated word set that is included on the Harrrington and Cassidy CDROM. The goal of the study is to compare the utility of formants, bark scaled spectra and cepstra in classifying these vowels.

The result of your analysis will be in the form of a report on the experiment; it should include a method section, a description of the different experiments you carried out and a presentation and discussion of your results.

The data for analysis is in this zip file (235k), the file contains the following files:

`short.seg`- a segment list of 860 short vowels A E I O U V from all speakers in the isolated database.`short.info`- a list of the speaker identifiers and their sex for each vowel, speakers are`db, dw, jc, mb, nd`(male) and`dp, jn, nb, ru, vm`(female). Sex is coded as`m`or`f`.`short-fft.dat`- a 512 point fourier transform taken at the mid point of each vowel. The signal was pre-emphasised before the fft was carried out. This data is not intended for a direct classification experiment, instead it was used to generate the cepstral data and could be used to generate other parameters if you wish. This file is not included in the above zip file but is available seperately to download if you wish as short-fft.zip (1.2M).`short-fm.dat`- the first four formants, automatically tracked using the ESPS formant tracker.`short-bark.dat`- 22 bark scaled spectral bands averaged over the duration of each vowel. The signal was pre-emphasised before the analysis was carried out.`short-cep.dat`- 40 cepstral coefficients generated from the fft data in`short-fft.dat`.

You will use R and Emu to analyse this data, the instructions below give you more or less everything you'll need to carry out the experiments. See me for help getting everything installed etc. Once you have R running you need to load the Emu library with:

library(emu)

After which you should see the Emu version number etc to verify that it has loaded ok.

Each of the data files is a text file with one row per vowel
and different numbers of columns depending on the size of the
feature vector. To read them into R you can use the
`matscan` function as follows:

short.fm <- matscan( "short-fm.dat", 4 ) short.fft <- matscan( "short-fft.dat", 512 )

Where the second argument to `matscan` is the
width of the data. `matscan` is part of the Emu
library so you will need to load that first.

You can read the segment list into R using the command:

short.seg <- read.segs( "short.seg" )

This is just the result of a query to the database and has the same
form as the result of `emu.query`.

You can read the speaker information using the following command:

short.info <- read.table("short.info", col.names=c("speaker", "sex"))

Cepstral data can be generated from the fft data by first defining a function to derive the cepstrum:

## define a function to make a cepstrum from an fft spectrum fft2cep <- function(x,n) { as.vector(Mod(fft(log10(x), T ))/length(x))[1:n] }

and then applying that function to each row of the fft data matrix. Note that the number 30 here refers to the number of cepstral coefficients (the argument 'n' to fft2cep):

## apply the fft2cep function to each row of the fft data ## we need to transpose t() the result because of the way that apply() ## works short.cep <- t(apply( short.fft, 1, fft2cep, 30))

As an aside, note that if you want to calculate a cepstrally smoothed spectrum you need to keep the cepstrum in the complex domain (the result of an fft is always a complex vector representing amplitude and phase). The following function will derive a cepstrally smooted spectrum from a raw fft:

fft2cspectrum <- function(x,n) { ## calculate the cepstrum y <- fft(log(x))/length(x) ## blank out the middle section y[(n+1):(length(x)-n-1)] <- 0 ## return the real fft return(Mod(fft(y))) }

You can then plot an FFT and smoothed spectrum of the first row of short.fft as follows:

par(mfrow=c(2,1)) x <- short.fft[1,] plot(log(x), type="l") plot(fft2cspectrum(x,50), type="l")

You will carry out a set of classification experiments as outlined in section 9.4 of Harrington and Cassidy. The steps are as follows:

Partition the data into training and testing sets.

Train a Gaussian model on the training data.

Classify the test data using the model.

Compare the classifications with the true labels.

## we will take the first 600 tokens as training data, the rest to test select <- 1:600 traindata <- short.fm[select,] # note the comma means keep all columns. trainlabs <- label(short.seg)[select] # but not here (it's a vector) select <- 601:860 testdata <- short.fm[select,] testlabs <- label(short.seg)[select] ## now we'll train a model model <- train( traindata, trainlabs ) ## and test it autolabs <- classify( testdata, model ) ## and report on the results confusion( testlabs, autolabs ) perform(confusion( testlabs, autolabs ))

The report on the experiment above is a confusion matrix as
discussed in Harrington and Cassidy. The `perform`
function sums the diagonal elements of the matrix to give an
overall performance score.

For the cepstral data, remember that the lower numbered cepstral coefficients encode the vocal tract shape while higher order ones encode source information. Speech recognition systems often use 10-12 cepstral coefficients. You should experiment with different numbers of coefficients to find an optimal number for this experiment. Look at the results for a large number of coefficients (40), comment the results of these experiments.

When you are experimenting with different numbers of coefficients, the proper procedure is to use a different set of testing data to your final test set. In this way you ensure that you are not optimising the feature set for this particular set of data and are in fact getting a realistic evaluation in your final test. One option is to perform closed tests while optimising the feature set -- test on the same data that you train on. A better way is to partition the data into three sets, one for training, one for optimisation testing and one for the final evaluation. Each partition should be representitive of the data as a whole as far as possible, one valid approach if you are interested in cross-speaker performance is to train and test on different speakers, eg. train on five speakers, optimise on two speakers and test on three speakers.

To select according to speaker use the following procedure:

select <- muclass( short.info$speaker, c("db", "jc") ) traindat <- short.fm[select,] trainlab <- label(short.seg)[select,]

The function `muclass` returns a vector
which is true for labels (first argument) which are in the supplied
label vector (second argument). This vector can then be used to
subset the data.

*Hint:* you might like to encode the
experimental procedure as an R function so that you don't have to
repeat the same command sequence over and over. This is quite easy
to do, for example the above commands could be made into a function
with the lines:

experiment <- function( data, labels ) { ## we will take the first 600 tokens as training data, ## the rest to test select <- 1:600 traindata <- data[select,] trainlabs <- labels[select] ... confusion( testlabs, autolabs ) } # this can be called with experiment( short.fm, label(short.seg) )

If you wanted to be more general you could calculate the train/test partition boundary based on the number of tokens in the data set.

As an *optional* additional experiment,
you can use principal components analysis to reduce the
dimensionality of the Bark scaled spectral data. The
`princomp` function performs a PCA on a data
matrix, the procedure would be as follows:

## load the mva library to get the princomp function library(mva) short.bark.pca <- princomp(short.bark) short.bark.transformed <- short.bark %*% loadings(short.bark.pca)

You can now perform a classification experiment on the
transformed data set, using either the whole data or a subset of
the columns; for example
`short.bark.transformed[1:10,]` gives you the
first ten principal components.

You can generate a scree plot (see H&C p. 269) by just
plotting the result of the call to `princomp`, see
the `princomp` help page for an example.

The result of your analysis will be in the form of a report on your experiments; it should include a method section, a description of the different experiments you carried out and a presentation and discussion of your results. While I don't want to see all of the R commands you used to generate your results, you should explain the major steps. You may want to include plots of vowel data in the formant plane to illustrate the distribution of the data.

The aim of this exercise is to implement a basic speaker recognition system using VQ distortion measures.

As a second stage of this assignment you will implement a basic speaker identification system using the same basic components. The major difference is that the stored models correspond to a single speaker rather than a word. You will implement a VQ distortion based speaker identification system.

To implement a speaker recognition system you need to be able to build a per-speaker VQ codebook and measure the VQ distortion for an input speech signal given a codebook. Here is an outline of the components you will need to write:

*Data Input.*You will use Emu to retrieve digit data from your own labelled recordings. Emu can give you the data corresponding to each digit along with it's label.*Feature Extraction.*There are a number of options for feature extraction. To simplify the process we can pre-calculate feature vectors using tools outside of the R environment and read them in using Emu.*Vector Quantisation.*You will implement a basic vector quantiser using the inbuilt k-means clustering functions. VQ allows you to pre-compute the distances between codebook entries.Write a function to generate a VQ codebook model for each speaker in the input data.

Write a function to calculate the VQ distortion for some input speach and a VQ codebook. VQ distortion is defined as the summed distance between each input feature vector and the closest VQ codebook entry. You should write one function to calculate the distortion for one feature vector and another for applying this to the whole input vector sequence. Note that these functions will be very similar to those you use to VQ encode an input sequence. The summed distortion should be normalised by dividing by the length of the input sequence.

Write a function to compare the VQ distortion on a number of speaker codebooks for an input feature vector sequence (using the above functions) and output a speaker decision.

The first stage in constructing your recogniser is to build a vector quantisation routine. This is aided significantly by the provision of a k-means clustering routine in R. You can generate a set of cluster centers using the single line:

library(mva) # to load in the kmeans function clusters <- kmeans( digits.a.melcep$data, 64 )which will find 64 clusters in the data. The cluster centers are now stored in

You need to write a function which takes a set of data and returns the result of the kmeans function; this will act as your VQ model which you will then use in another function to assign data to one of the clusters. This second function needs to take a data matrix and the cluster object as an argument and return a vector of cluster numbers. For example:

clusters <- generate.vq( digits.a.melcep ) data.vq <- vq( digits.a.melcep[1]$data, clusters )The second function is operating on the data for the first token in the melcep trackdata object (it is possible to write a function which operates on a trackdata object, returning a new trackdata object. I can explain how if you want to go that way.)

As a final step you can pre-calculate the distances between
each vq cluster centre and store the result in a matrix. There is a
convenient function `dist` which does exactly this
but you should ensure that the distance metric that it is using is
the same that you have used in your vq function. Once you have a
distance matrix you can visualise the result using the
`image` function.

In order to be able to test your functions it will be useful
to work with data that can be visualised. I suggest you use some of
the formant data we worked with earlier and generate a vq codebook
for the first two formants. Then you can plot the formant data with
their cluster labels (using `eplot`) to verify
that you are indeed doing sensible clustering with your vq
routine. I tested my routines on the first two columns of the mel
cepstral data, here is a picture of the resulting clustering and
the commands used to generate it.

model <- generate.vq( digits.a.melcep$data[,1:2], 8 ) foo <- vq( digits.a.melcep$data[,1:2], model ) eplot( digits.a.melcep$data[,1:2], foo, dopoints=T, doellipse=F )

**Figure 2.1. An example clustering of the first two columns of the
melcep data for speaker `a'. Note that the data for vq label `8'
has been plotted in white, it fits in between the green '3'
cluster and the cyan '5' cluster.**

The second part of your assignment is to implement the Dynamic Programming template match discussed in Chapter 11. The core of the template matcher is a function which compares two input sequences, in our case these will be sequences of codebook entries. The comparison is made using the time warping algorithm we looked at, you can take the simplest form as your starting point making it more complex if you have time and enthusiasm. The function will need to build a matrix of distance values as it compares the two sequences. The value of each cell is calculated from those of the neighbouring cells. The resulting distance is the value of the final cell in the top right of the matrix. At this point we aren't interested in the path taken through the matrix but you might want to add that functionality if you are interested in seeing it work.

To write the dtw function, first allocate a matrix of
distance values of the appropriate size (`dd <-
matrix(nrow=length(template), ncol=length(input))`) and
then step through the template and input (with two nested
`for` loops) calculating the distance for each
cell. The result of the function is the distance value in the final
cell of the matrix. This basic function can then be optimised by
adding in the adjustment window condition so that only cell
contents within some distance of the line i=j need to be
calculated.

It is difficult to know whether you've got the right answer once you've written the code. To verify your procedure I'd suggest constructing a simple case to which you can work out the answer by hand. Let's say you have a codebook with centers:

1: 1 2 2: 3 4 3: 5 6 4: 7 8then you can work out the between code distance matrix easily and then calculate the distance between the input vectors:

[1,2,3,4,3,2,1] and [1,3,4,3,3,1]for example. You should be able to do this by hand and then verify the result using your function (including looking at the matrix values which would normally be thrown away).

As an extension to the simple dtw function, you could calculate best match path between the two vectors. To do this you need a second array which records which of the three possible paths into a cell were taken. At the end of the procedure you can trace back the best path in this array.

Once you have the core `dtw` function you
need to write code to match an input vector against a set of stored
templates and output the label of the closest template. You can
store the set of templates as a list in R (you can't store them in
an array since they aren't of the same length). I suggest you
choose a random example of each digit as your template in the first
instance, the alternative is to implement some kind of template
selection procedure which is clearly more complicated. For example,
the first element in the `digits.a` data is a
nine, and the second is a five. Here is an example function which
selects the first example of each label as the template and returns
a list of the templates and their labels. It assumes there is a
function `vq` which can vector quantise a data
sequence.

select.templates <- function(segments, data, codebook) { ## select a set of vq templates given an input data set, ## we want to choose one example of each label and store ## the vq version of it as our template. ## The result is an object with components $templates, which ## is a list of the vq templates, and $labels which is a vector ## of the corresponding labels. ## eg. ## codebook <- generate.vq( mydata, 32 ) ## tpl <- select.templates( mysegs, mydata, codebook ) ## tpl$templates[[1]] -- is the first template ## tpl$labels[1] -- is the corresponding label ## ind is a vector of index numbers, one per segment in the data ind <- 1:nrow(segments) ## set up an empty result templates <- NULL templates$templates <- NULL templates$labels <- NULL for( lab in unique( label(segments) ) ) { ## select an example of this label, take the first one n <- ind[label(segments)==lab][1] ## calculate the vq version of the nth data element nvq <- vq( data[n]$data, codebook ) ## and add it to the list templates$templates <- c(templates$templates, list( nvq )) ## and add the label to the labels list templates$labels <- c(templates$labels, lab) } return(templates) }

Your recognition function will take the result of
`select.templates` (or similar) and an input
sequence. It will vector quantise the input and match it against
every template, selecting the best match and returning the
corresponding label. You can then write another function to call
this function with each element of the data for testing purposes.
You should include some test results in your final submission: how
good is your recogniser on a same-speaker evaluation or on
cross-speaker evaluation? How does the codebook size affect
recognition performance? You should also look inside your decision
algorithms and see how close the correct answer was in the cases
where you got it wrong.

One peculiarity of R is that `for` loops are
not the most efficient way to process vectors or arrays of data. R
is able to do vector arithmetic as you know, and this is much more
efficient than doing the same calculation in a loop. Similarly, if
you are doing some operation to every row of a matrix to generate a
new matrix, it is faster to use the `apply`
function than to use a `for` loop. For example, to
sum the rows of a matrix foo:

foosum <- apply( foo, 1, sum )is much better than:

foosum <- 0 for( i in 1:length(foo) ) { foosum <- foosum + foo[i] }look at

This might help you make your program faster, although you
can still write a perfectly valid (and often more readable)
function with for loops, you won't be penalised for not using
`apply`.

You will hand in your completed code for the speaker recognition problems plus a short report on how the problem was solved and how the programs perform on the data provided. The evaluation part is important since I want to see that you've run the code that you've written and if appropriate, modified the programs in the light of what you observe (eg. changing thresholds etc.).

The goals of this assignment are to gain first hand experience in training an HMM based speech recogniser. We will use the freely available Sphinx recogniser and train it on some Australian speech to get a localised recognition engine. This project is sufficiently involved to have you work in pairs to complete it. I will give each pair a different set of training data so that we can compare results at the end. Tasks that will need doing are:

Pre-process the speech data to fit into the formats required by the training tools.

Develop a pronunciation dictionary for Aus. English in the appropriate format for Sphinx. There are a few sources of pronunciations that we might use here.

Establish an HMM architecture, how many states per model, generate initial models etc.

Train the HMM

Develop a language model for a chosen test application (eg. route planning!)

Deploy and evaluate the recogniser.

Many of the details will be worked out as we go but we are helped by a good set of tools and documentation from CMU at the Sphinx website.

**Table of Contents**

This section is included here as a reference only. The material here was covered briefly in the first two weeks of class.

**Table of Contents**

- 3.1. Speech signals
- 3.2. Properties of Sinusoids
- 3.3. Adding Sinusoids
- 3.4. Fourier theory
- 3.5. Spectrograms
- 3.6. Exercises
- 3.7. Additional Reading

Speech propagates as a longitudinal wave in a medium, such as
air or water. The speed of propagation depends on the density of the
medium. It is common to plot the *amplitude* of
air pressure variation corresponding to a speech signal as a
function of time; this kind of plot is known as a speech pressure
waveform or just a speech waveform. Here's an example taken from
part of a vowel:

The plot shows the change in air pressure with time and we can
see that in this example the pattern of changes is regular in that
each up-down cycle has roughly the same shape. This corresponds to
a *periodic* signal and these signals make up the
voiced speech sounds such as vowels, and voiced consonants such as
`b', `w' and `z'. Another kind of sound is irregular or
*aperiodic* and so appears to be just random
variations in air pressure. Examples are the hissing sound in `s'
and `sh'.

The simplest form of sound that we can describe is called a sinusoid. This waveform corresponds to a pure tone and as we will see later forms the basis for all more complex sounds. Since a sinusoid has such a pure shape it is a useful starting point when describing the properties of sound waves.

A sinusoid corresponds to a simple smooth up and down movement when drawn as a time waveform. Three kinds of measurement can be taken from a sinusoid and together they completely define the shape of waveform. These measurements are:

*Amplitude:*the size of the displacement of the sinusoid above and below the mid line. This corresponds to the energy in the sound wave and hence how loud it appears to be. There are many ways of measuring amplitude; since it relates to the size of the pressure variations in the air it can be measured in units of pressure. More often we talk about deciBels (dB) which measure amplitude on a logarithmic scale relative to a standard sound. The dB scale is useful since it maps directly to the way that humans percieve loudness.*Frequency:*the number of cycles made by the sinusoid each second. A cycle consists of an oscilation from the mid line to the maximum, down to the minimum and back to the midline. Frequency is measured in*cycles per second*or*Hertz (Hz)*. The*period*is the inverse of this -- the time taken for one cycle. Changes in frequency are percieved as changes in the pitch of a sound (although pitch has a more complicated perceptual definition so the two quanitities are not the same).*Phase:*measures the position of the starting point of the sinusoid. Those starting at the maximum have a phase of zero while those starting at the minimum have a phase of π radians. Phase is best explained with reference to a rotating wheel, see Harrington and Cassidy Chapter 2, Figure 2.7. for more details. The phase of a sinusoid cannot be percieved but we can detect relative changes in phase between two signals, in fact this forms the basis of binaural hearing as the brain works out the location of a sound based on the different phases heard at the two ears.

When two sinusoids are added together the result depends upon their amplitude, frequency and phase. The effects are easiest to observe when only one of these is varied between the two sinusoids being added.

In the simplest case, when two sinusoids with the same frequency and phase but with different amplitudes are added together the result is a sinusoid who's amplitude is the sum of the originals and who's frequency and phase remain unchanged.

When the two sinusoids have different frequencies the result is more complicated. The new signal is no longer a sinusoid since it doesn't follow the simple up and down pattern. Instead we see the higher frequency sinusoid as a `ripple' superimposed on the lower frequency sinusoid (see figure). The frequency of the resulting signal will be the lower of the two original frequencies. In the figure we can see that the resulting signal goes through two and a half cycles (that is it repeats itself this many times) just like the second sinusoid. The amplitude of the resulting signal is the sum of the originals and the phase is unchanged.

**Figure 3.1. Addition of sinusoids with different frequencies.**

Adding together sinusoids with different phases can have surprising results. If two sinusoids are `in phase' then their peaks and troughs coincide and the result is as observed earlier. If the sinusoids are `out of phase' or in other words if they differ in phase by π radians then their peaks and troughs oppose each other and they will cancel each other out. The phase of the resulting sinusoid is the sum of the phases of the constituents.

Adding up more than two sinusoids can produce complex looking
waveforms. When a number of different frequencies are combined the
frequency of the result will in general be that of the lowest
frequency component. This is often called the *fundemental
frequency* of the signal.

Fourier theory says that any complex periodic waveform can be
decomposed into a set of sinusoids with different amplitudes,
frequencies and phases. The process of doing this is called
*Fourier Analysis* and the result is a set of
*amplitudes*, *phases*and
*frequencies* for each of the sinusoids that
makes up the complex waveform. Adding these sinusoids together again
will reproduce *exactly* the original
waveform.

A plot of the frequency or phase of a sinusoid against
amplitude is called a *spectrum*. For a pure
sinusoidal tone this is just a single line as shown in Figure
2.9. of Harrington and Cassidy. An amplitude spectrum of a complex
waveform shows the amplitude of each frequency component or
sinusoid. The phase spectrum shows the phases of these components
but is very seldom used in speech analysis. Spectra are used
extensively in speech analysis to discern the characteristic shapes
associated with different classes of speech sound.

Any periodic signal shows a pattern of repitition in the time
waveform which corresponds to the primary rate of vibration of the
signal, known as the *fundamental frequency*.
This corresponds to the lowest major frequency component in the
signal and, in the case of voiced speech, equates to the frequency
of vibration of the vocal folds. A vibration source like the vocal
folds also produces what are called *harmonics*
which are oscillations at multiples of the fundamental
frequency. For example, a 100Hz source vibration will give rise to
harmonics at 200Hz, 300Hz, 400Hz etc. In the spectrum of such a
signal, the harmonics show up as spikes at these frequencies.

The fundamental frequency can be measured from a speech waveform by looking for the period of oscillation of the signal around the zero axis. Estimates can also be made from the spectrum since it shows a large peak at this frequency and at each multiple due to the harmonics. Measurements can be made of the frequency of the major peak or of the distance between harmonic peaks.

A spectrum shows the freqency content of a short section of a waveform but it we want to see how the spectrum changes with time through a section of speech we need a different kind of display. A standard way of displaying this information is a spectrogram; this is a two dimensional plot of frequency against time where the amplitude at each frequency is represented by the darkness of the corresponding point in the display. The figure shows an example spectrogram aligned with the speech waveform for the word `stamp'. Experienced phoneticians can glean a lot of information from these kinds of displays and you will look at lots of them during your later practical work. For now we can observe a few prominent features: the dark are corresponding to the high frequency energy in the /s/ phoneme, the vertical striations in the /A/ phoneme corresponding to the fact that this vowel is a periodic signal (voiced) and the sudden onset of the relatively light area corresponding to the release of the /p/ at the end of the word.

**Figure 3.2. An example wideband spectrogram of the word `stamp'.**

The vertical striations in the vowel of `stamp' give another way of measuring the fundamental frequency of the signal at this point. In this kind of spectrogram (called a wideband spectrogram) each vertical striation corresponds to one open-close cycle of the glottis; hence measuring the time between striations can tell us the period of oscillation and hence the fundemental frequency of the speech signal at this time.

The worksheets included with Harrington and Cassidy contain a number of exercises which are useful in illustrating the properties of sinusoids and spectra. Section 2 of the 'Interactive Workseets' illustrates the shape of sinusoids and the result of adding them together, measurements of frequency from complex waveforms and shows some examples of spectrograms.

Basic acoustics is covered in both Ladafoged (Chapter 8) and
Harrington and Cassidy (Chapter 2). For a much more detailed and
slower paced discussion of sound propagation and the properties of
waveforms see *Introduction to Sound* by Charles
E. Speaks published by Singular Publishing (ISBN:
1879105985).

**Table of Contents**

Much of this unit will be concerned with the manipulation of speech signals by digital computers. The power of modern desktop computers means that advanced signal processing and speech analysis software can run on the average computer. In order to take advantage of this, and in order to understand the properties and limitations of digital signals, we need to look at how a speech siganl is converted into digital form.

An acoustic signal is transmitted via the motion of the
molecules of the air or other medium through which it passes. As we
saw earlier the signal takes the form of changes in pressure over
time and we characterise the signal by measuring this pressure as a
function of time. Pressure is an *analog*
quantity in that it varies continuously and can take on an infinite
number of different values.

Although sound starts it's life as changes in air pressure, it
can also be transmitted in different ways, for example via a
microphone and telephone cable to a remote listener. The process by
which sound changes from being a pattern of vibration in the air to
a pattern of electrical energy in the telephone cable is known as
*transduction*. Devices which change the form of
a sound signal, in this case the microphone and loudspeaker, is
called a *transducer*. The most interesting class
of transducers for us are those that turn sound pressure waves into
patterns of electrical energy. We can measure the electrical energy
in the same way that we measure pressure to get a picture of the
pattern of variation within the acoustic signal. Electrical energy
is again an analog quantity and so we now need some way of changing
this into the digital domain of modern computers.

Digital computers work at their lowest level with binary or base 2 numbers. In order to analyse and manipulate speech signals in the computer we need to convert the continuously varying signal into a series of discrete numbers; this process is known as digitisation. The basic digitisation hardware is known as and analog-to-digital or A/D converter and it takes snapshots of an input analog signal at regular intervals outputting a number which is closest to the magnitude of the snapshot measurement. Hence if the converter sees 1.376 volts it might output a value of 1.4 volts -- the closest `round number' to the input measurement. The questions of importance when talking about digitisation are the rate at which snapshots are taken and the size of the smallest change that can be detected in the input signal.

It should be clear that taking a series of snapshots of a signal can only capture an approximation of the original. When a movie or TV camera captures thirty or so frames per second we rely on persistence of vision to transform the result into a continuous moving image when played pack. Even so, any event that happens between snapshots or any changes that happen too quickly will not be reproduced properly: the classic example is the wheels on covered wagons in John Wayne movies which appear to be going backwards. The same problem arises in digitising audio signals and so we must be careful to capture at least the important information in the signal; to do so we need to understand what the limitations are.

The *sampling frequency* or sampling
rate is a measure of the number of snapshots taken from the signal
each second. Sampling frequency is measured in Hertz, just like
any other frequency measure.

If we consider sampling a sinusoid signal we can see what
the limits of the process are. Figure 5.2 from Harrington and
Cassidy (reproduced here)
shows two sinusoids with different frequencies sampled at the same
sampling frequency of 10Hz. In the bottom pane, a signal with a
frequency of 5Hz is sampled at at each peak and trough and once
close to the midpoint of each cycle: four samples for each cycle
in total. The result is a version of the original which retains at
least its frequency and amplitude if not the exact shape. The
upper pane shows a 15Hz signal and we can see that because the
sampling frequency is too low, most of the peaks and troughs are
missed; as it happens the sampled signal matches exactly that from
the 5Hz example, so the original 15Hz signal would be reproduced
as a 5Hz signal if a sampling rate of 10Hz was used. This
phenomenon is known as *aliasing*; the higher
frequency signal is said to be aliased onto the lower frequency
signal.

The absolute minimum number of samples per cycle needed to
properly reproduce a sinusoid is two -- one at the peak, one at
the trough. This will give a crude approximation to the original
signal but will be able to capture the frequency and amplitude.
This means that the sampling frequency should be at least twice
the frequency of the sinusoid being digitised; this is know as the
*Nyquist Frequency*.

Since we know that any complex signal can be thought of as the summation of a number of sinusoids, the above result can be stated more generally. The sampling frequency should be at least twice the frequency of the highest frequency of interest in the input signal. The telephone system uses a sampling frequency of 8000Hz and so can capture only information up to 4000Hz. In studying speech recorded in quiet conditions we often use a sampling frequency of 20000Hz which gives information up to 10000Hz.

From the earlier example we saw that signals above the
Nyquist frequency will be aliased so that they look like signals
below it. In fact the Nyquist frequency acts like a mirror such
that any sinusoid with a frequency of F_{N} +
x will appear as a sinusoid of frequency F_{N}
- x. For complex signals, this has the consequence that any
information above the Nyquist frequency will reflect back onto the
lower frequencies and contaminate the result. In order to
reproduce the information below the Nyquist frequency accurately
the higher frequency information must be removed
*before* the signal is digitised. The device
which does this is a *low pass filter* often
called an *anti-aliasing filter*.

We are used to using a base ten number scheme but modern
computers use *binary numbers* for various
reasons. Binary numbers consist only of ones and zeros which is
useful as these can be stored and transmitted as the presence or
absence of a physical property, like electrical charge or a high
frequency tone. Each *bin*ary
dig*it* is called a *bit*
and all information inside a computer is stored as strings of
bits.

A binary number can be decoded by adding up powers of two;
each place in the digit corresponds to a higher power of two,
starting from the right most position which is
2^{0} = 1 followed by
2^{1} = 2 then 4, 8, 16 etc. So the
binary number `1011' is 1*8 + 0*4 + 1*2 + 1*1 = 3. Decoding these
numbers by hand can be longwinded but the process is easy to
understand. Any decimal integer can be rewritten as a series of
binary digits.

When we digitise an acoustic signal it is turned into a sequence of binary numbers by the analog-to-digital hardware. There is an important consequence of this process that we need to understand: that these devices use a fixed number of binary digits to represent each sample and hence that the size of the smallest change that can be detected in the input is related to the number of bits used.

Analog to digital hardware uses a fixed sample size to
represent the sampled acoustic signal; typically 12 or 16 bits are
used per sample. A little arithmatic will tell us that 12 bits
will give us a maximum of 2^{12} = 4096
different numbers while 16 bits gives
2^{16} = 65536 values. These numbers will
be used to represent the different input voltages taken from the
microphone. When the hardware measures the size of the input
voltage from the microphone, instead of calculating a voltage
value it merely assigns it a number on a scale of 0 to 4096 (for a
12 bit digitiser). Since our input signal consists of both
positive and negative peaks (oscillations), the 0 point should
correspond to the largest negative peak and the 4096 point to the
largest positive peak. The output of the digitiser when the input
is zero should be around 2048.

The outcome of this is that the continuously variable input
signal is *quantised* into one out of 4096
values. If, for example, the range of the input were +3v to -3v
then each binary digit (bit) would correspond to a change of
6/4096=0.0015v in the input. Any change in the input smaller than
this will not be captured. This is a similar consequence to that
in the previous section where events which happen between samples
cannot be resolved. The result of quantisation can be seen in
Figure 5.4 on page 135 of Harrington and Cassidy (reproduced
below). The sampled signal (dotted line) is an approximation of
the original in both time and amplitude.

A digitiser that uses 16 bits will be correspondingly more accurate than one which uses 12 bits. The compact disc standard uses 16 bit samples taken at a 44kHz sample rate. For speech analysis it is common to use 12 bit samples at around 20kHz for studio quality recordings or 8-10kHz for telephone or office environment recordings.

Once a digital signal has been captured it can be stored
inside the computer as a sequence of numbers. Our aim in doing this
is to be able to manipulate and analyse this digital form to extract
information from it and convert it to different forms. To this end
we need to look at the concept of *vectors* and
the operations of *vector arithmetic*.

A vector is the mathematical name for a sequence of numbers of arbitrary length. In Harrington and Cassidy and in these notes we write a vector as a list within square brackets:

This vector consists of six elements. We often denote a vector algebraically by a bold letter, for example[1, 3, 19, 299, 34, 12]

x= [1, 3, 19, 299, 34, 12]

A number of simple operations are possible on such vectors and these are described in Harrington and Cassidy Section 5.3, page 138. In summary we can add a scalar (a scalar is a single number, as opposed to a vector which is a sequence of numbers) to a vector, multiply by a scalar or add/multiply two vectors. When combining vectors in this way both must have the same length since the elements are combined in a pairwise manner. See the text for examples and notation.

When a vector is used to denote an acoustic signal some more assumptions are made. Acoustic signals are notionally infinite (they have no start or end) so for any finite vector we assume that the numbers before and after the vector exist and are zero. This becomes important when we shift vectors to the left or right (a useful operation in some common signal processing operations) -- in this case a zero is shifted in the the first (or last) position of the vector.

The exercises for this section come from the "Practical Exercises with XlispStat" -- the second set of worksheets on the CDROM accompanying Harrington and Cassidy. These worksheets require that you learn a small amount of the Lisp language although most of the exercises can be completed by copying from the examples in the worksheets. You should begin with the introduction in section 2 which covers the basics of the language and environment. The exercises for later chapters will build on the basic commands that you learn here.

For this chapter you should work through the worksheets in section 3 (Time Domain Processing) up to section 3.4. Make sure that you are comfortable with the notion of storing sequences of numbers as vectors and the operations that you can perform on them. You will use these operations later to perform some simple digital signal processing on speech signals.

**Table of Contents**

- 5.1. Windowing Signals
- 5.2. Time Domain Parameters
- 5.3. Exercises
- 5.4. Additional Reading

Having digitised a speech signal we now have a vector of samples which can be manipulated in various ways. Our aim in this course is to be able to extract characteristic parameters from a signal, which can be used to identify different kinds of speech sounds. This session will discuss some parameters which can be extracted directly from the time domain signal.

When we analyse signals we tend to do so on only small
portions at a time. The reason for this is that we typically assume
that the signal is constant (eg. has a constant frequency) over the
time-span of our analysis. Since speech is actually changing
rapidly, we have to cut it into small parts for this assumption to
hold. Hence we typically *window* a speech
signal before analysis, which just means that we cut out a small
section.

Windowing can be seen as multiplying a signal by a window which is zero everywhere except for the region of interest, where it is one. Since we pretend that our signals are infinite, we can discard all of the resulting zeros and concentrate on just the windowed portion of the signal.

The above window (zeros and ones) is known as a
*rectangular window* because of its shape. One
problem with this kind of window is the abrupt change at the edge,
which can cause distortion in the signal being analysed (in fact
*any* windowing operation causes distortion,
since the signal is being modified by the window). To reduce this
distortion we often use a smoother window shape called a
*Hamming* window. This window is zero at the
edges and rises gradually to be 1 in the middle. When this window
is used the edges of the signal are de-emphasised and the edge
effects are reduced.

It is important to use a Hamming (or the similar
*Hann* window) in some kinds of analysis,
particularly the frequency domain methods we will look at later. For
many time domain methods the rectangular window is
preferable.

If we are analysing a long signal, then successive windows are taken along the length of the signal. Often we overlap the windows if a Hamming window is being used so that the de-emphasised part of one window becomes the middle of the next.

The *root mean square* amplitude is a
measure of the energy in a speech signal. When applied to
successive windows of a speech signal it gives us a measure of the
change in amplitude over time.

Zero crossing rate measures the number of times a signal crosses the zero line per unit of time. This gives us a measure of the dominant frequency in a signal -- the frequency with the largest amplitude. This is often correlated with the first formant in vowels. ZCR can be useful in differentiating between voiced and unvoiced sounds since unvoiced sounds tend to have a large ZCD while in voiced sounds it is smaller.

A combination of RMS and ZCR can be used to make a simple speech/nospeech decision. ZCR tends to be high in unvoiced sounds, RMS is high in voiced sounds -- a suitable weighted sum will be high for any speech sound and low for non-speech.

Autocorrelation can be used to find the fundamental frequency or pitch of a signal. The technique relies on finding the correlation between a signal and a delayed version of itself. If the delay corresponds to exactly one pitch period, then the signal and the delayed version will co-vary -- that is when one goes up the other will go up, etc -- we say that they will be highly correlated. If the delay corresponds to half a pitch period, they will be opposed to one another or un-correlated. If we plot the degree of correlation (which varies between 1 and -1) vs. the lag between the two signals we get an autocorrelation curve, as shown in the figure. We can see a peak in the curve corresponding to a lag of one pitch period -- and hence autocorrelation can be used to determine the pitch of a signal.

Real speech doesn't give as clean a curve as in the example above (which was from a sinusoid). But as the example here shows we can often see good peaks in the autocorrelation curve of voiced speech (right). Unvoiced speech (left) has a very different autocorrelation pattern and hence autocorrelation is another way of making the voiced/unvoiced decision.

You should again do some exercises from the `Practical Exercises...' worksheets from Harrington and Cassidy. Continue with section 3.5 on windowing signals and then look at the first part of section 4 on speech signal parameters which gives examples of calculating autocorrelation coefficients for a sample signal. Leave the remaining exercises in section 4 until later.

This worksheet is intended to explore the use of autocorrelation for pitch tracking of speech signals. You will use the autocorrelation routines in XlispStat/Emu to look at various speech signals and asses the ease of automatic pitch tracking.

XlispStat/Emu has an autocorrelation routine which can be
accessed by sending the `:autocorr` message to a
`signal` object. The autocorrelation is by
default performed on the entire signal, so if you want to analyse
just a window from the signal you need to window the signal
separately with the `:window` message. For
example:

(def w (send stamp :window :start 2000 :width 1024 :type 'rectangular) (def w-ac (send w :autocorr)) (send w-ac :plot)

You can use the `window-demo` function to preview the
result of applying autocorrelation to different parts of a signal
as follows:

(window-demo stamp :transform :autocorr :width 1024 :type 'rectangular)

Answer the following questions:

For the

`stamp`signal, estimate the pitch of the signal using autocorrelation at regular intervals along it's length.Why can't you estimate the pitch in an unvoiced section like the initial /s/? Suggest a method of differentiating between voiced and unvoiced speech using only an autocorrelation analysis.

**Table of Contents**

- 6.1. The Discrete Fourier Transform
- 6.2. Bandwidth
- 6.3. Decibels
- 6.4. Frequency Domain Filtering
- 6.5. Convolution and Time Domain Filtering
- 6.6. Exercises

The Fourier transform transforms a time domain signal into a
frequency domain representation of that signal. This means that it
generates a description of the distribution of the energy in the
signal as a function of frequency. This is normally displayed as a
plot of frequency (x-axis) against amplitude (y-axis) called a
*spectrum*.

In digital signal processing the Fourier transform is almost always performed using an algorithm called the Fast Fourier Transform or FFT. This is, as its name suggests, a quick way of performing this transform and it gives the same results as the slower Discrete Fourier Transform (DFT) would. We needn't worry here about any of the details of how the transform is carried out, we are interested only in the results. One consequence of using the FFT algorithm is that the length of the signal being analysed must be a power of two, eg. 128, 256, 512, 1024 and so on (although many modern implementations get around this limitation).

Applying the FFT algorithm to a 512 point window of data taken from a speech signal results in a vector of 512 numbers corresponding to the energy at 512 frequencies spanning the range from 0Hz to the sampling frequency. From Fourier's theorem we know that adding together 512 sinusoids at these frequencies will exactly reconstruct the original signal. From Nyquist's theorem we know that the largest frequency component in the original signal must be half the sampling frequency, all frequencies above this are aliased onto lower frequencies. It follows that the second half of the spectrum is the mirror image of the first half:

In fact, the first point in the spectrum isn't reflected and the frequency range from 0 to half the sampling frequency is covered by (N/2)+1 points where N is the size of the window. So from a 512 point fft of speech sampled at 10000Hz we get 257 unique spectral points covering the range 0 to 5000Hz.

For any digital spectrum we can work out the space between
points in on the frequency axes, in the example above this will be
5000/257 or 19.5 Hz. This means that if there are two peaks (or
dips) in the spectrum separated by less than 19.5Hz we won't be able
to resolve them -- they will be blured together. If we used a wider
window, say 1024 points, the space between points on the frequency
axis drops to 9.7Hz, now we can see more detail than before. The
resolving power of a spectrum is called it's
*bandwidth*, refering to the width of the
frequency bands that each point in frequency represents. A large
bandwidth implies few points in the spectrum and hence poor
resolution in frequency; a small bandwidth implies many points in
the spectrum and good resolution in frequency.

There is a tradeoff between resolution in frequency and resolution in time. To get good frequency resolution we need to take a large window of points from the signal. By analysing this large window we blur any spectral changes which happen over it's length -- that is, if two events occur close together in time, a large window will tend to blur them together. Hence good frequency resolution implies poor time resolution and good time resolution implies poor frequency resolution. We can see the result of this tradeoff in wide and narrowband spectrograms (end of Chapter 2 in the text). The wideband spectrogram has a large bandwidth (small window) and so blurs the frequency scale but gives good time resolution -- you can see vertical striations on the spectrogram corresponding to the opening and closing of the glottis. The narrowband spectrogram has a small bandwidth (large window) and so looses time resolution but displays excellent frequency resolution -- you can see horizontal bands corresponding to the harmonics of the pitch frequency.

The figure below shows dB spectra corresponding to a section of the vowel in stamp calculated withh window sizes of 1024 points (red) and 256 points (black). Notice the peaks due to pitch harmonics in the red line which are blurred over in the black.

When the window size is very small, the spectrum generated by the fourier transform will have a correspondingly small number of points and hence will appear very jagged. In order to produce a smoother result in this case the input window can be zero-padded to extend the size of the window without taking more of the input signal. The fourier transform of the zero padded signal will have many more points and the result is that the spectrum appears to have been smoothed out. An example can be seen in Figure 6.1 which shows a raw spectrum taken on 50 points next to a smoother version taken from the same 50 points padded with 500 zeros.

A similar example is shown in Harrington and Cassidy p. 167. The smoothed spectrum has the same peaks and troughs as the original but note that it also introduces new peaks and troughs of it's own. Thus it is important to know when you are looking at a zero padded spectrum -- not all of what you see is characteristic of the input signal.We commonly transform a spectrum as calculated by the fft into a decibel spectrum to better reflect the response of the human ear. The units of amplitude in the raw spectrum shown above are the same as those of the sampled acoustic signal. The decibel scale is a logarithmic scale of relative amplitude where we first devide by a reference amplitude (which corresponds to 0dB) and then take the naturnal logarithm. Finaly we multiply by 20:

AdB = 20*log(A / Aref)

where AdB is the amplitude in decibels, A is the raw amplitude and Aref is the (raw) reference amplitude. There are five basic choices for Aref:

Choose a known standard reference value. This requires calibration of your recording and digitising equipment against some standard.

Take Aref to be the average amplitude in the utterance you took your window from.

Make Aref the peak amplitude value in the spectrum, this makes the peak always have an amplitude of 0dB.

Make Aref the amplitude at a fixed frequency, say 1000Hz, for this spectrum. The spectrum will always now pass through 0dB at 1000Hz.

Make Aref zero, that is just take the log of the raw amplitude value.

The choice of value for Aref depends on the needs of your
application. The big question is whether you want to preserve
differences in amplitude between spectra or between utterances. For
example, if you think that amplitude is going to be a useful cue
then you need to preserve the *relative*
amplitude of your spectra, so you can't use either method 3 or 4
above. If you want to factor out the overall loudness of a recording
then method 2 might be useful. If you want to get precise relative
amplitude measures then you need to use a calibrated system (method
1).

Filtering is an operation on an acoustic signal that changes the shape of the spectrum of the signal. A filter is a device (or an computer program) which applies such a change to a signal. Examples of filters are the bass/treble controls on your stereo which attenuate the higher or lower frequency ranges in the music signal. Each filter has a characteristic spectral shape which describes which parts of the original signal will be attenuated and which will be left alone. An example for a low pass filter might be as shown in Figure 6.2.

The effect of applying a filter to a signal is to modify the
shape of the signal's spectrum. In the frequency domain the effect
of applying a filter to a signal is to *multiply*
the spectrum of the signal by that of the filter. (Note that we
multiply the raw spectra and not the dB spectra -- since dB spectra
are logarithms they should be added to achieve the same effect since
`log(a*b) = log(a)+log(b)`.) The result is a
spectrum that combines the features of those of the input signal and
the filter. An example is shown in Figure 6.3
(taken from Harrington and Cassidy, p. 191) of a simple impulse
train input signal (which has a `flat' spectrum) filtered through a
bandpass filter; the output signal shows the characteristics of both
spectra.

Having applied a filter by combining the spectra, a new filtered signal can be derived by performing an inverse fourier transform on the new spectrum.

The question arises as to how we can find the spectrum of a
filter, or indeed how a filter is represented digitally. A digital
filter can be characterised by an acoustic signal called an
*impulse response* -- literally this is the
result of passing an impulse (a 1 followed by lots of zeros) through
the filter. The spectrum of a filter can be derived by taking the
spectrum of its impulse response.

Filters are an incredibly important topic in the study of speech acoustics. As we will see, the vocal tract acts as a filter which changes the spectral characteristics of the acoustic source generated by the glottis or by frication. Understanding the nature of filters and the separability of the source and filter are fundamental to the study of speech acoustics and form the basis of much of speech technology research.

Although it is possible to apply a filter in the frequency domain by multiplying raw spectra, this is not the most practical way of applying a digital filter. Fortunately there is a time-domain operation that has the equivalent effect: convolution. Convolution is an operation that combines two vectors to give a third vector. If the first vector is an acoustic signal and the second is the impulse response of a filter, then the result of convolution is a filtered signal.

The process of convolution is described in detail in Harrington and Cassidy, Section 5.6, p. 148. It involves a series of multiplications and additions of corresponding points in the two vectors. Rather than reproducing the explanation here I refer you to that discussion.

An important point to take away from the discussion of
convolution is that the operation is equivalent to weighted
differencing of the input signal. The filter provides the weighting
coefficients. Also, remember the formula `y = x*h`
where `y` is the output signal,
`x` is the input signal and `h` is
the filter impulse response. We will see this formula later when we
look at more advanced DSP methods; it provides an important key to
analysing speech signals.

For this section you should complete some of the worksheets on the source filter model from the `Interactive Worksheets on Speech Acoustics' on the Harrington and Cassidy CDROM. Worksheet sections 3.1, 3.2 and 3.3 are relevant. The later sections will be useful when we talk about the source filter model of speech production in the next section.

**Table of Contents**

- 7.1. The Source
- 7.2. The Filter
- 7.3. Combining Source and Filter
- 7.4. Source Filter Synthesis
- 7.5. Source Filter Analysis
- 7.6. Exercises
- 7.7. Further Reading

We are now equipped to begin talking about an important model of speech production: the source filter model. This model is at the heart of many speech analysis methods and drives thinking in speech perception research also.

The view put forward in the source filter model is that speech sounds are produced by the action of a filter, the vocal tract, on a sound source, either the glottis or some other constriction within the vocal tract. Fundamental to the model is the assumption that these are independent -- that is the properties of the filter can be modified without changing the properties of the source and vice versa. This assumption may not be strictly true in all cases but in practical terms provides us with a very useful and largely accurate model of speech production.

There are two acoustic sources in speech production corresponding to voiced and unvoiced speech sounds.

The source in voiced speech is the vibration of the vocal folds in response to airflow from the lungs. This vibration is periodic and if it could be examined independantly of the properties of the vocal tract (which change it's spectral shape) would be seen to consist of a series of broad spikes. Figures 3.4 and 3.5 from Harrington and Cassidy (reproduced in Figure 7.1) show the waveform and spectrum of the glottal source.

The spectrum of the glottal source is made up of a number of frequency spikes corresponding to the harmonics of the fundamental frequency of vibration of the vocal folds. The spectrum decreases in amplitude with increasing frequency at a rate of around -12dB per octave -- that is for each doubling in frequency, the amplitude of the spectrum decreases by around 12dB.The effect of increasing the frequency of vibration of the vocal folds is to widen the gap between the frequency spikes in the glottal source spectrum, since these spikes occur at multiples of the base or fundamental frequency. The overall shape of the spectrum (the -12dB per octave attenuation) remains unchanged.

The average male voice has a source frequency of around 100Hz, female and child voices are typically higher in pitch: around 200 Hz for an average female voice and 200-300 Hz for children.

In unvoiced speech the sound source is not a regular vibration but rather vibrations are caused by turbulent airflow due to a constriction in the vocal tract. The various positions at which the vocal tract can constrict have been discussed in the earlier part of the course (the places of articulation for fricative sounds).

The sound created via a constriction is described as a noise source. It contains no dominating periodic component and has a relatively flat spectrum meaning that every frequency component is represented equally (in fact for some sounds the noise spectrum may slope down at around 6dB/octave). Looking at the time waveform of a noise source we see only a random pattern of movement around the zero axis. The spectrum looks the same, with random peaks and troughs but the overall trend is for it to be flat accross the frequency range.

We have seen that a filter is a system that alters the frequency content of an acoustic signal. The filter in speech production is the vocal tract, a tube of uneven cross section which is closed at one end (by the glottis) and measures around 17.5 cm for the average male. As with any other filter, this tube has a characteristic spectrum. Importantly, this spectrum changes as the shape of the vocal tract changes during speech production. The different qualities of the speech sounds are produced by changing the shape of the vocal tract to produce a particular set of filter characteristics.

To understand the effects of the vocal tract we can model it as a cylindrical tube, closed at one end, which makes the calculation of the filter spectrum much easier. The spectrum of this tube has a set of peaks which correspond to resonant frequencies -- basically the frequencies of vibration which fit into the tube best. For a full discussion of this model see Harrington and Cassidy Section 3.3.1. Resonance is the property which allows you to get a single note out of a half full beer bottle by blowing over the top: the wavelength of the vibration is such that it fits within the tube. Change the length of the tube (by taking a swig of beer) and the note changes because now a different wavelength fits the tube best. In fact you can get more than one note if you are well practiced -- this is how a trumpet can play so many notes with only three `keys'. Each length of tube can fit a number of different resonances but the resonances frequencies are all related to the length of the tube.

The end result then, is that a cylindrical tube has a spectrum with a set of peaks at around 500Hz, 1500Hz, 2500Hz etc. which correspond to the resonant frequencies of the tube. Now, if the shape of the tube is changed, for example by adding a constriction halfway along it, the position of these resonances changes also. In the case of a constriction, the effect is to move the resonant frequencies relative to each other -- so that they are no longer equally spaced. Translating this back to the real vocal tract, the cylindrical tube corresponds approximately to a relaxed vocal tract and the corresponding pattern of resonance matches that of the schwa vowel. Constrictions are introduced by raising the tongue and moving the position of the raised tongue back and forward. This has the effect of modifying the positions of the resonances relative to each other, producing the different vowel sounds.

Another filter is involved in speech production, corresponding to the effect of the lips and the air beyond them on the acoustic signal. This lip-radiation filter has a characteristic spectrum which is a 6dB/octave rise -- that is the lower frequencies are attenuated wheras higher frequencies are not. This spectrum is multiplied by the spectrum of the vocal tract to produce an overall spectrum for the system.

As discussed earlier, the effect of filtering a waveform can be achieved either by convoltion in the time domain or by multiplication in the frequency domain. To produce speech sounds the source (voiced or unvoiced), is passed through the vocal tract filter as described above.

For voiced speech, the combined signal has peaks due to the harmonics of the glottal source superimposed upon peaks due to the resonances of the vocal tract. These peaks of the combined spectrum are formants, which as was discussed earlier in the unit are characteristic features of voiced speech sounds. It is important to note that the formant peak in the resulting spectrum will occur on the closest source harmonic peak to the resonance peak -- hence the resonant frequency and the formant frequency aren't necessarily the same thing.

Fricative speech sounds do not display the harmonic peaks of voiced speech but the characteristic peaks of the vocal tract filter can still be seen. Again, the position of these peaks serves to characterise the speech sound since it is intimately related to the vocal tract configuration that produced it.

In the last section we discussed a simplified model of speech production which used a single tube of constant cross section to model the vocal tract. As was mentioned their, a slightly more realistic model includes a constriction in the tube to represent the position of the tongue during articulation. Fairly detailed models of articulation can be developed by using a small number of tubes to model the vocal tract -- a common model uses one tube for the area behind the tongue, one for the constriction created by the tongue, one for the area front of the tongue and a final tube for the lip apperture. The lengths and cross sectional areas of these tubes can be adjusted to account for a number of speech sounds. This model is very useful in predicting the gross changes of formant positions caused as the tongue is raised or lowered and moved from the back to the front of the mouth.

As discussed in Chapter 3 of Harrington and Cassidy, the
effects of different vocal tract configurations can be explored via
a *nomogram* which is a plot of the predicted
resonant frequencies in a four tube model as the constriction tube
is moved from along the vocal tract. The tube configuration is
illustrated in Figure 7.2 while Figure 7.3 shows an example nomogram.

**Figure 7.3. An example nomogram. The solid line corresponds to a lip
tube area of 4 cm ^{2} (unrounded vowels)
while the dotted line corresponds to a lip tube area of 0.65
cm^{2} (rounded vowels).**

You can explore tube models in the exercises for this lecture. These models illustrates one approach to generating synthetic speech, namely articulatory synthesis. Models with a larger number of tubes can be used to produce more accurate reproduction of various speech sounds. It is also possible to derive a tube model for a given speech sound via LPC analysis -- this won't be covered in this unit but is discussed in Harrington and Cassidy, Chapter 8. Such a derived tube model can be used to re-synthesise speech by applying a suitable source to the model.

The source filter model describes the process of speech production in terms of two independent contributions from the sound source and the vocal tract filter. As a model of production, this allows us to synthesise sounds as described in the previous section, but the model can also be used to aid the analysis of speech sounds. The key is to notice the importance of the independence of the source and filter; this means that each makes a separate contribution to the characteristic features of the resulting speech sounds. The source, in voiced speech, is responsible for the pitch of the sound while the vocal tract filter is responsible for the location of the formants and the overal spectral shape. In a real speech spectrum the overall filter shape and the location of the formants is often drowned out by the effects of the source spectrum. If it were possible to remove the source effects then the two spectra could be studied separately to give a more accurate picture of the precursors of the speech sound.

Removing the effects of the source from a spectrum also has the effect of removing many sources of variability from the signal since many of these are realised in the source rather than in the filter. Hence removing the source can help make, for example, the vowel sounds more distinct from one another and could make vowel sounds from different speakers look more similar.

There are a number of techniques for separating source and filter in a speech signal. In this section I will discuss only one of these: Cepstral analysis.

Cepstral analysis relies on the observation that a logarithmic
speech spectrum is made up from the source and filter spectra added
together. To understand this, recall that filtering in the frequency
domain is achieved by multiplying spectra together. Multiplication
corresponds to adding logarithms and so a filtered spectrum can be
derived by adding logarithmic (or dB) spectra. The procedure for
cepstral analysis is to take the inverse fourier transform of the dB
spectrum, thus converting the signal back into the time domain.
This time domain signal is not a regular acoustic signal, since it
was derived from the logarithmic spectrum; for this reason it is
called a *cepstrum*, sort of a backward
spectrum. The important property of the cepstrum is that since the
dB spectrum is the sum of two spectra, so the cepstrum is the sum of
two components corresponding to the source and the filter. Usefully
the two components are separated: the lower end of the cepstrum
corresponds to the filter while the higher end (or rather the middle
of the reflected cepstrum) corresponds to the source. An
illustration of the process of applying cepstral analysis is shown
in Figure 7.4.

A further operation on the cepstrum is to remove the central part of the reflected cepstrum, the part that corresponds to the source, and perform a fourier transform to again generate a frequency domain version of the signal. Since the effect of the source has been removed from this spectrum, it shows the characteristics of the vocal tract filter. This is manifested as a much smoother spectrum than the original; the degree of smoothing depends upon the number of cepstral coefficients removed prior to the final fourier transform.

The technique of cepstral analysis is described in more detail in Harrington and Cassidy. The technique can be used to generate smoothed spectra which show the characteristics of the filter as described above. The cepstrum can also be used as a means of determining the fundamental frequency of voiced speech since the part of the cepstrum corresponding to the source is often manifested as a single spike. The location of this spike gives a measure of the frequency of the source signal.

Cepstral analysis is an important technique for examining speech signals. You should ensure that you understand the process of generating a smoothed spectrum and the reasons why the process works the way it does. Experimenting with the online examples may help you in coming to terms with it. This is just one of a family of techniques that seek to separate the source and filter in speech analysis. These techniques are useful in that they enable us to remove variability from the speech signal which is due to the source and concentrate our analysis on the vocal tract filter.

In the `Interactive Worksheets' on the Harrington and Cassidy CDROM, Chapter 3 presents some exercises on the source filter model. You should already have looked at the first three sections of this chapter in regard to filters. Section 3.4 covers the glottal source. Sections 3.5 and 3.6 cover tube models of the vocal tract and allow you to visualise the spectrum of a simple tube model as well as see the effect of combining source and filter spectra. You can also use this tool to hear the difference between the filtered and unfiltered source.

Cepstral processing is covered in the `Practical Exercises with Xlisp-Stat' worksheets, in sections 4.2 to 4.4. These worksheets demostrate how to generate a cepstrally smoothed spectrum in Xlisp-Stat and explore the effect of the number of cepstral coefficients retained in the analysis.

You should submit your answers to the *first
three* questions from worksheet 3.6 (Tube Models II).

From the Cepstral processing worksheet (Practical
Exercises... 4.2) submit your estimates of the pitch of the
`stamp` signal estimated from the
cepstrum. Compare these with the estimates you derived via
autocorrelation in an earlier worksheet.

**Table of Contents**

**Table of Contents**

In this session I want to introduce the major components of a
speech recognition system and basically run through a preview of the
contents of this unit. The reading for this session is the chapter by
Power: *The listening telephone -- automating speech
recognition over the PSTN* which is taken from a book
*Speech Technology for Telecommunications* edited
by Westall, Johnston and Lewis. This chapter provides a nice overview
of the field, it does go into a little more detail than I'd prefer at
this point but it's all material that we will be covering in depth
later in the course.

In these notes I'll touch on the major points as I see them in this article.

The domain of this course is the related technologies of speech recognition and speaker verification. Speech recognition is the process of finding a linguistic interpretation of a spoken utterance; typically this means finding the sequence of words that were spoken but, as we'll see later, there is other linguistic information that we could recover, such as prosodic cues. Speaker verification is the problem of verifying that a `voice print' from an claimed user is the same as that stored for that user.

Both of these applications share some components. They must first pre-process the acoustic signal to parameterise it in a more useable and useful form. They must both match the input signal against a stored pattern and then make a decision about accepting or rejecting the match. Each problem has its own difficulties: in speech recognition the number of possible utterances can be very large and differences between speakers confound the problem; in speaker verification only 100% accuracy is really acceptable and differences in a speaker's voice due to the time of day or the microphone used can be as large as the differences between two similar speakers.

Most of this unit will look at speech recognition since this is both the most complex topic and the most used technology. Many of the techniques we learn about are applicable to the verification task as well.

The overall architecture of a speech recognition or speaker verification system is shown in Power's Figure 9.1. This diagram covers both speech and speaker recognition. Notice in particular the two paths through the system for training on the one hand and pattern matching or recognition on the other. The training process is vitally important and involves building some kind of internal model of the speech signal. In speaker verification there is one model for each speaker and the model must capture enough about the speaker's voice characteristics to be able to match that speaker and no-one else. In speech recognition the models might correspond to words, syllables, phonemes or other units. Much of what we'll study in this course is to do with what constitutes a model and how they are compared with incomming speech signals.

This is a topic that isn't often covered in detail but obviously a pre-requisite for any functional speech input system. In many systems there is a defined time when input is listened for, which simplifies the problem a little. Imagine though a `pervasive' recognition system which must be able to listen all the time and process whatever input is recieved, perhaps in a noisy environment. Such a system will need a complex speech/nospeech decision algorithm.

The issue of barge-in is also becoming important as we learn how to implement more intelligent spoken language dialog systems. When the user is able to respond at any time the system needs to be listening even while it is speaking -- just like we do.

We've already begun to explore the problem of how to extract useful information from a speech signal. We looked at techniques for separating source and filter components of the signal and providing parameterised versions of these. The input to speech recognition systems is a sequence of such parameter vectors. We will explore the motivations for selecting the parameters for an application and look at some more techniques for processing the input speech.

The major component of a speech recognition system in terms of complexity is the Hidden Markov Model. An HMM is a statistical model of a class of speech signals which can be used to estimate the probability that a given input sequence belongs to the class on which it was trained. The class can be a word, phoneme or other unit as mentioned earlier. In general we want to use sub-word models as building blocks to produce word or multi-word recognisers; in this case the problem of finding an interpretation amongst the many possible interpretations becomes paramount. It is here that we begin to incorporate models of the language we are recognising to help constrain the search -- we don't consider arbitrary sequence of phonemes or words, only those which fit our models of the language and the task. These issues are discussed in Power's section 9.6 on Speech Recognition.

What is the result of the speech recognition process? When
you use Dragon Naturally Speaking the result is text appearing in
your word processor window. The raw output from the speech
recognition engine though can be richer than this. To begin we can
associate a probability score with the output which measures the
closeness of match between the input and the model. In addition
the search procedure can return more than one sentence or word
hypothesis along with probability scores for each, allowing us to
perhaps try one of the alternatives if they provide a more
reasonable semantic interpretation. As an example, in a banking
application the recognition system might return *My
savings amount* as it's first hypothesis and *My
savings account* as it's second; in a dialog system we
might have additional contextual knowledge which could help us
decide between the two.

The general task of identifying who is talking can be described in terms of two different applications: speaker verification and speaker identification. In speaker verification (SV), an individual make a claim to his or her identity and we must verify that claim by matching a speech sample to a stored model. In speaker identification (SI), no claim is made and we must identify the speaker from a set of known speaker models. Speaker verification is used in security applications where a reliable method of identification is needed; for example for banking transactions over the telephone. Speaker identification might be used in a dialog system to allow tracking of multiple speakers in a dialog or to allow the system to recognise a user from earlier dialogs.

In order to verify a subject the input speech needs to be compared to a stored model for that user. The decision as to whether to accept the identity claim is based on a similarity threshold -- if the input is closer than the threshold, then we accept. However, this ignores all the other user's we know about so it is common to also take into account the similarity of the input to a group of `similar' users; we might say that if the input is closer to the claimed identity than it is to any of the similar users then we accept, otherwise we reject. As you can imagine, there are many variations on this theme.

One of the major considerations in SV applications in the reliability of the verification decision. The classic tradeoff in SV is between false acceptances and false rejections: if the threshold is relaxed, all of the valid claims will be accepted but so will many invalid ones; if the threshold is made more restrictive, most invalid claims will be rejected but so will many valid ones. In order to objectively evaluate a system's performance a measure called the Equal Error Rate (EER) is often quoted. EER measures the error rate of a system when the threshold is adjusted so that the number of false acceptances is equal to the number of false rejections. In a real application, the error performance can be adjusted to suit the level of security required: secure or convenient.

The speaker model used in SV or SI systems can be any kind of model of the acoustic signal. HMMs are common as is a method called vector quantisation; we'll discuss the details later. Unlike in speech recognition, the goal is to retain as much speaker specific information as possible in the signal.

**Table of Contents**

In this section we revisit the subject of the last part of
SLP801 to look at how the speech signal is processed prior to
matching against stored patterns. To provide an overview of the goals
and methods of this stage we will look at a paper by Joe Picone
(*Signal Modeling Techniques in Speech
Recognition* from the Proceedings of the IEEE, 1993). This
paper describes this stage as *Signal Modeling*
to emphasis that we are attempting to derive a parameterised version
of the input speech signal which captures it's important qualities
while discarding unimportant and distracting features. In addition,
we will refer to the last part of Chapter 6 (6.6 onwards) and Chapter
8 from Harrington and Cassidy for details on one signal modelling
technique.

No two utterances of the same word or sentence are likely to give rise to the same digitial signal; this obvious point underlies the difficulty in speech recognition but also means that we may be able to extract more than just a sequence of words from the signal. If we can understand the different sources of variability in the signal then we can begin to approach the problem of either separating them out for subsequent analysis stages.

Let's consider the factors which could cause two random speech samples to differ from one another:

*Phonetic identity*: the two samples might correspond to different phonetic segments, eg. a vowel and a fricative. Another source of variability is coarticulation between phonemes.*Pitch*: pitch and other source features such as breathiness and amplitude can be varied independently.*Speaker*: different speakers have different vocal tracts and source physiology. Speakers also get colds, get emotional and do other things to modify their voice properties.*Microphone*: and other properties of the transmission channel (eg. fixed vs. mobile telephone).*Environment*: background noise, room acoustics, distance from microphone.

The requirements of the signal modelling stage will differ depending on the application environment: telephone speech recognition applications may need to use techniques to compensate for line noise and changes in the properties of the telephone line during each utterance; an in-car system will need to be robust to significant amounts of background noise. These may call for different signal modelling techniques in addition to different approaches to pattern matching.

The aim of signal modelling is to derive a feature vector such that the vectors for the same phoneme are as close to each other as possible while the vectors for different phonemes are maximally different to each other. In other words, the feature vectors we derive for all /a/ vowels should be very similar to each other (by some metric that we have yet to define) while being different to those for any other phoneme class.

This goal can best be achieved by finding a signal transform that isolates the variability due to phonetic class from the other sources of variability in the speech signal.

In order to understand LPC and many other signal processing
techniques it is necessary to look at the mathematics behind the
*z-transform*. Unfortunately this is quite a
hairy topic for a non-mathematical audience and so can be a hurdle
in properly understanding these techniques. In our book we try to
give a simple step-by-step coverage of the z-transform and how it
relates to digital filtering and I would like to walk through the
major points in these notes. The aim of this exercise is so that
you can look at, for example, equation 18 in Picone's paper and not
only stay upright but maybe even say, "hmm, I see".

The story begins back in Chapter 6 on page 178. If you remember back to 801, we talked about frequency domain filtering as a mulitplication operation between the spectra of the source and filter. In this section we develop a new way of talking about the frequency domain, the z-transform, such that we can characterise a signal and a filter as a polynomial, and perform the filtering operation by multiplying these together.

Section 6.6.1 explains how the amplitude and phase of a
digital signal can be expressed as a vector of complex numbers and
in particular that the spectrum of an impulse [1 0 0 0 ...] can be
expressed as the vector X[k] = 1 + 0i for all k. This corresponds
to saying that the unit impulse is the sum of sinusoids at all
(digital) frequencies with amplitude 1 and phase 0 radians.
Following this we see that the spectrum of a time shifted version
of this impulse signal differs only in the phase component and can
be neatly expressed via Euler's relation as X[k] = Aexp(-iWkp)
(where exp denotes the exponential and the signal has an amplitude
of A and has been shifted by P points relative to the original unit
impulse). We then define a new variable z = exp(iWk) so that the
shifted spectrum can be written as X(z) =
Az^{-p} -- a polynomial in z or the
z-transform of the shifted impulse signal.

We next note that any signal can be seen as the sum of weighted, shifted, impulse signals ([2 3 4 5] = 2*[1 0 0 0] + 3*[0 1 0 0 ] etc). Hence (since we are dealing with linear time-invariant signals) the z-transform of an arbitary signal can be derived from the z-transforms of the shifted impulses.

(note that there's a typo in this example in the book (top of p185) where the exponent of the third term is -3 instead of -2.)x[n] = [4 2 1 3 0 0 0 0]

X(z) = 4 + 2z

^{-1}+ z^{-2}+ 3z^{-3}

So, now we know what a z-transform is (a way of writing the spectrum of a signal) and we can derive the z-transform for any digital signal we come across. Note that by substituting z = exp(iWk), where k is a vector of digital frequency values, into any z-transform we can get back to the DFT of a signal.

We know that we can apply a digital filter either by convolving a time domain signal with the impulse response of the filter or by multiplying the fourier spectra of the signal and the filter. Section 6.6.2 describes how we can also apply a filter by multiplying the z-transforms of the signal and the filter. In 6.6.3 we see that writing the source filter equation in terms of z-transforms:

allows us to combine the recursive (A(z)) and non-recursive (B(z)) parts of the filter into a singleY(z)A(z) = B(z)X(z)

Figure 6.16 summarises the relationship between signals, spectra and z-transforms. The impulse response and the spectrum are two ways of characterising the effect of filter; either can be derived from the convolution equation and we can transform between them using the DFT and IDFT.

Moving now to Chapter 8 we will begin to look at Linear Predictive Coding. LPC is another method of separating out the effects of source and filter from a speech signal; similar in intention to cepstral analysis but using quite different methods. One way of thinking about LPC is as a coding method -- a way of encoding the information in a speech signal into a smaller space for transmission over a restricted channell. LPC encodes a signal by finding a set of weights on earlier signal values that can predict the next signal value:

If values for a[1..3] can be found such that e[n] is very small for a stretch of speech (say one analysis window), then we can transmit only a[1..3] instead of the signal values in the window. The speech frame can be reconstructed at the other end by using a default e[n] signal and predicting subsequent values from earlier ones. Clearly this relies on being able to find these values of a[1..k] but there are a couple of algorithms which can do this (one is covered in the book). The result of LPC analysis then is a set of coefficients a[1..k] and an error signal e[n], the error signal will be as small as possible and represents the difference between the predicted signal and the original.y[n] = a[1]y[n-1] + a[2]y[n-1] + a[3]y[n-3] + e[n]

There is an obvious parallel between the LPC equation and that of a recursive filter (y*a = x):

where we have rearranged the terms as in Equation 8.9 in the text. The LPC coefficients correspond to those of a recursive filter and the error signal corresonds to a source signal. Moreover, the conditions under which the error signal is minimised in LPC analysis mean that the error signal will have a flat spectrum and hence that the error signal will approximate either an impulse train or a white noise signal. This is a very close match to our source filter model of speech production where we excite a vocal tract filter with either a voiced signal (which looks like a series of impulses) or a noise source. So, LPC analysis has the wonderful property of finding the coefficients of a filter which will convert either noise or an impulse train into the original frame of speech.y[n] = -a[1]y[n-1] - a[2]y[n-1] - a[3]y[n-3] + ... + x[n]

The result isn't quite perfect; as pointed out on page 214 the filter coefficients derived by LPC analysis contain information about the glottal source filter, the lip radiation/preemphasis filter and the vocal tract itself. However since these are much less variable than the vocal tract filter we can factor them out in practice (eg. by preemphasis before LPC analysis).

Why did we need to know about z-transforms to cover LPC analysis? Well, if this were as far as we were going then we didn't need z, but LPC is really just a way in to some more interesting signal analysis techniques.

The LPC coefficients make up a model of the vocal tract shape that produced the original speech signal. A spectrum generated from these coefficients would show us the properties of the vocal tract shape without the interference of the source spectrum. From our earlier discussion we know that we can take the spectrum of the filter in various ways, for example by passing an impulse through the filter and taking it's DFT, or by substituting for z=exp(iWk) in the z transform of the signal. Either way, the result can be quite useful in signal analysis.

Looking at an LPC smoothed spectrum of voiced speech we can clearly see the formant peaks; they tend to be much more well defined than in a cepstrally smoothed spectrum. As discussed on p223, we can use the z-transform notation to find the locations of these formant peaks for a given set of LPC coefficients, corresponding to the points at which A(z) is zero. This is the key to automatic formant tracking of speech signals -- derive the LPC coefficients, solve the z-transform equation and record the resulting formant positions. Unfortunately since the LPC model isn't a perfect fit to real speech production (it assumes a lossless, all pole model, for example) this method will derive spurious formants; most of the work in a good formant tracking program is working out which of the candidiate formants is the real thing.

LPC coefficients can also be used to derive cepstral coefficients and area functions as described in the remainder of Chapter 6. LPC is a powerful signal modelling technique and is very important in speech recognition and speech analysis.

The paper by Picone gives a nice framework for thinking about the signal modelling/feature extraction process, I'll briefly talk through his categories here. In reading the paper you will find a lot of technical detail since his audience are electrical engineers who are very comfortable with this. It is not necessary for you to digest all of the equations etc but you should at least get the general idea in most cases.

I will only cover the first three stages outlined by Picone, leaving the topics covered in Statistical Modelling until the next session.

Before any kind of features are derived from the signal it must first be digitised and prepared for analysis. As we discussed in SLP801 the digitisation process requires that all frequencies above the Nyquist frequency be removed before digitisation to avoid aliasing. After digitisation a digital filter is applied to pre-emphasise the signal; this boosts the high frequencies and, as Picone explains, can be said to be compensating for the -6db/octave rolloff of the voiced source or to simulate our more sensitive hearing above 1kHz. Either way this has been shown to be a useful preparatory operation on speech signals.

In this section Picone discusses a number of analysis methods which derive frequency domain parameters from the speech signal. This is the core of signal modelling where we are finding numerical vectors which encode the kind of variability we are most interested in whilst discarding spurious information. In common with all of these procedures is the use of a time windows which is stepped along the input speech signal selecting frames for analysis; each frame is then turned into one feature vector.

While fundamental frequency isn't generally a useful feature in English speech recognition it can be useful in deriving prosodic information and is becoming useful in tone languages such as Mandarin where different pitch contours can cue lexical differences. Picone describes the families of algorithms that are used to derive pitch contours from a speech signal.

The power of a signal is easily measured and is useful as an adjunct to other feature sets.

The filter bank approach is equivalent to passing the signal through a set of band-pass filters of varying widths and centre frequencies and using the mean amplitude of the output signal from each filter as the feature vector. A linear filter bank would consist of a set of equal width filters spaced at, say, 100Hz intervals (100, 200, 300 etc); a non-linear filter bank uses different width filters at non-linear spacings. A common approach is to try to model the response of the human ear and have many narrow filter banks below 1000Hz and gradually wider bands above; the widths and placements of these can be defined by the mel, Bark or Erb scales.

While filter banks can be implemented as filters it is more common to use a fourier spectrum and derive the filter bank values from that. Picone calls this the Digital Filter Bank.

We've covered cepstral analysis already and as Picone says this is one of the most important analysis methods in ASR. It is common to use a frequency warped version of the cepstrum, typically the mel scale is used. These are calculated by frequency warping the log spectrum prior to taking its DFT (remember that a cepstrum is the DFT of a log spectrum).

The only point I want to take from this section is the use of differencing to generate additional feature sets for input to the recognition stage. It is commonly believed that in addition to the static per frame features some information about phonetic identity is encoded in the way that the speech signal changes through time. One simple way of encoding this information is to take the difference between successive feature vectors and use that as an additional input to the pattern matching stage. The difference operation approximates taking the gradient or first derivative of the feature vectors with respect to time. In some cases it might be useful to include the second derivative or acceleration which can be approximated by differencing the first derivative vectors. It is now common practice to include at least the first difference as input to a speech recognition system.

**Table of Contents**

- 10.1. Spaces and Distance Measures
- 10.2. Gaussian Classifiers
- 10.3. Vector Quantisation
- 10.4. Principal Components Analysis

Having addressed the issue of turning the input speech signal into a sequence of feature vectors, we now begin to look at the problem of comparing an unknown input signal with a stored model. Before getting to the more general problem of comparing sequences of feature vectors we look at comparing single feature vectors with each other and with collections of feature vectors. Reading material for this section is Chapter 9 from Harrington and Cassidy and the final part of the Picone feature extraction paper from last time.

How can two feature vectors be compared together? A feature
vector might consist of 10 or 50 numbers which represent the
essential properties of the input signal. We said earlier that the
ideal feature set would make similar phonemes be close together
while making dissimilar phonemes be distant from one another. The
simplest way to develop the idea of *distance*
between feature vectors is to reduce the number of features to one
or two so that we can more easily visualise them.

Comparing two values of a single feature like pitch has an obvious solution: subtract one value from the other, the smaller the difference the closer the two values are to one another. Moving to two features, such as the first two formants, we can easily visualise a solution by plotting values on a graph. We can measure the distance between the points on the graph to give a direct comparison of the two feature vectors. As discussed in Chapter 9 this distance can be calculated from the difference between the F1 values and the F2 values and is known as the Euclidean distance: this corresponds to the length of the straight line that joins the two points on the graph.

In the above example we call the single feature a
one-dimensional vector and the two feature case a two-dimensional
case. More generally we call an *n*-feature
vector an *n-dimensional* vector. Up to three
dimensions can be visualised easily since the correspond to the
width, height and breadth of our three dimensional world. Beyond
this we can't form a mental picture of two points defined by, for
example, vectors of 36 values; however, the mathematics is the same
and is easy to understand.

The Euclidean distance measure is the simplest measure in
general use but it's not the only way of comparing two feature
vectors. Since calculating the Euclidean distance involves taking
square roots it is quite expensive to calculate and requires
floating point calculations. An alternative is to use the
*city block* or *Manhattan*
distance which is just the sum of the differences in each dimension
-- in two dimensions this is the distance of the shortest path if
you're limited to a grid as in Manhattan.

Having defined a way of comparing two feature vectors we have one method of classifying an unknown feature vector: by comparing it with a set of known prototype vectors selected to be typical of, for example, each phoneme in the language. The closest known vector will define the identity of the unknown vector. If we ignore the time varying nature of speech for the moment and imagine taking one vector per phoneme, then this might provide one way of identifying the phonemes in a speech signal.

Two problems with the prototype approach are that it assumes that we have some way of finding a prototype that is representative of each category as a whole and it takes no account of the variability of phoneme categories. Both of these problems are solveable in various ways and the technique of Gaussian modelling addresses both of them while providing a theoretically sound method of making a decision between categories.

As you've seen from the simple studies you did in SLP801 of your own vowels the parameter for one category (vowel) are clustered around a central point in the feature space (the F1/F2 plane in this case). The figure below gives an example. We can also observe that each category is differently distributed; the shape and orientation of the ellipses differs: /E/ is tightly clustered while /I/ and /@/ are more spread out and overlap significantly.

**Figure 10.1. The distribution of vowels in the F1/F2 plane. The red and
blue points show two unknown vowels.**

The first problem of which prototype vector to choose to compare unknowns to has a simple answer: we can take the mean of a set of vectors as the prototype. The mean is by definition maximally close to all of the set of points used to calculate it, with a suitable sample of phonemes the mean should be representative of the category.

The set of vectors we use to derive the mean is known as the training set, we will use this term repeatedly in our discussion of training speech recognition systems. When we come to evaluate our models we will use a second set of data, the testing set, and it's important that these be different samples as we'll discuss later.

The second problem is to take some account of the way in which the categories are distributed. Consider the two unknown points shown in Figure 10.1. The red point is much closer to the /E/ centroid than that for /@/ but it is more likely to be an example of /@/ because it is on the edge of the ellipse for that vowel whereas it is further from the ellipse for /E/. The ellipse shows the spread of vowels in the training set and the proximity of the unknown to the ellipse indicates that there were probably examples of /@/ in the training set in that region. Had there been examples of /E/ near the red point the ellipse for /E/ would have extended in that direction.

The second unknown in Figure 10.1 is within three ellipses; to determine a category for this vowel we need some way of measuring closeness to the centroid which take into account the spread of the training data. The answer lies in the Gaussian model.

To calculate the ellipses in these eplots we are in fact fitting a Gaussian model to the data. A Gaussian model is an extension of the one dimensional normal curve which you've probably come across before. The normal curve shows the distribution of values for some variable around a mean position; any normal curve can be described by two parameters, the mean, which defines the centre of the curve, and the variance which defines the width of the curve. The equation for the Gaussian curve is given in Harrington and Cassidy Chapter 9, equation 9.9 on p 247.

In more than one dimension a Gaussian model is characterised again by two values a mean, which is now a vector rather than a single value, and a covariance matrix. The covariance matrix describes not only the variance of each dimension but the way that the vary together: for example if two dimensions are correlated they will have a hight covariance value. In two dimensions the covariance matrix defines the size and orientation of the ellipse in our ellipse plots -- typically we draw these at about 2.4 times the standard devaition of the data, a value which on average will include 95% of the data points. The equation for the multi-dimensional Gaussian is again given in Harrington and Cassidy; it looks a lot hairier than the one dimensional case but as long as you realise that it's the same in essence it should give you no problems.

So the Gaussian model can be used to characterise a group of feature vectors of any number of dimensions with two values: a mean vector and a covariance matrix. How does this help in classifying unknown vectors? They key is to understand the Gaussian model as a probabilistic model. Chapter 9 goes into the basic theory of Bayesian probability theory which is based on counting the relative occurences of observations. The measure of interest is the posterior probability that a token belongs to a type given an observation ( P(type|observation) ). As the book explains, it's hard to reliably estimate this measure but we can calculate it given two other measures of the prior probabilities of the type and observation and the conditional probability P(observation|type).

The Gaussian model is one way of calculating P(observation|type). Each type is characterised by a single Gaussian and the probability is calculated by measuring the height of the curve (or more generally the value of the function) at the coordinates specified by the input vector. This conditional probability measure is converted to a posterior probability via Bayes formula using the prior probabilities of the observation and the type.

The posterior probability can be used in the same way as the Euclidean distance measure we developed earlier to compare an unknown against a set of models. However, now the models are Gaussian models rather than single points and the probability will be higher if the unknown is close to the model. The decision between types can be made by finding the model for which the probability measure is largest. This is illustrated in Figure 9.7 from Harrington and Cassidy (reproduced in Figure 10.2) which shows a one dimensional example.

**Figure 10.2. The conditional probability density for two types on a
single parameter. Point a would be assigned to
type t_{1} while point
b would be assigned to
t_{2}.**

The use of a Gaussian model gives us a solution to the second problem of taking the shape of the distribution of vectors into account. The shape of the distribution is encoded in the mean and covariance of the Gaussian curve and so the probability calculation will take account of this. In the example above (Figure 10.1) the red point will have a higher probability of belonging to /@/ since it's only 2.4 standard deviations away from the centre whereas it's more than 3 standard deviations away from the centre of the /E/ cluster.

An important point arises as we begin to discuss training statistical models of which the Gaussian model is the simplest example. As described above we are estimating a mean and covariance matrix for each type under study. For a 30 feature input vector the mean consists of 30 numbers and the covariance matrix of 30*30 or 900 numbers. A general statistical principle applies here which says that any summary statistic needs to be supported by a sufficient amount of data. For example if we have only 10 input vectors to train a particular Gaussian model then these 10 sets of 30 numbers will be the entire support for the 930 numbers required for the model; this may well cause the models to be very innacurate.

We will often seek to examine the usefulness of a new feature we've developed. Adding this new feature might add another five elements to the overall feature vector, in the example above this takes us from 930 to 35+(35*35)=1260 numbers for each model. If the same number of training vectors is used this will mean that they are spread more thinly and that our new model may be less accurate and perform more poorly, rather than better after the new parameter is added. This is known as the `curse of dimensionality' and is discussed in Chapter 9 of Harrington and Cassidy.

Any statistical model needs to be trained with an adequate amount of data. You should be aware of the number of free parameters in any model you are training and be mindful of the relation between the number of training vectors and the complexity of your model.

To summarise then, we can take a set of training vectors each representing a different type of speech sound and for each type calculate the mean and covariance matrix for a Gaussian model. This model can then be used to find the probability of any unknown vector and the unknown can be assigned to the model which provides the highest probability.

More can be said about Gaussian classification and the book has a discussion of the development of discriminant functions and variations on the Bayesian rule which you should read and digest. An important section of the book is 9.4 which discusses classification experiments using Gaussian models to compare the usefulness of two alternate feature sets. You should read this section as it forms the basis of the first practical assignment.

One application of distance measures that has been important in automatic speech recognition is known as Vector Quantisation or VQ. VQ is a data reduction method which means that it seeks to reduce the number of dimensions in the input data so that the models used to match unknowns can be as simple as possible. VQ reduces dimensionality quite drastically since it encodes each vector as a a single number.

Imagine dividing the feature space (think of the F1/F2 plane) into equal sections, say every 500 Hz. This gives us a number of regions, let's say there are 16 of them. One way of describing an F1/F2 pair is to say which square it falls into. If we did this and used the square number as the feature then we'd only have one input to our Gaussian model rather than two and the computational cost of comparing with an unknown would be much less. It should be clear that the approach of dividing into squares would not work properly since one square might bisect a natural cluster such as that for the /E/ vowel. A better approach would be to define regions based on the clusters of points on the plane.

The problem we are faced with now is how to define the shapes
of the clusters. The usual approach is to take a set of training
data and use a clustering algorithm that finds the natural cluster
centroids in the data. Such an algorithm is the k-means clustering
algorithm which is described quite cryptically in the Picone paper
we looked at last week (p. 1239). This algorithm begins by choosing
*k* vectors as starting centre points, it then
assigns each training point to the closest of these centres, finally
it works out new centre points by taking the mean of the
clusters. These three steps are repeated until the centre points
aren't changing significantly. This procedure is not guaranteed to
find the best set of clusters but in practice it does find a good
set in most speech applications. An example of a two dimensional
space split up by the k-means algorithm is shown in Figure 10.3 which was generated with the following R
commands:

library(emu) library(mva) segs <- emu.query("isolated", "jc:*", "Phonetic=vowel") data <- emu.track(segs, "fm", cut=0.5) cl <- kmeans( data[,1:2], 16, 20 ) plot( data[,1:2], col=cl$cluster )

This algorithm will work with any number of dimensions and gives us a set of cluster centres which divide up the space as naturally as possible. We can now assign any feature vector to one of these clusters and use the cluster number as an input feature instead of the original vector. Hence, instead of trying to compare two sequences of 25 dimensional feature vectors we will be comparing two sequences of integers.

An added advantage for pattern comparison is that we can pre-compute the distance between pairs of clusters. The distance between two clusters can be definied as the Euclidean distance between thier centers. Now when we need to make a distance comparison, no expensive multiplication operations are required since the value can be looked up in a table.

In order to make the VQ approach workable the number of
clusters needs to be quite large to provide sufficient
discriminatory power. It is common to use cluster sizes which are
powers of two since a common consideration is the number of binary
digits needed to represent the input string. Common value is 64 or
128 clusters which require 6 and 7 biniary digits respectively. The
set of cluster centers is described as a
*codebook* since it is used to translate feature
vectors into single numbers. You will see reference to codebook
size which refers to the number of entries or clusters in the
codebook.

When VQ is used to encode an input sequence, information is lost. We are grouping together dissimilar points and effectively representing every cluster member by the cluster mean. One way of measuring this loss of information is to look at the difference between the original input vector sequence and the quantised version. The average value of this difference is called the VQ distortion and gives a measure of how well a particular set of clusters fits a set of data. The larger the codebook the smaller the VQ distortion since each vector can be assigned to a closer cluster centre.

We will revisit VQ codebooks when we begin to talk about matching sequences of feature vectors against stored patterns.

We discussed sources of variability last week in relation to desirable feature sets for speech signals. What would be very nice would be a technique for separating out the useful sources of variability from the noisy ones. The source filter model provides one such technique, preserving variability due to the filter while discarding information about the source. A more general statistical technique is known as Principal Components Analysis (PCA).

PCA is covered in Chapter 9 of Harrington and Cassidy and is described there as “a data-reduction method that finds an alternative set of parameters for a set of tokens such that most of the variability in the data is compressed down to the first few parameters”. The thing that makes PCA useful is knowing that the various sources of variability in our data are independant of one another.

**Figure 10.4. A two dimensional space showing the direction of maximum
variance. PCA seeks to rotate the space such that this direction is
parallel to the x-axis.**

Mathematically, PCA finds a rotation of the feature space that places most of the varaibility in the data in the first dimension. This can be illustrated best in two dimensions (Figure 10.4). The figure shows some vowel data on the F1/F2 plane and indicates the direction in which the variance of the data is a maximum. PCA would seek to rotate the space such that this direction was the new x-axis. The idea is that now that most of the variance is concentrated in one dimension, the second dimension might not be needed to separate the types under study.

In higher dimensional spaces we can't picture the rotation but
it happens in the same way. The PCA process finds the direction with
the most variance and rotates the space such that this is now the
first dimension. It then finds the direction with the next largest
variance and rotates the space such that this is the second
dimension. This process continues until all dimensions are accounted
for. The result is a new feature space with *the same
number of dimensions* as the original space but where the
variance is concentrated in the lower order dimensions.

The point of this process is to be able to safely discard some of the higher order dimensions and this achieves two goals: firstly it may remove some noisy sources of variability, secondly it reduces the dimensionality of the input which makes our statistical models simpler.

**Figure 10.5. Two plots showing vowel data. On the left the first two
principal components, on the right automatically tracked formants.
Each plot shows averaged values over the same set of 172 vowels
taken from one male and one female speaker.**

As an example, we know that the major source of variability in vowel data is the variation in place of articulation -- height and backness. All other sources of variability are of a much smaller magnitude. If we take bark scaled spectral data averaged over the length of each vowel (ie. one Bark scaled spectrum per vowel), perform a PCA and plot the first two principal components we get the example shown in Figure 10.5. The first two dimensions of the rotated space can be seen to be very closely related to the first two formants as shown in the second plot in Figure 10.5. The first principal component correlates well with vowel `backness' while the second PC correlates with vowel `height'. Note that the units and absolute values of these two features are no longer significant. PCA can achieve a similar separation of vowels in two dimensions as automatically tracked formants but without making any assumptions about the nature of the signal under study (ie. this is not a `model based' analysis method).

PCA isn't generally used in speech recognition systems but it is a useful technique to apply when studying speech acoustics. You should follow this topic up in the text, Chapter 9.

**Table of Contents**

- 11.1. The Problem with Time
- 11.2. Dynamic Time Warping
- 11.3. Practical DP Matching
- 11.4. Additional Topics

Speech is a temporal signal; by this I mean that the Linguistic information encoded in an acoustic speech signal is encoded in both the properties of the signal at any given instant and the way that those properties change over time. A case in point is a diphthongal vowel which can only be identified by looking at the way that it changes over time -- moving from one target location to another. To recognise a continuous speech signal we must develop a method for comparing sequences of observations with stored patterns.

In a later session, we will look at Hidden Markov Models which are statistical models of observation sequences. They provide a method of directly estimating the conditional probability of an observation sequence given a hypothesised identity for the sequence. An HMM is trained using example sequences and can be thought of as the extension of the Gaussian model into the temporal dimension.

In this session we will look at an older technology which provides some insight into the problem of matching sequences efficiently.

One of the earliest approaches to isolated word speech recognition was to store a prototypical version of each word (called a template) in the vocabulary and compare incoming speech with each word, taking the closest match. This presents two problems: what form do the templates take and how are they compared to incoming signals.

The simplest form for a template is a sequence of feature vectors -- that is the same form as the incoming speech. We will assume this kind of template for the remainder of this discussion. The template is a single utterance of the word selected to be typical by some process; for example, by choosing the template which best matches a cohort of training utterances.

Comparing the template with incoming speech might be achieved
via a pairwise comparison of the feature vectors in each. The total
distance between the sequences would be the sum or the mean of the
individual distances between feature vectors. The problem with this
approach is that if a constant window spacing is used, the lengths
of the input and stored sequences is unlikely to be the
same. Moreover, within a word, there will be variation in the length
of individual phonemes: *Cassidy* might be
uttered with a long /A/ and short final /i/ or with a short /A/ and
long /i/. The matching process needs to compensate for length
differences and take account of the non-linear nature of the length
differences within the words.

The Dynamic Time Warping algorithm achieves this goal; it finds an optimal match between two sequences of feature vectors which allows for streached and compressed sections of the sequence. The paper by Sakoe and Chiba (Dynamic Programming Algorithm Optimisation for Spoken Word Recognition) gives a detailed description of the algorithm; I will summarise it here.

We can arrange the two sequences of observations on the sides of a grid (Figure 11.1) with the unknown sequence on the bottom (six observations in the example) and the stored template up the left hand side (eight observations). Both sequences start on the bottom left of the grid. Inside each cell we can place a distance measure comparing the corresponding elements of the two sequences.

To find the best match between these two sequences we can find a path through the grid which minimises the total distance between them. The path shown in blue in Figure 11.1 gives an example. Here, the first and second elements of each sequence match together while the third element of the input also matches best against the second element of the stored pattern. This corresponds to a section of the stored pattern being stretched in the input. Similarly, the fourth element of the input matches both the second and third elements of the stored sequence: here a section of the stored sequence has been compressed in the input sequence. Once an overall best path has been found the total distance between the two sequences can be calculated for this stored template.

The procedure for computing this overall distance measure is to find all possible routes through the grid and for each one of these compute the overall distance. The overall distance is given in Sakoe and Chiba, Equation 5, as the minimum of the sum of the distances between individual elements on the path divided by the sum of the warping function (which will be discussed later). The division is to make paths of different lengths comparable.

It should be apparent that for any reasonably sized sequences, the number of possible paths through the grid will be very large. In addition, many of the distance measures could be avoided since the first element of the input is unlikely to match the last element of the template for example. The DTW algorithm is designed to exploit some observations about the likely solution to make the comparison between sequences more efficient.

The major optimisations to the DTW algorithm arise from observations on the nature of good paths through the grid. These are outlined in Sakoe and Chiba and can be summarised as:

*Monotonic condition*: the path will not turn back on itself, both the i and j indexes either stay the same or increase, they never decrease.*Continuity condition*: The path advances one step at a time. Both i and j can only increase by 1 on each step along the path.*Boundary condition*: the path starts at the bottom left and ends at the top right.*Adjustment window condition*: a good path is unlikely to wander very far from the diagonal. The distance that the path is allowed to wander is the window length*r*.*Slope constraint condition*: The path should not be too steep or too shallow. This prevents very short sequences matching very long ones. The condition is expressed as a ratio n/m where m is the number of steps in the x direction and m is the number in the y direction. After m steps in x you must make a step in y and vice versa.

By applying these observations we can restrict the moves that can be made from any point in the path and so restrict the number of paths that need to be considered. For example, with a slope constraint of P=1, if a path has already moved one square up it must next move either diagonally or to the right.

The power of the DTW algorithm goes beyond these observations though. Instead of finding all possible routes through the grid which satisfy these constraints, the DTW algorithm works by keeping track of the cost of the best path to each point in the grid. During the match process we have no idea which path is the lowest cost path; but this can be traced back when we reach the end point.

A path through the grid is written in the paper as
*F=c(1),c(2)...c(K)*, the generalised element of
the path is *c(k)* and this consists of a pair
of coordinates in the *i* (input) and
*j* (stored) directions. The i coordinate of the
kth path element is *i(k)*.

The weighting function *w(k)* introduced
into the overall distance measure is used to normalise for the path
length. Two alternate weighting functions are presented: symmetric
and asymmetric. Both functions are derived from the distance
travelled (in grid units) in the last step of the path. The
symmetric form combines the i and j directions while the asymmetric
form uses just the i direction.

If the path has just made a diagonal step then i and j both increase by 1 and the symmetric w(k) = 1+1 = 2; the asymmetric w(k) = 1; The sum of this function over the length of the path gives a measure of how long the path is. This is used in normalising the overall distance measures.

The Dynamic Programming algorithm is a well known optimised search algorithm which finds the least cost path in a grid. It works by first evaluating the least cost distance to reach every square in the grid and then tracing back the path which corresponds to the overall least cost distance.

The DP match algorithm presented in the paper for the simple case of no slope constraint (P=0) can be stated as follows:

Initialise an array

*g*to hold the lowest cost path to each grid square.Set g(1,1) = 2d(1,1) -- ie the lowest cost to the start square is twice the distance between the two first elements (the 2 is w(1) for the symmetric form of w(k)).

For each row i and column j, calculate g(i,j) as the minimum of the costs of the three possible ways to get to (i,j) plus w(k)*d(i,j). Don't consider points where |i-j| >= r -- ie outside the adjustment window.

The overall distance is 1/N * g(I,J) where I and J are the lengths of the input and stored patterns respectively.

This algorithm involves one simple calculation per grid square and relies on the observation that there are three ways to get to any grid square (horizontally, vertically or diagonally). In the end we have a measure of the cost of the best path but we don't know the path that gave rise to it. In most situations we're not really interested in the path but if we do want it it can be reconstructed by keeping track at each grid square which of the three directions gave the minimum cost path. The path can then be traced back from the endpoint.

When the slope constraint is not zero, the algorithm is the same but the calculation of g(i,j) is a little more complicated. The different combinations are shown in Table I of Sakoe and Chiba.

One issue that arrises in isolated word recognition is separating speech from silence in the input signal. Each template and input sequence must correspond to a single word; the user is required to pause between words to make this segmentation possible.

One simple method of making a speech/nospeech decision is to use a combination of zero crossings and RMS energy. Together these two features provide a reasonable separation of speech and silence since low energy speech (fricatives) tends to have high zero crossing rates and low ZCR speech (vowels) tend to have high energy. The speech signal is segmented into words before being passed to the front end of the recognition system.

So far I have assumed that a single recording of each word is used as the basis for the stored template in a DTW based recogniser. This approach will not be very robust since it takes no account of the variability of different utterances and does not ensure that the template is representative of the class as a whole. A number of methods have been proposed which seek to address these issues.

The DTW technique is obviously suited to isolated word recognition since the start and end of the words needs to be known before the match can proceed. Advanced versions of DTW were developed to match sequences of words in connected speech. One approach involved a two level dynamic programming search where word matches were evaluated at different points in the input (Sakoe, 1979, Two level DP matching, IEEE Transactions on Acoustics Speech and Signal Processing 27(6):588-595).

This approach has been superceded by the HMM approach but in fact the search for interpretations in an HMM is a very similar process to the DP match used here.

**Table of Contents**

- 12.1. Temporal Signals
- 12.2. Applying HMMs

This session introduces the most common current technology in speech recognition, the Hidden Markov Model. My notes for this topic are included with the readings you were sent -- they have too much mathematics in them to present as a web page. Here, I will just list the general topics covered in those notes and add additional material when we go beyond them to other readings.

The general problem addressed by the HMM is to build a probabilistic model of a sequence of observations. The HMM can be seen as an extension of the simple Gaussian model which combines the probabilities of a sequence of observations into an overall probability score. The first section of the (written) notes extends the simple Gaussian model to cope with sequences and introduces the idea of computing a probability measure for a sequence of observations using a single state HMM.

More complex HMM structures are used when the signal is not
uniform along its length. A two state model might have one state
which models the initial part of a diphthongal vowel and a second
state which models the final part. The transition probabilities
between states control the time spent in each state while accounting
for an utterance. Adding another state means that it is now not
possible to tell which state accounts for which part of the input
signal, it is in this sense that HMMs are
*Hidden*. All of the possible paths through the
model contribute to the overall probability estimate and there is no
sense in which any one path was the `right' path.

Training an HMM is more complex than training a Gaussian Model especially with more than one state since we don't know which state is responsible for which part of the input sequence. The solution is to make a guess at the state sequence and use that in assigning input vectors to states for training purposes.

We now discuss the three algorithms involved in making HMMs work. These are:

*Estimating conditional probability*. Calculates the probability of the input sequence given a model.*Find the best path through the model.*That is, find the path which most closely matches the input sequence. This enables us to assign states to input vectors for training.*Train a model.*Estimates the Gaussian parameters (means and covariances) and transition probabilities to best account for a data set.

At this point we have covered the basic functionality of the Hidden Markov Model. Given a set of input sequences representing one class of speech sound (eg. a phoneme, syllable or word), we can train a model on the data and then estimate the conditional probability of an unknown sequence given the model. However, this is only the beginning of our treatment of HMM based speech recognition. We still need to understand how these low level HMMs are used to recognise conected speech and how they are used as components in a real speech recognition system.

In this session we will look at some of the details of applying the HMM algorithms to speech recognition. These include selecting the size of unit to model and connecting models together to make words and phrases. As you can understand there are many details which can't be covered here; we will just cover some of the simpler issues.

A Hidden Markov Model is a statistical model of a sequence of feature vector observations. In building a recogniser with HMMs we need to decide what sequences will correspond to what models. In the very simplest case, each utterance could be assigned an HMM: for example, one HMM for each digit in a digit recognition task. This follows the pattern of the template based DTW recogniser we discussed earlier. To recognise an utterance, the probability metric according to each model is computed and the model with the best fit to the utterance is chosen. However, this approach is very inflexible and requires that new models be trained if new words are to be added to the recogniser. A more general approach is to assign some kind of sub-word unit to each model and construct word and phrase models from these.

The most obvious sub-word unit is the phoneme. If we assign each phoneme to an HMM we would need around 45 models for English; an additional model is also created for silence and background noise. Using this approach, a model for any word can be constructed by chaining together models for the component phonemes.

Each phoneme model will be made up of a number of states; the number of states per model is another design decision which needs to be made by the system designer. Each state in the model corresponds to some part of the input speech signal; we would like the feature vectors assigned to each state to be as uniform as possible so that the Gaussian model can be accurate. A very common approach is to use three states for each phoneme model; intuitively this corresponds to one state for the transition into the phoneme, one for the middle part and one for the transition out of the phoneme. Similarly the topology of the model must be decided. The three states might be linked in a chain where transitions are only allowed to higher numbered states or to themselves. Alternatively each state might be all linked to all others, the so called ergodic model. These two structures are common but many other combinations are clearly possible.

When phoneme based HMMs are being used, they must be concatenated to construct word or phrase HMMs. For example, an HMM for `cat' can be constructed from the phoneme HMMs for /k/ /a/ and /t/. If each phoneme HMM has three states the `cat' HMM will have nine states. Some words have alternate pronunciations and so their composite models will have a more complex structure to reflect these alternatives. An example might be a model for `lives' which has two alternatives for the pronunciation of 'i'.

While phoneme based models can be used to construct word
models for any word they do not take into account any contextual
variation in phoneme production. One way around this is to use
units larger than phonemes or to use context dependant models. The
most common solution is to use *triphone models*
where there is one distinct phoneme model for every different left
and right phoneme context. Thus there are different models for the
/ai/ in /k-ai-t/ and in /h-ai-t/. Now, a word model is made up
from the appropriate context dependant triphone models: 'cat' would
be made up from the three models [/sil-k-a/ /k-a-t/ /a-t-sil/].

While the use of triphones solves the problem of context
sensitivity it presents another problem. With around 45 phonemes in
English there are 45^{3} = 91125 possible
triphone models to train (although not all of these occur in speech
due to phonotactic constraints). The problem of not having enough
data to effectively train these models becomes very important. One
technique (state tying) will be discussed in the next section but
another is to use only *word internal* triphones
instead of the more general *cross word*
triphones. Cross word triphones capture coarticulation effects
accross word boundaries which can be very important for continuous
speech production. A word internal triphone model uses triphones
only for word internal triples and diphones for word final
phonemes; `cat' would become: [sil /k-a/ /k-a-t/ /a-t/ sil]. This
will clearly be less accurate for continous speech modelling but
the number of models required is smaller (none involve silence as a
context) and so they can be more accurately trained on a given set
of data.

There are many free parameters in an HMM, there are some tricks to reducing the number to allow better training. I'll discuss two of these: state tying and diagonal covariance matrices.

A single HMM contains a number of free parameters whose
values must be determined from training data. For a fully
connected three state model there are nine transition probabilities
plus the parameters (means and covariance matrices) of three
Gaussian models. If we use 24 input parameters (12 MFCCs plus 12
delta MFCCs) then the mean vector has 24 free parameters and the
covariance matrix has 24^{2}=576 free
parameters making 609 in all. Multiply this by 45 phoneme models
and there are 27,405 free parameters to estimate from training
data; using context sensitive models there are many more (around
2.5 billion). With this many free parameters, a very large amount
of training data will be required to get reliable statistical
estimates. In addition, it is unlikely that the training data will
be distributed evenly so that some models in a triphone based
recogniser will recieve only a small number of training tokens
while others will recieve many.

One way of addressing this problem is to share states between triphone models. If the context sensitive triphone models consist of three states (for example) then we might assume that for all /i/ vowel models (/i/ in all contexts) the middle state might be very similar and so can be shared between all models. Similarly, the initial state might be shared between all /i/ models preceeded by fricatives. Sharing states between models means that the Gaussian model associated with the state is trained on the data assigned to that state in both models: more data can be used to train each Gaussian making them more accurate. The limit of this kind of operation is to have all /i/ models share all states, in which case we have reduced the models to context insensitive models again. In practice, states are shared between a significant number of models based on phonetic similarity of the preceding and following contexts.

The covariance matrix associated with each Gaussian inside each state measures both the amount of variance along each dimension in the parameter space and the amount of co-variance between parameters. In two dimensions this co-variance gives us the orientation and 'stretch' of the ellipse shape. One simplification that can be made is to ignore the co-variance part of the matrix and only compute and use the individual dimension variances. In doing this we retain only the diagonal part of the co-variance matrix, setting all of the off-diagonal elements to zero. While this simplification does lose information it means a significant reduction in the number of parameters that need to be estimated. For this reason it has found favour in some HMM implementations.

Choosing to build phoneme or triphone based models means that
to recognise words or phrases we must make composite HMMs from
these subword building blocks. A model for the word
*cat* can be made by joining together phoneme
models for /k/ /a/ and /t/. If each phoneme model has three states
then this composite model has nine states but can be treated just
like any other HMM for the purposes of matching against an input
observation sequence.

To be able to recognise more than one word we need to construct models for each word. Rather than have many separate models it is better to construct a network of phoneme models and have paths through the network indicate the words that are recognised. An example can be seen in Figure 12.1 where the phonemes are arranged in a tree, each leaf of the tree corresponds to a word in the lexicon.

**Figure 12.1. An example lexical tree which links together phoneme models
in a network such that alternate paths through the network
represent different words in the lexicon. Figure taken from
Deshmukh, Ganapathiraju and Picone (IEEE
Signal Processing Magazine, vol. 16, no. 5, September
1999).**

The tree of phoneme models ensures that any path through the network corresponds to a real word. Each open circle in the figure represents a single HMM which might consist of three states. Each solid circle corresponds to a word boundary. Cases of multiple pronunciations (for `A') and homonyms (`hart' and 'heart') can be seen in this network. If triphones models are being used this network would be expanded (an example is shown in the paper which we'll look at in more detail when we talk about language models).

For connected word recognition this network can be extended
to link words together into phrases such that the legal paths
through the network correspond to phrases that we want to
recognise. This begs the question of what phrases are to be
allowed; the answer lies in what is called a *language
model* who's job is to define possible word sequences
either via definite patterns such as grammar rules or via
statistical models. The language model defines the shape of the
network of HMMs; in anything but the simplest cases this network
will be extremely complicated. While it is useful to think of a
static, pre-constructed network being searched by the Viterbi
algorithm, in a real recogniser the network is constructed as it is
searched. We will visit language modelling again in more detail in
a later session.

In the paper by Picone in your reading pack (Continuous Speech Recognition using Hidden Markov Models, IEEE ASSP Magazine, July 1990) there is an outline of the practical steps needed to train an HMM based recogniser from scratch. I'll review these steps here.

Assuming that we have designed our individual phoneme or triphone HMMs then the first step is to initialise the free parameters in the models. These parameters are the transition probabilities between states and the means and covariance matrices of the Gaussian models associated with each state.

The importance of this step cannot be overstated. Remember that our next step will be to begin supervised training where we will force align the models with speech samples and update the model parameters to better fit that segmentation. If we begin with a poor set of parameters (for example, by choosing random values for each parameter) the forced aligment will be unlikely to assign appropriate phonemes to each model and so the model will be unlikely to improve itself. HMM training can be thought of as a search for the lowest point on a hilly landscape; if you begin from a point close to the lowest point you may well find it but if you begin from a point close to a higher dip, you may get stuck there and be unable to see the better solution over the horizon.

The standard way to initialise the Gaussian models in each state is to use a small amount of hand segmented data and align it to each model. In a three state model, each state might be given four vectors of a twelve vector input sequence corresponding to one token. In this way the Gaussians are initialised to approximate the distributions for each phoneme. Transition probabilities are less troublesome and are usually initialised to equal values.

This is the largest part of training the low level HMM parameters. The raw data are speech recordings for which word transcriptions are available. The supervised training procedure looks at each utterance in turn, constructs an HMM corresponding to that utterance from the sub-word constituent models, recognises the utterance with the HMM and finally updates the statistical estimates needed for training. After all of the utterances have been treated in this way, the parameters of the constituent models are updated according to the statistics collected. This process is repeated until the changes to the HMM parameters on each iteration are very small.

As Picone points out there may be further steps to training
an HMM based recogniser. He discusses the use of a discriminant
method which tries to make the HMM better at rejecting the most
confusable near miss errors. Another interesting possibility is
the use of untranscribed speech data for training. As Picone says
"there's no data like more data", some researchers have
experimented with training HMM systems on large amounts of
untranscribed data. Since there is no known correct recognition
result the recogniser must use it's general language model to
recognise each utterance. Statistics are only collected where the
degree of certainty of the match is good. These experiments have
shown that training on large quantities (50+ of hours) of
untranscribed speech can be useful (Kemp & Waibel,
*Unsupervised Training of a Speech Recognizer: Recent
Experiments*, Eurospeech 99).

**Table of Contents**

- 13.1. Neural Networks
- 13.2. Hybrid HMM Recognisers
- 13.3. Resources

Artificial Neural networks (ANN) began life as an attempt at simulating the processing that goes on in the brian. Many simple neurons are connected together and computational tasks are performed via thier interactions. The simple neurons modelled in ANNs are so simple as to render the analogy with real biological neurons nonsense; however the emergent properties of these kinds of computing machines are very interesting in themselves and so ANNs are studied as abstract computing systems rather than as biological models.

A neural network consist of many simple processing units each
of which is connected to many other units. Each unit has a numerical
activation level (analogous to the firing rate of real neurons). The
only computation that an individual unit can do is to compute a new
activation level based on the activations of the units it is
connected to. The connections between units are weighted and the new
activation is usually calculated as a function of the sum of the
weighted inputs from other units (Figure 13.1). Some units in a network are usually
designated as *input units* which means that
their activations are set by the external environment. Other units
are *output units*, their values are set by the
activation within the network and they are read as the result of a
computation. Those units which are neither input nor output units
are called *hidden units*.

**Figure 13.1. A simple neural network. The activation of unit A5 is
calculated as the weighted sum of the four units which are
connected to it. **

One of the major applications of neural networks is in pattern
matching. A simple network can be constructed where a set of input
units are connected directly to a single output unit (the network in
Figure 13.1 could be an example) . If the weights
on the connections are selected appropriately, the output node could
be made to have a high activation if a desired pattern (say all
inputs off) were present and a low activation otherwise. The trick
is clearly to find the combination of weights to achieve this. This
problem is not dissimilar to the problem of training an HMM since
there is no way to calculate these values they must be estimated and
the solution optimised using an iterative procedure. There are many
training procedures for neural networks, a very common one is called
*backpropagation*, all try to minimise the
difference between the desired output and the achieved output by
modifying the weights in the network.

As a small example of how the training algorithm might work, consider trying to modify a network so that two units have high activations together (ie. when A is high, so is B). If we observe A high and B low then we can increase the weight on the link between them so that more activation will flow into B. If we want the reverse, then we should decrease the weight, even making it negative so that A being high will tend to drive down the activation of B. By making small changes in this way the optimisation algorithms are able to find solutions to quite complex input output mappings.

There are some important theoretical limits to the capabilities of neural networks. A network consisting of an input layer directly connected to an output layer can be trained to respond to many kinds of input patterns but there are some important patterns that can't be captured. The simplest example is the exclusive-or problem. We require a network who's output is 1 when one or other of its two inputs are 1 but who's output is zero otherwise. It is not possible to find values for weights which implement this function in a single layer network. The reason is that there are not enough degrees of freedom in the network to partition the input space correctly. The space of this problem can be seen as an x-y plane corresponding to the two inputs (Figure 13.2). For the simpler problems of AND and OR we can draw a single line between the input cases where the output should be zero and those where the output should be 1. For exclusive or (XOR) it isn't possible to draw such a line. The mathematics of the neural network is such that the weights on the two links in this network define the line in the input space. Since there is no way to draw a line in the XOR case there is no combination of weights which will solve the problem.

**Figure 13.2. A three unit neural network and the subdivision of the space
required to solve three simple problems.**

The solution to the XOR problem is to add more variables to
the network. This can be done by adding a *hidden
unit* to the network between the input and the output
(Figure 13.3). This adds four more variables to
the system making six in all (the three new weights plus the
activation of the hidden unit); the system must now find a plane in
this six dimensional space which separates the input
patterns.

The lesson learned here is that there are many patterns which
can't be captured in a single layer neural network and that adding
hidden nodes between input and output adds degrees of freedom to the
network which enable solutions to these problems. The class of
patterns which can be solved by a single layer network are called
*linearly separable* problems because we can draw
a single line between the cases.

The general structure of a neural net then is a set of input nodes who's value comes from the environment (eg. from speech data), a set of hidden nodes and a set of output nodes. The values of the output nodes are set in training and the weights in the network are adjusted to mininmise the error in the system. This optimisation procedure is able to capture many complex mappings between input and output including measures of posterior probability, which is why they are of interest in HMMs.

One of the issues faced when trying to apply neural network approaches to speech signals is the inherent temporal nature of the speech signal. Speech is a signal in time and we know that many of the features that are important in decoding speech have a temporal dimension: for example, the trajectory of a diphthongal vowel. Neural networks as presented above can only take a static input for classification, equivalent to the way that we used gaussian models for classifying spectral slices. Another approach is needed to enable networks to model time effectively.

The most common solution to this problem is exemplified by recurrent neural networks. In this model, the sequence of inputs representing, for example, a parameterised speech signal, is fed into the network one at a time. At each stage the activation of the hidden and output nodes are calculated as usual. The additional feature is that the activation of some of the network is fed back to the inputs at each step. This activation commonly comes from some or all of the hidden units in the network and provides some measure of context sensitivity: the network is reacting not only to the current input but to it's own reaction to the previous input. In this way, networks are able to learn to classify patterns in with significant temporal components.

The standard HMM formalisation makes a number of simplifying assumptions in order to make the approach feasable both mathematically and computationally. In particular:

Sequences are modelled as a first-order markov chain with dependancies only one step back in the sequence.

Within a state, no contextual information is used to determine the emmision probability for a particular vector. That is, the gaussian models only the distribution of vectors at one time point with no reference to what has come before or what follows.

Probability distributions are modelled as single gaussians or mixtures of gaussians.

The HMM formalism is based on estimates of the
*posterior probability* that an acoustic vector
sequence V was produced by the model M (P(M|V)). During training,
values are selected for the free parameters in the model so that
P(M|V) is miximised for the given set of input vectors. This is
called the *Maximum A Posteriori* or MAP
criterion. However, since we use gaussian models which estimate the
conditional probability P(V|M) of a vector sequence given the model
we must use Bayes rule to calculate P(M|V).

P(V|M) is often called the liklihood of the data given the model. P(M) is the prior probability of the model; I've glossed over this term and said that we assume all models to be equally likely, however this term is in fact estimated from the language model which gives a probability measure for any given sequence of phonemes or words. P(V) is the prior probability of the vector sequence and is assumed to be constant in the HMM approach.P(M|V) = P(V|M) * P(M) / P(V)

The use of gaussian models in the HMM formalism mean that instead of using the MAP criterion for training we are using a slightly weaker criterion known as Maximum Liklihood Estimation (MLE). This is because the parameters of the gaussian are always adjusted to maximise P(V|M).

An alternative to this approach is to directly estimate P(M|V)
and maximise this quantity during training. One constraint on this
alternative is that the probabilities for all models for any vector
sequence must sum to 1. If this constraint is to be upheld, the
model parameters must be adjusted to reduce the probability of
inappropriate vectors as well as increase the probability of correct
ones. For this reason, these networks are called
*Discriminant HMMs* since they learn to
discriminate between categories rather than just estimate their
conditional probability.

Our treatment of the mathematics of the standard HMM glossed over many details, some of which are worth covering to understand the significance and the motivation behind Discriminant networks.

One of the elements in calculating the overall probability
of an input sequence given a model is called the *local
contribution* (Bourland & Morgan, p35). This quantity
represents the contribution to the overall probability measure made
by a single state accounting for a single observation. It can be
written as:

Or the probability that state qP( q

_{l}^{n}, x_{n}| q_{k}^{n-1}, X_{1}^{n-1}, M )

The local contribution can be factored into two components:
one representing the state transition and one representing the
emmision of the observation. The equations are given in Bourland &
Morgan as Eq. 3.7. These equations are simplified by two hypotheses
in the standard HMM approach. Firstly, the HMM is assumed to be a
first order markov chain (Bourland & Morgan, H3, p 36) so that the
probability of the model being in state q_{l}
at time n is assumed to depend only on the previous state and not
on the input before time n. This simplifies the first part of the
local contribution to be a single fixed transition probability.
The second assumption is that the observations in the sequence are
independant (Bourland & Morgan, H4) which removes the dependance on
the input history from the second part of the local contribution.
Finally the probability of observing x_{n} is
assumed to depend only on the current state (Bourland & Morgan H5),
giving the final simplified form of the local contribution:

This corresponds to the transition probability multiplied by the state emmision probability. This then is the foundation of the standard HMM approach; these terms are used as the building blocks for computing the overall probability of a vector sequence given a model using the forward-backward algorithm.P( q

_{l}^{n}|q_{k}^{n-1}) * P( x_{n}| q_{l})

In the discriminant HMM model the local contribution is not factored into separate components for transition and emmision probabilities. The probability of a state sequence given an input sequence (which is part of the overall calculation of P(M|V)) is expressed as the product of the local contributions from each state, each of which depends on all previous states and all of the input:

Note that these local contributions (p(qP(q

_{1},...,q_{N}|X) = p(q_{1}|X)*p(q_{2}|X, q_{1})...p(q_{N}|X,q_{1},...,q_{N-1})

In the Hybid recogniser, these local contributions are computed with a recurrent neural network which looks at a window of the input data and feeds back state information from the output to the input. In the model there is just one neural network which has many outputs, one output per state. Typically these outputs correspond to phonemes meaning that the individual phoneme models can be very simple, consisting of a single state with a loopback transition. The output of the neural network at every tick of the clock corresponds to the probability density for each state in the overall model. The constraints on the neural network and in the mathematics of the discriminant HMM mean that the outputs of the neural network sum to one. Figure 13.4 shows an example of the output of a neural network in reaction to the words "one two three".

**Figure 13.4. Individual phone class posterior probabilities as computed
by an MLP for the words one two three. Time
follows the horizontal axis and the duration of the utterance is
1.5 seconds. Taken from Robinson, Cook, Ellis, Fosler-Lussier,
Renals and Williams Connectionist Speech Recognition of
Broadcast News, Speech Communication, 2000.**

The neural network calculates the local contribution for each state in the model at each time point in the input. If you recall the viterbi algorithm this is exactly what is needed to calculate the various quantites used in finding the best path through a network. The discriminant HMM framework does require some changes to the details of the various algorithms (which are discussed in the Bourland and Morgan chapters) but in essence they are similar.

Hybrid HMM/ANN speech recognition is an active research area and there are a number of overview publications which are useful in gaining an understanding of the topic. I will distribute copies of two chapters from:

The two chapters outline the traditional HMM approach and the Hybrid approach. This book is quite technical but does provide a good discussion of the whole field. The remaining chapters discuss neural networks in detail and put everything together into a complete hybrid recognition system.Bourlard & Morgan,

Connectionist Speech Recognition: A Hybrid Approachpublished in 1994 by Kluwer.

The following web pages might also be useful:

**Table of Contents**

- 14.1. The Job of a Language Model
- 14.2. LM Driven Search in ASR

In our treatment of HMM speech recognition so far we have dealt
with the low level matching of acoustic signals against models. I
have noted that for word and sentence recognition, sub-word models
need to be concatenated into larger composite models. The way that
word models are connected together into phrases is governed by a
knowledge of the phrases that are to be accepted by the recogniser;
it is the job of the *language model* to define
this set of phrases or at least define the liklihood of one word
following another.

In the Bayesian decision rule which we use to decide on the interpretation of an acoustic signal:

the P(M) component measures the probability of the composite HMM representing an utterance. This component is independant of the input acoustic signal and is estimated from a knowledge of the probability of different word strings in the target language.P(M|V) = P(V|M) * P(M) / P(V)

In practice, the language model is used to guide the search for an interpretation of the acoustic input. When word models are pasted together to form phrase models the language model is used to define which words follow at each point in the model and the transition probability from one word to the next. For large vocabulary recognisers, this word lattice isn't constructed beforehand, it is built as the search progresses and the language model is used to determine the overall probability of the search paths under consideration.

The main product of a language model is an estimate of the probability of a word given the word sequence that has already been observed:

This quantity can be used in the viterbi search algorithm to update the estimate of a path probability based on the knowledge of which words are part of that path. Clearly estimating this probability is impossible since the word sequence can be arbitrarily long; language models differ in the assumptions that they make to simplify this quantity.P(w

_{n}|w_{1},w_{2},w_{3}...w_{n-1})

The simplest case of a language model occurs when the range of sentences to be recognised is very small and can be captured by a deterministic grammar. There are not many situations where this is appropriate. One is in a limited domain dialog system such as those implemented by the CSLU Toolkit. In this system a small grammar is given for each point in a human-computer dialog, a simple example is the grammar for digit strings:

$digit = one | two | three | four | five | six | seven | eight | nine | zero | oh ; $digits = [*sil%% | *any%%] *$digit [*sil%% | *any%%];This is a very unconstrained grammar, any word can follow any other word. The special words

In the deterministic language model, the probability of a word given it's predecessors is either zero or some constant probability estimate.

An alternitive to the deterministic grammar is to attach some measure of probability to each of the grammar rules. This might be derived from data or assigned by the grammar writer based on thier intuition. With this information it is now possible to estimate the probability of any word string allowed by the grammar. This estimate can then be included in the search criteria of the HMM.

While grammar based language models are easy to understand,
they are not generally useful for large vocabulary applications
simply because it is so difficult to write a grammar with
sufficient coverage of the language. The most common kind of
language model in use today is based on estimates of word string
probabilities from large collections of text or transcribed speech.
In order to make these estimates tractable, the probability of a
word given the preceeding sequence is approximated to the
probability given the preceeding one (bigram) or two (trigram)
words (in general, these are called *n-gram*
models). For a bigram language model:

P(w

_{n}|w_{1},w_{2},w_{3}...w_{n-1}) = P(w_{n}|w_{n-1})

To estimate the bigram probabilities from a text we must
count the number of occurences of the word pair
(w_{n-1}, w_{n}) and divide
that by the number of occurences of the preceeding word
w_{n-1}. This is a relatively easy computation
and accurate estimates can be obtained from transcriptions of
language similar to that expected as input to the system. For
example, if we are recognising new stories, text such as the Wall
Street Journal corpus can be used to estimate bigram probabilities
for the language model. This model is unlikely to transfer very
well to another domain such as train timetable enquiries; in
general each application requires the language model to be fine
tuned to the language input expected.

The bigram language model gives the simplest measure of word
transition probability but ignores most of the preceeding context.
It is easy to come up with examples of word sequences which will be
inproperly estimated from a bigram model (for example, in "The dog
on the hill *barked*", the probability of
*barked* following *hill* is
likely to be underestimated). The more context a language model
can use the more likely it is to be able to capture longer distance
dependancies between words in the input. Unfortunately we hit a
severe problem with making estimates of probabilities for anything
beyond trigram language models, and even there special care must be
taken to deal with word sequences where data is scarce. In a
trigram language model the probability of a word given it's
predecessors is estimated by the probability given the previous two
words:

To estimate this quantity we must count the number of times the triple of (wP(w

_{n}|w_{1},w_{2},w_{3}...w_{n-1}) = P(w_{n}|w_{n-2},w_{n-1}n)

To overcome this paucity of data the technique of language
model *smoothing* is used. Here the overall
trigram probability is derived by interpolating trigram, bigram
and unigram probabilities:

Where the functions f() are the unsmoothed estimates of trigram, bigram and unigram probabilities. This means that for a triple which does not occur in the training text, the estimated probability will be derived from the bigram model and the unigram model; the estimate will be non-zero for every word included in the lexicon (ie. every word for which there is an estimate of P(w)). The choice of the parameters k1..k3 is another optimisation problem.P(w

_{n}|w_{n-2},w_{n-1}) = k1*f(w_{n}|w_{n-2},w_{n-1}) + k2*f(w_{n}|w_{n-1}) + k3*f(w_{n})

As mentioned above, the job of the language model in ASR is to predict which words might come next in sequence as we decode the speech signal. This prediction might be in the form of a list of words from a grammar based model or a list of words and probabilities from an N-gram based model. In either case, this information must be used in the recogniser to guide the search for an interpretation of the input signal.

We have discussed (Section 12.2.3) how
continuous speech is recognised by constructing a composite HMM
corresponding to the legal utterances. The input signal is then
force aligned with this HMM using the Viterbi algorithm and the best
path through the network is taken to indicate the most likely word
sequence corresponding to the input. While it might be feasable to
construct this network explicitly in some simple cases, it is likely
to get unmanageably large very quickly. For example, to recognise a
sequence of six digits there are a total of
11^{6} possible paths through the network
(0-9 plus `oh' repeated six times).

In order to implement anything like a real time speech
recogniser some shortcuts must be taken in constructing and
searching this space of alternatives. Earlier, we introduced the
Viterbi algorithm which used dynamic programming techniques to make
the search for a best path feasable in an HMM. Viterbi makes a
simplifying assumption (that the best path
*through* any node must include the best path
*up to* that node) which means that although it
finds an optimal path there might still be a better one. In doing
this though, it still evaluates every node in the network. Practical
search methods will need to remove many of the paths through the
network from consideration in order to find a good path in a
reasonable time; the result is that the best path might not be found
at all.

A useful reference here is a paper on Hierarchical Search for Large Vocabulary Conversational Speech Recognition by Deshmukh and others. This paper describes the search methods used in the ISIP recogniser in detail but begins with a useful summary of the problems of search in ASR and some of the common algorithms that are used.

The basic Viterbi search algorithm can be improved upon by removing some of the paths from consideration when evaluating the network. Viterbi is an example of a breadth-first search algorithm; all paths through the network are expanded at the same time (the alternative is depth-first search where one path is followed to completion before the next is evaluated). So, if there are three possible start words predicted by the language model, there will be three initial paths in the network; these three paths will then be extended into, say, nine paths of length two and so on. At each step, the Viterbi algorithm evaluates the cost of the best path to each leaf node before extending the paths again.

An extension of this algorithm is known as a *beam
search*; here some paths are removed from consideration
at each iteration of the algorithm. Instead of extending all paths
the algorithm only extends the best paths. The paths are rated on
their total probability and only those within some fixed distance
of the best path are considered for further growth.

An alternative to the breadth first Viterbi algorithm is a depth first method known as stack decoding. Figure 14.1 from the Deshmukh paper outlines the steps in the algorithm.

**Figure 14.1. Simple overview of the stack decoding algorithm from Deshmukh
et al.**

This algorithm extends only the single best path from the current set of paths under consideration until the end point has been reached. This requires being able to compare paths of different lengths which must be done carefully since the probabilities being compared will get smaller for longer paths. This problem is addressed by normalising path costs based on the number of feature vectors it accounts for.

Multi-pass search algorithms work by first using a low cost method to find promising paths through the network and then following this with a more detailed search of these paths. This is illustrated in Figure 14.2 taken from the Deshmukh paper.

**Figure 14.2. An example of the N-best list of hypotheses generated for a
simple utterance, and the resulting word graph with N equal to
20. Note that most of the paths are almost equally probable, and
only minor variants of each other in terms of segmentation. This
indicates the severity of the acoustic confusability in
spontaneous, conversational speech recognition (from Deshmukh
et al).**

Although I've talked as if the HMM network being searched consists of words we know that in a real application the word models will themselves be made up of phoneme based HMMs. This adds an extra level of complexity to the construction of the network. For each word in our lexicon we will have one or more pronunciations which must be inserted into the composite HMM.

In some ways this can simplify the network since we know that there are only around 50 unique phonemes that can start our utterance even though there may be 1000s of unique words. If we are clever in constructing the network, equivalent nodes can be unified so that the overall network becomes much simpler. This is illustrated in Figure 12.1 shown in an earlier chapter. In this model the network branches to the right only where different words diverge in their phonetic transcription. The endpoint of one of these branches defines a word that can be identified.

When the base level of HMMs are context sensitive triphones then the picture gets more complicated again since we can't merge quite as many nodes as in the phoneme based example above. The meat of the Deshmukh paper referenced earlier deals with defining efficient methods for constructing and searching such complex networks.

Although we've considered the HMM algorithms independantly, in a real recogniser the decoding process is intimately tied up with the use of a language model which makes predictions about word sequences. This search process is by far the most important and time consuming part of large vocabulary speech recognition and is where many of the current advances are being made in the field.

**Table of Contents**

- 15.1. The Problem
- 15.2. System Outline
- 15.3. Resources

The speaker identification and verification problem is to verify an individual based on a sample of speech. In speaker verification we are given a claim of identity and must verify that the claim is true; in speaker identification we must choose which speaker from a closed set produced the speech sample. These problems are obviously related but have different decision criteria and different requirements for speaker modelling.

We have already covered all of the technical prerequisites for speaker identification/verification. You may be able to work out that the basic process is to build a model of the speech of each individual using some statistical technique and match incoming speech samples with these models. The questions are only what form the models take and how the decisions are made.

One question to address is what speech is to be used to
identify an individual. *Text dependant*
identification uses a known utterance, for example a password or
combination digit sequence, which is predefined for each
speaker. *Text independant* identification uses a
different response for each identification attempt. In this latter
case, the input could be the answer to a question or might just be
some speech sampled from a conversation with the user. Text
dependant identification is a simpler problem as the speaker model
can be tuned to the phonetic content of the input text. Where any
input is allowed, a more general model is required since the
phonetic content can vary significantly.

The structure of a speaker recogniser is similar to that of a speech recogniser in that the flow of control is divided into parameterisation, pattern matching and decision making.

In speech recognition we were concerned with retaining the parts of the speech signal that conveyed information about the phonetic content of the signal. Information about the source was therefore removed since it contributes an independent information steam (pitch). In speaker recognition we wish to retain information about the speaker's identity; as it turns out the requirements are almost identical to those for speech recognition. We would like to ignore variations such as different pitch, speaking rate, environment or communication channel in the same way as we do for ASR. The end result is that the parameterisation used for speaker recognition is the same as that for ASR, typically Mel frequency cepstral coefficients (MFCC) (Reynolds, D. A., Experimental Evaluation of Features for Robust Speaker Identification, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 3, pp. 639-643, October 1994.). As for robust speech recognition applications, compensation for noise and variations in communication channel become very important for telephone based applications.

In speech recognition, we build models for the words or phonemes that we wanted to recognise. In speaker recognition, we need a model of the acoustic properties of the speaker's voice. Our requirement is the same in each case, we need to be able to compare some unknown speech with the model to produce a distance or probability measure for that speaker.

While the temporal nature of speech signals is vitally important in speech recognition, it is not necessarily so important in a speaker model. A simple and effective speaker model can be made by building a probabilistic model of the distribution of input vectors for a speaker. For example, we might build a Gaussian model from the MFCC vectors corresponding to 10 seconds of each speakers speech. A simple Gaussian model is unlikely to be appropriate since the MFCC data is unlikely to follow a simple normal distribution; more appropriate is a mixture of Gaussians as commonly used within HMM states or even a neural network based model. The distance score for a section of input speech is then computed from the product of the probability densities for each input vector.

Earlier speaker verification systems used a vector quantisation codebook as the speaker model. A codebook was built for each speaker and the distance to an unknown sample was estimated from the VQ distortion for each codebook: that is, the input was encoded with each codebook and the difference between the original speech and the codebook cluster centers was measured. This can be seen as a simple form of the mixture of Gaussian model which encodes each speaker as a set of cluster centers, ignoring the covariance for each cluster.

If the application is closed set speaker identification then the decision making process is similar to that in speech recognition: we select the candidate model with the smallest distance or largest probability measure. In speaker verification applications however, the decision as to whether to accept the claim of identity is more complicated since this is an open set problem.

The simplest solution to the verification problem is to set a threshold distance and accept the claim if the distance to the claimed model is below the threshold. If this is done, the security of the system can be varied by changing the threshold. In a secure system, the threshold is set very low which should result in fewer false acceptance (FA) errors and more false rejections (FR). In a less secure but more convenient system, the threshold is set higher resulting in more FA errors and fewer FR errors. This is illustrated in the DET curve (Figure 15.1) which plots the false accept rate vs. the false reject rate for different values of the threshold. A point on the error curve where the two error rates are equal is called the Equal Error Rate (EER) and is often quoted as an overall measure of a system's performance.

**Figure 15.1. Example DET Curve taken from Reynolds
and Heck**

The simple threshold solution does not take into account the place of one speaker model in the population. It might be the case that a speaker has many similar speakers, in this case a high threshold might accept the other known speakers as well as the target speaker. One common option is to build a cohort of similar speakers for each known speaker and measure the distance from the members of the cohort as well as the claimed speaker model. To be accepted into this kind of system the input must be closer to the claimed model than it is to any cohort member by some proportion. In this way the threshold is adjusted based on the occupancy of the speaker space: where the space is crowded the threshold is tightened, where it is sparse the threshold can be relaxed.

There are a number of papers available on the web which provide a useful overview of the speaker identification problem.

The proceedings of a Symposium on Humans, Computers and Speech includes slides from a talk by Doug Reynolds and Larry Heck on Automatic Speaker Recognition. This covers the major points although it's more of an executive summary than an academic piece.

Another useful resource is a special issue of the journal
Digital Signal Processing (Vol. 10,
No. 1/2/3, January/April/July 2000) which covers the 1999
NIST evaluation of speaker verification technology. This can be
accessed from Macquarie via the library web page (the link
above). Distance students can get access to some of the papers in
the directory
`ftp://user@www.shlrc.mq.edu.au/home/masters/papers`
where `user` is your web server login name. I
can't put the papers up for more general access because of copyright
issues. If you'd like to get access to a paper that I haven't copied
over, please let me know and I will do so. The Martin paper in
particular is interesting as it gives an overview of the performance
of the different systems in the NIST evaluation.