SST 2002

Tutorial Day
Monday 2nd December 2002
Physics Building, The University of Melbourne
SST-2002
The Ninth Australian International Conference on Speech Science and Technology
2nd-5th December 2002 at the University of Melbourne
Tutorial Presentations
| Time | Stream A | Stream B |
|---|---|---|
| 9.30am - 11.30am | Helen Fraser, Representing speech in theory and practice | Richard Cox, Spoken Natural language technology: applications and challenges |
| 12.30pm - 2.30pm | Steven Bird, Annotation graphs: theory and applications | Rob Brennan, Real-time speech processing applications |
| 3.00pm - 5.00pm | Marija Tabain, Articulatory phonetics | Chris Davis, Audio-Visual speech processing |
Steven Bird
University of Melbourne
Annotation graphs: theory and applications
Annotated corpora have
been a critical component of research in the speech and language sciences for
some years. Today, these corpora are being created and deployed for a rapidly
expanding set of languages, disciplines and technologies. A wealth of formats
and tools have sprung up around this enterprise, many of which are documented
on the Linguistic Annotation page: [http://www.ldc.upenn.edu/annotation/].
"Linguistic annotation" is a term which covers any descriptive or analytic notations
applied to raw language data. The basic data may be in the form of time functions
- audio, video and/or physiological recordings - or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
"named entity" identification, co-reference annotation, and so on.
This tutorial will focus on a model of linguistic annotation which provides a simple framework for representing and manipulating complex, heterogeneous, multi-layered annotations. The model uses directed acyclic graphs having labels on the edges and time-offsets on the nodes, so-called annotation graphs. The tutorial will cover the formalism, the software infrastructure, and practical applications. Participants will learn about the steps involved in building their own special-purpose annotation tools.
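As a rough illustration of the model described above, the following minimal sketch stores an annotation graph as labelled arcs between time-stamped nodes. The class names, the `tier` field and the example words are assumptions made for illustration and are not taken from the tutorial's software; the sketch also assumes that every node carries a time offset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A graph node anchored at a time offset (seconds)."""
    name: str
    offset: float


@dataclass(frozen=True)
class Arc:
    """A labelled edge; `tier` names the annotation layer (hypothetical field)."""
    src: Node
    dst: Node
    tier: str
    label: str


@dataclass
class AnnotationGraph:
    arcs: list = field(default_factory=list)

    def add(self, src, dst, tier, label):
        # Arcs must run forwards in time, which keeps the graph acyclic.
        if src.offset >= dst.offset:
            raise ValueError("arc must run forwards in time")
        self.arcs.append(Arc(src, dst, tier, label))

    def tier(self, name):
        """Labels on one tier, ordered by start time."""
        arcs = [a for a in self.arcs if a.tier == name]
        return [a.label for a in sorted(arcs, key=lambda a: a.src.offset)]

    def overlapping(self, start, end):
        """Arcs whose time span intersects the interval [start, end)."""
        return [a for a in self.arcs
                if a.src.offset < end and a.dst.offset > start]


# Two layers over the same timeline: words and a speaker turn.
n0, n1, n2 = Node("n0", 0.00), Node("n1", 0.32), Node("n2", 0.71)
graph = AnnotationGraph()
graph.add(n0, n1, "word", "hello")
graph.add(n1, n2, "word", "world")
graph.add(n0, n2, "speaker", "A")

words = graph.tier("word")
early = [a.label for a in graph.overlapping(0.0, 0.1)]
```

Because each layer is just a set of arcs over shared nodes, heterogeneous annotations can be added without changing the structure, and time-based queries work across all layers at once.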
As we create new language resources, such as annotated corpora and the associated annotation software, there needs to be a standard way to describe them so that they can be found and re-used by others. The tutorial will cover a new framework for sharing language resources, the Open Language Archives Community [http://www.language-archives.org/].
Topics:
Richard V. Cox
AT&T Labs - Research
Spoken Natural Language Technology, Applications and Challenges
It is 2002 and no one yet speaks to a HAL 9000-like computer in the same way that the astronauts did in 2001: A Space Odyssey. Nevertheless, significant progress has been made in many areas, and we are beginning to see the first spoken natural language interfaces for commercial products and services. These interfaces can be purely spoken, e.g. for telephony, or can be multimodal, e.g. for computers or handheld devices. The goals of this tutorial are:
Helen Fraser
Senior Lecturer in Linguistics, University of New England, Australia
Representing Speech in Theory and Practice
In this tutorial we will look at the question of how best to represent speech, whether 'externally', in transcription, orthography, diagrams or verbal descriptions, or 'internally', in models of perception and production.
Of course we will not come up with one single good-for-everything solution to the problem of how to represent speech! Many representations are available to us, and all of them have value when used appropriately. Rather, we will accept the fact that different contexts, purposes and audiences require different types of representation. This will allow us to consider some principles according to which the most appropriate type for a particular context can be chosen, and then used consistently.
The focus will be fairly practical, as we consider how to apply the principles in a range of phonetic applications (bring your own issues to discuss with the group), but we will also consider implications of the principles for phonetic and phonological theory.
Marija Tabain
Speech, Hearing and Language Research Centre, & Macquarie Centre for Cognitive Science, Division of Linguistics and Psychology, Macquarie University, Australia
Articulatory phonetics
An introduction to electropalatography and electromagnetic articulography
This tutorial provides an introduction to two of the most commonly used tools in articulatory phonetics: electropalatography (EPG) and electromagnetic midsagittal articulography (EMMA). These two techniques provide very different views of supralaryngeal articulation: EPG reports contact between the tongue and the entire hard palate, while EMMA returns movement trajectories for selected points on the tongue, lips or jaw in the midsagittal plane. Emphasis in this tutorial will be on the advantages and limits of each technique; the sorts of studies for which each technique is best suited; and on ways of quantifying the data that are returned by each system. An overview of how each system works will be provided, as well as examples of returned data (with their concomitant problems!). This tutorial will be aimed at people who are considering collecting EPG or EMMA data themselves, and at people who would like to be able to better interpret EPG and EMMA data when it is presented in talks or papers.
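As a rough illustration of what quantifying EPG data can involve, the sketch below treats one EPG frame as a grid of 0/1 contact values and computes three simple summaries. The 4x4 grid, the function names and the choice of indices are assumptions made for illustration; real palates have their own electrode layouts, and the indices used in practice are defined more carefully.

```python
def contact_fraction(frame):
    """Fraction of electrodes contacted in one frame (grid of 0/1 values)."""
    cells = [c for row in frame for c in row]
    return sum(cells) / len(cells)


def row_totals(frame):
    """Number of contacted electrodes in each row, front row first."""
    return [sum(row) for row in frame]


def centre_of_gravity(frame):
    """Contact-weighted mean row index (0 = front row); None if no contact."""
    totals = row_totals(frame)
    n = sum(totals)
    if n == 0:
        return None
    return sum(i * t for i, t in enumerate(totals)) / n


# Hypothetical frame: contact concentrated towards the front of the palate.
frame = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

fraction = contact_fraction(frame)
totals = row_totals(frame)
cog = centre_of_gravity(frame)
```

Summaries of this kind reduce each frame to a few numbers, so that contact patterns can be tracked over time or compared across tokens.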
Rob Brennan
Real-time speech processing applications
Filterbank (multi-rate) analysis and synthesis strategies prove advantageous in many signal processing areas. They operate as a divide-and-conquer strategy, breaking a difficult problem into an equivalent series of much simpler problems. For example, large convolutional systems encountered in applications such as echo cancellation and feedback cancellation may require a large number of filter taps. Using the filterbank technique, such a system may equivalently be implemented as a parallel combination of much shorter subband filters. When properly designed, the filterbank subband signals overlap minimally in frequency, yielding signals that are approximately orthogonal to each other. Lately, digital filterbank techniques, with their great precision, have enabled many strategies to be implemented that were difficult or impractical with analog structures. Accordingly, much theory has been developed, including the so-called perfect reconstruction filterbank.
An oversampled DFT filterbank using WOLA (weighted overlap-add) processing provides an extremely efficient and elegant solution. This tutorial will describe an implementation of this filterbank within a dedicated ASIC, along with algorithmic procedures for casting many algorithms into a multi-rate framework.
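The general idea of an oversampled DFT filterbank in overlap-add form can be sketched in a few lines. This is a simplified illustration, not the ASIC implementation covered in the tutorial: the band count, hop size, Hann window and test signal are all assumptions, a plain O(N^2) DFT stands in for an FFT, and the window is no longer than the transform, so the time-folding step of a general WOLA structure is not shown.

```python
import cmath
import math


def dft(frame, inverse=False):
    """Plain O(N^2) DFT; adequate for a small illustrative band count."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def hann(n):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]


def analyse(x, n_bands, hop):
    """Window each frame and transform it: one complex sample per band."""
    w = hann(n_bands)
    frames = []
    for start in range(0, len(x) - n_bands + 1, hop):
        seg = [x[start + t] * w[t] for t in range(n_bands)]
        frames.append(dft(seg))
    return frames


def synthesise(frames, n_bands, hop, length):
    """Inverse-transform, weight and overlap-add the frames."""
    w = hann(n_bands)
    y = [0.0] * length
    norm = [0.0] * length          # accumulated analysis*synthesis weighting
    for i, spec in enumerate(frames):
        seg = dft(spec, inverse=True)
        for t in range(n_bands):
            y[i * hop + t] += seg[t].real * w[t]
            norm[i * hop + t] += w[t] * w[t]
    return [v / g if g > 1e-12 else 0.0 for v, g in zip(y, norm)]


N_BANDS, HOP = 8, 2                # oversampled by N_BANDS / HOP = 4
pad = [0.0] * N_BANDS              # padding so every sample is fully covered
signal = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(64)]
x = pad + signal + pad

frames = analyse(x, N_BANDS, HOP)
y = synthesise(frames, N_BANDS, HOP, len(x))
recovered = y[N_BANDS:-N_BANDS]
max_err = max(abs(a - b) for a, b in zip(signal, recovered))
```

With the subband samples left unmodified, the output matches the input to within rounding error. In an application, the processing of interest would be applied to `frames` between the analysis and synthesis stages.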
Numerous advantages are obtained using the multi-rate framework:
1) Adaptive filtering techniques typified by the LMS algorithm are greatly affected by the eigenvalue spread problem. In short, the LMS algorithm stalls when the autocorrelation matrix of the input signal has a large ratio between its maximum and minimum eigenvalues. This happens when the signal is distinctly non-white, which includes many useful signal classes such as speech. The subband approach significantly reduces the coloration by representing the original spectrum as a parallel combination of much whiter subband signals. The original coloration is largely captured by the inherent scaling of the subbands.
2) Subbands may be adapted separately, as a result of their orthogonality. This proves to be an advantage when processing power is limited, since only a subset of the subbands need be updated at any one time. This tradeoff does not exist in the fullband LMS system, where all taps must be adapted all the time.
3) Enhanced tracking ability: Each subband may be adapted with a separate convergence factor. This is useful in applications where narrowband disturbances exist. Only the affected subbands need be adapted, which helps to concentrate resources in these subbands and/or reduce power consumption.
4) Filtering complexity reduction: The complexity of the filtering operation is greatly reduced by converting intensive time-domain convolutions to relatively short frequency-domain convolutions. In certain cases, the filtering in each parallel path may be reduced to multiplication by a single (possibly complex) value.
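The eigenvalue spread problem in point 1 can be illustrated with a small numerical sketch. It assumes a two-tap filter and a unit-power input whose lag-one correlation is `rho` (both hypothetical choices), and it iterates the mean weight-error recursion of LMS, v <- (I - mu R) v, rather than the stochastic algorithm itself.

```python
import math


def eigenvalues_2tap(rho):
    """Eigenvalues of the input autocorrelation matrix [[1, rho], [rho, 1]]."""
    return 1.0 + abs(rho), 1.0 - abs(rho)


def iterations_to_converge(rho, tol=1e-3, max_iter=100000):
    """Count steps until the mean weight error falls below tol.

    The step size is set to 1 / lambda_max, so the slowest mode decays by
    a factor (1 - lambda_min / lambda_max) per iteration.
    """
    lam_max, _ = eigenvalues_2tap(rho)
    mu = 1.0 / lam_max
    v = [1.0, 0.0]                      # initial weight error (assumed)
    for n in range(1, max_iter + 1):
        rv = [v[0] + rho * v[1], rho * v[0] + v[1]]      # R applied to v
        v = [v[0] - mu * rv[0], v[1] - mu * rv[1]]
        if math.hypot(v[0], v[1]) < tol:
            return n
    return max_iter


lam_max, lam_min = eigenvalues_2tap(0.95)
spread = lam_max / lam_min              # large for a strongly coloured input
white_steps = iterations_to_converge(0.0)
coloured_steps = iterations_to_converge(0.95)
```

For the white input (`rho = 0`) the eigenvalues are equal and the error vanishes almost immediately, whereas the coloured input needs far more iterations. Whitening the signal within each subband moves the adaptation towards the first case.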
As mentioned previously, filterbanks are used in many important applications; many more than can be listed here. Typical applications are:
1) Coding applications: In the encoder, the input signal is passed through an analysis filterbank, after which each subband is quantized (coded) with a precision dependent on a psycho-acoustic model. This model is selected to code only the perceptually significant portions of the input signal to reduce the overall bit-rate. The synthesis filterbank in the decoder then reproduces an approximation of the input signal by means of this coded digital stream. The combination of the analysis filterbank (encoder) and the synthesis filterbank (decoder) may be designed to possess the perfect reconstruction property or the approximate reconstruction property (pseudo-QMF), depending on fidelity and delay requirements. In this application, the quantization noise may be roughly classified as additive subband noise. The synthesis filterbank performs double duty by synthesizing the output signal while rejecting any generated quantization noise which is out-of-band.
2) Adaptive filtering applications: Here, the filterbank is not intended to reconstruct the input signal directly, but to reconstruct it after modifications have been made to the analysis signal. Typically, the filterbank is being invoked to model a desired system (as in hearing aid applications) or to model an undesired or disturbing system in such a manner that the original disturbance may be cancelled (as in echo cancellation systems). These modifications may be scalar real multiplications, as in hearing aid applications, or may be scalar complex multiplications or vector complex multiplications in the case of echo cancellation. Since the modifications are multiplicative, different criteria for creating and using filterbanks in these applications must be developed as compared to coding applications.
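The scalar complex multiplication mentioned for echo cancellation can be sketched for a single subband as follows. The normalised LMS update, the step size and the synthetic signals are assumptions chosen for illustration; the sketch models the echo path in this band as one complex gain and adapts one coefficient to match it.

```python
import cmath


def nlms_single_tap(x, d, mu=0.5, eps=1e-12):
    """Adapt one complex coefficient w so that w * x[n] tracks d[n]."""
    w = 0j
    residual = []
    for xn, dn in zip(x, d):
        e = dn - w * xn                       # signal left after cancellation
        w += mu * e * xn.conjugate() / (abs(xn) ** 2 + eps)
        residual.append(e)
    return w, residual


# Hypothetical subband data: a rotating phasor as the far-end subband signal,
# and an echo path that reduces to a single complex gain in this band.
echo_gain = 0.6 * cmath.exp(0.8j)
far_end = [cmath.exp(0.4j * n) for n in range(40)]
microphone = [echo_gain * s for s in far_end]

w, residual = nlms_single_tap(far_end, microphone)
gain_error = abs(w - echo_gain)
```

In a full system one such coefficient (or a short vector of them) would be adapted in every subband, and each band could use its own step size.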
The WOLA filterbank structure is highly configurable and best performance is of course only achieved with an understanding of the optimizations and tradeoffs that can be made within its structure. This tutorial will describe how these optimizations should be made for typical applications.
Chris Davis
Department of Psychology,
School of Behavioural Science, The University of Melbourne, Australia
Audio-Visual speech processing
This tutorial will provide an overview and introduction to the area of Audio-Visual
(AV) speech processing. It will be suggested that understanding the range of
AV speech phenomena calls for a multidisciplinary approach; one that provides
the basis for more dynamic conceptions of speech properties. The emphasis of
the tutorial will be on showcasing the multimodal characteristics of speech
from a variety of perspectives. These will range from consideration of the physical
aspects of visible speech (outlining facial development and models of facial
structure and motion) to reviews of recent studies of AV speech perception.
Mention will also be made of the potential for AV applications in human-machine
interaction and in ASR with several different approaches being considered.
Tutorial Day Registration
AUD$40.00 per session
Conference Secretariat
Bronwen Hewitt
Conference Management
The University of Melbourne
Victoria, Australia, 3010
Phone: +61-3-8344-6389
Facsimile: +61-3-8344-6122
bhewitt@unimelb.edu.au