SST 2002

Tutorial Day

Monday 2nd December 2002

Physics Building, The University of Melbourne

SST-2002

The Ninth Australian International Conference on Speech Science and Technology
2nd-5th December 2002 at the University of Melbourne


Tutorial Presentations


9.30am - 11.30am
Stream A: Helen Fraser, Representing speech in theory and practice
Stream B: Richard Cox, Spoken natural language technology: applications and challenges

12.30pm - 2.30pm
Stream A: Steven Bird, Annotation graphs: theory and applications
Stream B: Rob Brennan, Real-time speech processing applications

3.00pm - 5.00pm
Stream A: Marija Tabain, An introduction to electropalatography and electromagnetic articulography
Stream B: Chris Davis, Audio-Visual speech processing

 


Steven Bird
University of Melbourne

Annotation graphs: theory and applications

Annotated corpora have been a critical component of research in the speech and language sciences for some years. Today, these corpora are being created and deployed for a rapidly expanding set of languages, disciplines and technologies. A wealth of formats and tools have sprung up around this enterprise, many of which are documented on the Linguistic Annotation page: [http://www.ldc.upenn.edu/annotation/].
"Linguistic annotation" is a term which covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, "named entity" identification, co-reference annotation, and so on.

This tutorial will focus on a model of linguistic annotation which provides a simple framework for representing and manipulating complex, heterogeneous, multi-layered annotations. The model uses directed acyclic graphs having labels on the edges and time-offsets on the nodes, so-called annotation graphs. The tutorial will cover the formalism, the software infrastructure, and practical applications. Participants will learn about the steps involved in building their own special-purpose annotation tools.
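The graph model described above can be sketched in a few lines of Python. This is only an illustration, not the tutorial's software infrastructure: the class names, the layer names "word" and "phone", the example times and labels, and the requirement that every node carries an offset are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    ident: int
    offset: float  # time offset in seconds (assumed always present here)


@dataclass(frozen=True)
class Arc:
    src: Node
    dst: Node
    layer: str  # hypothetical layer name such as "word" or "phone"
    label: str


@dataclass
class AnnotationGraph:
    arcs: list = field(default_factory=list)

    def add(self, src, dst, layer, label):
        # Every arc must move forward in time, so no cycle can form.
        if dst.offset <= src.offset:
            raise ValueError("arc must end after it starts")
        self.arcs.append(Arc(src, dst, layer, label))

    def labels(self, layer):
        """Labels of one layer, ordered by start time."""
        chosen = sorted((a for a in self.arcs if a.layer == layer),
                        key=lambda a: a.src.offset)
        return [a.label for a in chosen]

    def within(self, outer, layer):
        """Labels of `layer` whose span lies inside the arc `outer`."""
        chosen = sorted((a for a in self.arcs
                         if a.layer == layer
                         and a.src.offset >= outer.src.offset
                         and a.dst.offset <= outer.dst.offset),
                        key=lambda a: a.src.offset)
        return [a.label for a in chosen]


# Two layers over the same stretch of speech: words and phones.
times = [0.00, 0.12, 0.30, 0.41, 0.55, 0.70]
n = [Node(i, t) for i, t in enumerate(times)]
g = AnnotationGraph()
g.add(n[0], n[2], "word", "he")
g.add(n[2], n[5], "word", "said")
for src, dst, phone in [(0, 1, "h"), (1, 2, "iy"), (2, 3, "s"),
                        (3, 4, "eh"), (4, 5, "d")]:
    g.add(n[src], n[dst], "phone", phone)

said = g.arcs[1]
print(g.labels("word"))         # ['he', 'said']
print(g.within(said, "phone"))  # ['s', 'eh', 'd']
```

Because the layers share nodes, a query such as "which phones fall inside this word" reduces to a comparison of time offsets, which is one reason a single graph can hold heterogeneous, multi-layered annotations.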

As we create new language resources, such as annotated corpora and the associated annotation software, there needs to be a standard way to describe them so that they can be found and re-used by others. The tutorial will cover a new framework for sharing language resources, the Open Language Archives Community [http://www.language-archives.org/].

Topics:



Richard V. Cox
AT&T Labs - Research

Spoken Natural Language Technology, Applications and Challenges

It is 2002 and no one yet speaks to a HAL 9000-like computer in the way that the astronauts did in 2001: A Space Odyssey. Nevertheless, significant progress has been made in many areas. We are beginning to see the first spoken natural language interfaces for commercial products and services. These interfaces can be purely spoken, e.g. for telephony, or can be multimodal, e.g. for computers or handheld devices. The goals of this tutorial are:

  1. Provide researchers and developers with a perspective on the speech processing technologies that make up voice-enabled and multimodal applications. These technologies include automatic speech recognition, text-to-speech synthesis, natural language understanding, dialogue management, and natural language generation for purely voice-enabled services. For multimodal services, additional technologies include simultaneous spoken and pen-based input (gestures and handwriting recognition) and visual text-to-speech synthesis for output.
  2. Review the applications and services that already exist today, either in commercial deployment or in laboratory prototypes. Primarily, these examples depend heavily on the context of the application or service, thus lessening the complexity of the task by restricting the domain.
  3. Address the technical challenges that must next be overcome. These include improvements in the component technologies such as greater robustness in speech recognition and greater naturalness in text-to-speech synthesis, and also include challenges in how to scale the expertise needed to build hundreds and thousands of spoken natural language interfaces for a multitude of businesses.
  4. Provide a vision of what will be possible in the next five years.

The tutorial will include audio and video examples and demonstrations to illustrate the current state of the art.



Helen Fraser
Senior Lecturer in Linguistics, University of New England, Australia

Representing Speech in Theory and Practice

In this tutorial we will look at the question of how best to represent speech, whether 'externally', in transcription, orthography, diagrams or verbal descriptions, or 'internally', in models of perception and production.

Of course we will not come up with one single good-for-everything solution to the problem of how to represent speech! Many representations are available to us, and all of them have value when used appropriately. Rather, we will accept the fact that different contexts, purposes and audiences require different types of representation. This will allow us to consider some principles according to which the most appropriate type for a particular context can be chosen, and then used consistently.

The focus will be fairly practical, as we consider how to apply the principles in a range of phonetic applications (bring your own issues to discuss with the group), but we will also consider implications of the principles for phonetic and phonological theory.



Marija Tabain
Speech, Hearing and Language Research Centre, & Macquarie Centre for Cognitive Science Division of Linguistics and Psychology, Macquarie University, Australia

Articulatory phonetics
An introduction to electropalatography and electromagnetic articulography

This tutorial provides an introduction to two of the most commonly used tools in articulatory phonetics: electropalatography (EPG) and electromagnetic midsagittal articulography (EMMA). These two techniques provide very different views of supralaryngeal articulation: EPG reports contact between the tongue and the entire hard palate, while EMMA returns movement trajectories for selected points on the tongue, lips or jaw in the midsagittal plane. Emphasis in this tutorial will be on the advantages and limits of each technique; the sorts of studies for which each technique is best suited; and on ways of quantifying the data that are returned by each system. An overview of how each system works will be provided, as well as examples of returned data (with their concomitant problems!). This tutorial will be aimed at people who are considering collecting EPG or EMMA data themselves, and at people who would like to be able to better interpret EPG and EMMA data when it is presented in talks or papers.



Rob Brennan

Real-time speech processing applications

Multi-rate Analysis and Synthesis Systems incorporating Modifications using a Highly Configurable WOLA Coprocessor

Filterbank (multi-rate) analysis and synthesis strategies prove advantageous in many signal processing areas, operating as a divide-and-conquer strategy that breaks a difficult problem into an equivalent series of much simpler problems. For example, large convolutional systems encountered in applications such as echo cancellation and feedback cancellation may require a large number of filter taps. Using the filterbank technique, such a system may equivalently be implemented as a parallel combination of much shorter subband filters. When properly designed, the filterbank subband signals overlap minimally in frequency, yielding signals that are approximately orthogonal to each other. Lately, digital filterbank techniques, with their great precision, have enabled many strategies to be implemented that were difficult or impractical with analog structures. Accordingly, much theory has been developed, including the so-called perfect reconstruction filterbank.

An oversampled DFT filterbank using WOLA (weighted overlap-add) processing provides an extremely efficient and elegant solution. This tutorial will describe this filterbank as implemented in a dedicated ASIC, together with algorithmic procedures for casting many algorithms into a multi-rate framework.
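The weighted overlap-add idea can be sketched in software as follows. This is a minimal illustration and not a description of the coprocessor: it assumes a window length equal to the number of bands, a square-root Hann window for both analysis and synthesis, a hop of half the transform size (oversampling by a factor of two), and a naive DFT in place of an FFT. The function names and the `gains` parameter are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    # Returns the real part only, which assumes a conjugate-symmetric spectrum.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def wola_process(signal, n_bands=16, gains=None):
    """Analyse, scale each band by a real gain, and resynthesise."""
    hop = n_bands // 2
    # Square-root Hann: analysis window times synthesis window sums to one
    # across the two frames that overlap every sample.
    window = [math.sin(math.pi * i / n_bands) for i in range(n_bands)]
    gains = gains or [1.0] * n_bands
    # Pad so that the first and last samples are covered by two frames.
    padded = [0.0] * hop + list(signal) + [0.0] * n_bands
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - n_bands + 1, hop):
        frame = [padded[start + i] * window[i] for i in range(n_bands)]
        bands = dft(frame)                             # analysis
        bands = [g * b for g, b in zip(gains, bands)]  # subband modification
        block = idft(bands)                            # synthesis
        for i in range(n_bands):
            out[start + i] += block[i] * window[i]     # weighted overlap-add
    return out[hop:hop + len(signal)]


signal = [math.sin(0.3 * i) for i in range(64)]
rebuilt = wola_process(signal)
error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print(error < 1e-9)  # True: unit gains reconstruct the input
```

With unit gains the output matches the input to within rounding error. Passing a list of real gains (symmetric about the middle band, so that the output stays real) applies a separate scalar multiplication in each band.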

Numerous advantages are obtained using the multi-rate framework:

1) Adaptive filtering techniques typified by the LMS algorithm are greatly affected by the eigenvalue spread problem. In short, the LMS algorithm converges very slowly (stalls) when the autocorrelation matrix of the input signal has a large ratio between its maximum and minimum eigenvalues. This happens when the signal is distinctly non-white, which is the case for many useful signal classes such as speech. The subband approach significantly reduces the coloration by representing the original spectrum as a parallel combination of much whiter subband signals. The original coloration is largely captured by the inherent scaling of the subbands.

2) Subbands may be adapted separately, a consequence of their approximate orthogonality. This is an advantage when processing power is limited, because adaptation can be traded against computation by updating only some of the subbands at a time. This tradeoff does not exist in the fullband LMS system, where all taps must be adapted all the time.

3) Enhanced tracking ability: Each subband may be adapted with a separate convergence factor. This is useful in applications where narrowband disturbances exist. Only the affected subbands need be adapted, which helps to concentrate resources in these subbands and/or to reduce power consumption.

4) Filtering complexity reduction: The filtering operation complexity is greatly reduced by converting intensive time-domain convolutions to relatively short frequency-domain convolutions. In certain cases, the filtering in each parallel path may be reduced to multiplication by a single (possibly complex) value.
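The fullband LMS update referred to in point 1, and the eigenvalue spread that governs its convergence, can be sketched as follows. This is a minimal illustration under stated assumptions: a noise-free system identification task, white uniform input, a four-tap unknown system, and a two-tap autocorrelation matrix for the spread calculation. The function names and parameter values are hypothetical.

```python
import random


def lms_identify(unknown, n_samples=3000, mu=0.1, seed=0):
    """Adapt an FIR filter to match `unknown` using the LMS update."""
    rng = random.Random(seed)
    taps = len(unknown)
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent input sample first
    for _ in range(n_samples):
        history = [rng.uniform(-1.0, 1.0)] + history[:-1]
        desired = sum(h * x for h, x in zip(unknown, history))
        output = sum(w * x for w, x in zip(weights, history))
        error = desired - output
        # LMS update: step along the instantaneous gradient estimate.
        weights = [w + mu * error * x for w, x in zip(weights, history)]
    return weights


def eigenvalue_spread_2tap(r0, r1):
    """Spread for the 2x2 autocorrelation matrix [[r0, r1], [r1, r0]].

    Its eigenvalues are r0 + r1 and r0 - r1.
    """
    return (r0 + abs(r1)) / (r0 - abs(r1))


target = [0.5, -0.3, 0.2, 0.1]
estimate = lms_identify(target)
print(max(abs(w - h) for w, h in zip(estimate, target)) < 1e-6)  # True
print(eigenvalue_spread_2tap(1.0, 0.0))  # 1.0: white input
print(eigenvalue_spread_2tap(1.0, 0.9))  # about 19: strongly coloured input
```

With white input the spread is one and the weights converge quickly. As the correlation between neighbouring samples approaches one, the spread grows without bound, which is the situation the subband decomposition is intended to relieve.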

As mentioned previously, filterbanks are used in many important applications; many more than can be listed here. Typical applications are:

1) Coding applications: In the encoder, the input signal is passed through an analysis filterbank, after which each subband is quantized (coded) with a precision dependent on a psycho-acoustic model. This model is selected to code only the perceptually significant portions of the input signal in order to reduce the overall bit-rate. The synthesis filterbank in the decoder then reproduces an approximation of the input signal from this coded digital stream. The combination of the analysis filterbank (encoder) and the synthesis filterbank (decoder) may be designed to possess the perfect reconstruction property or the approximate reconstruction property (pseudo-QMF), depending on fidelity and delay requirements. In this application, the quantization noise may be roughly classified as additive subband noise. The synthesis filterbank performs double duty by synthesizing the output signal while rejecting any generated quantization noise that is out-of-band.

2) Adaptive filtering applications: Here, the filterbank is not intended to reconstruct the input signal directly, but to reconstruct it after modifications have been made to the analysis signals. Typically, the filterbank is invoked to model a desired system (as in hearing aid applications) or to model an undesired or disturbing system in such a manner that the original disturbance may be cancelled (as in echo cancellation systems). These modifications may be scalar real multiplications, as in hearing aid applications, or scalar complex multiplications or vector complex multiplications, as in echo cancellation. Since the modifications are multiplicative, the criteria for creating and using filterbanks in these applications differ from those used in coding applications.

The WOLA filterbank structure is highly configurable and best performance is of course only achieved with an understanding of the optimizations and tradeoffs that can be made within its structure. This tutorial will describe how these optimizations should be made for typical applications.



Chris Davis
Department of Psychology, School of Behavioural Science, The University of Melbourne, Australia

Audio-Visual speech processing
This tutorial will provide an overview of, and introduction to, the area of Audio-Visual (AV) speech processing. It will be suggested that understanding the range of AV speech phenomena calls for a multidisciplinary approach, one that provides the basis for more dynamic conceptions of speech properties. The emphasis of the tutorial will be on showcasing the multimodal characteristics of speech from a variety of perspectives. These will range from consideration of the physical aspects of visible speech (outlining facial development and models of facial structure and motion) to reviews of recent studies of AV speech perception. Mention will also be made of the potential for AV applications in human-machine interaction and in ASR, with several different approaches being considered.



 

Tutorial Day Registration

AUD$40.00 per session



Conference Secretariat

Bronwen Hewitt
Conference Management
The University of Melbourne
Victoria, Australia, 3010
Phone: +61-3-8344-6389
Facsimile: +61-3-8344-6122
bhewitt@unimelb.edu.au