The MATE Markup Framework

Laila DYBKJÆR and Niels Ole BERNSEN
Natural Interactive Systems Laboratory, University of Southern Denmark
Science Park 10, 5230 Odense M, Denmark
[email protected], [email protected]

Abstract

Since early 1998, the European Telematics project MATE has worked towards facilitating re-use of annotated spoken language data, addressing theoretical issues and implementing practical solutions which could serve as standards in the field. The resulting MATE Workbench for corpus annotation is now available as licensed open source software. This paper describes the MATE markup framework which bridges between the theoretical and the practical activities of MATE and is proposed as a standard for the definition and representation of markup for spoken dialogue corpora. We also present early experience from use of the framework.

1. Introduction

Spoken language engineering products proliferate in the market, commercial and research applications constantly increasing in variety and sophistication. These developments generate a growing need for tools and standards which can help improve the quality and efficiency of product development and evaluation. In the case of spoken language dialogue systems (SLDSs), for instance, the need is obvious for standards and standard-based tools for spoken dialogue corpus annotation and automatic information extraction.

Information extraction from annotated corpora is used in SLDSs engineering for many different purposes. For several years, annotated speech corpora have been used to train and test speech recognisers. More recently, corpus-based approaches are being applied regularly to other levels of processing, such as syntax and dialogue. For instance, annotated corpora can be used to construct lexicons and grammars or train a grammar to acquire preferences for frequently used rules. Similarly, programs for dialogue act recognition and prediction tend to be based on annotated corpus data.
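As an illustration of how a grammar might acquire preferences for frequently used rules from annotated data, the following minimal sketch estimates rule preferences by relative frequency. It is not taken from MATE: the toy corpus, the rule inventory, and the names `ANNOTATED_UTTERANCES` and `rule_preferences` are all hypothetical, and relative-frequency estimation is only one simple way such preferences could be obtained.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption, not MATE data): each annotated
# utterance is listed as the grammar rules (lhs, rhs) used in its
# syntactic annotation.
ANNOTATED_UTTERANCES = [
    [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")), ("NP", ("N",))],
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V",))],
    [("S", ("VP",)), ("VP", ("V", "NP")), ("NP", ("Det", "N"))],
]


def rule_preferences(utterances):
    """Estimate P(rhs | lhs) as the relative frequency of each annotated rule."""
    # Count how often each rule occurs across all annotated utterances.
    rule_counts = Counter(rule for utterance in utterances for rule in utterance)

    # Total number of rule uses per left-hand-side category.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    # Normalise each rule count by the total for its left-hand side.
    return {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}


PREFS = rule_preferences(ANNOTATED_UTTERANCES)

# Print the rules grouped by left-hand side, most preferred first.
for (lhs, rhs), p in sorted(PREFS.items(), key=lambda item: (item[0][0], -item[1])):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

A parser could then prefer, among competing analyses, the one built from the more frequent rules. The quality of such estimates depends directly on the size and consistency of the annotated corpus.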
Evaluation of user-system interaction and dialogue success is also based on annotated corpus data. As SLDSs and other language products become more sophisticated, the demand will grow for corpora with multilevel and cross-level annotations, i.e. annotations which capture information in the raw data at several different conceptual levels or mark up phenomena which refer to more than one level. These developments will inevitably increase the demand for standard tools in support of the annotation process.

The production (recording, transcription, annotation, evaluation) of corpus data for spoken language applications continues to be time-consuming and costly. So is the construction of tools which facilitate annotation and information extraction. It is therefore desirable that already available annotated corpora and tools be used whenever possible. Re-use of annotated data and tools, however, confronts systems developers with numerous problems which basically derive from the lack of common standards. So far, language engineering projects usually have either developed the needed resources from scratch using homegrown formalisms and tools, or painstakingly adapted resources from previous projects to novel purposes.

In recent years, several projects have addressed annotation formats and tools in support of annotation and information extraction (for an overview, see http://www.ldc.upenn.edu/annotation/). Some projects have addressed the issue of markup standardisation from different perspectives. Examples are the Text Encoding Initiative (TEI) (http://www-tei.uic.edu/orgs/tei/ and http://etext.virginia.edu/TEI.html), the Corpus Encoding Standard (CES) (http://www.cs.vassar.edu/CES/), and the European Advisory Group for Language Engineering Standards (EAGLES) (http://www.ilc.pi.cnr.it/EAGLES96/home.html).
Whilst these initiatives have made good progress on written language and current coding practice, none of them have focused on the creation of standards and tools for cross-level spoken language corpus annotation. It is only recently that there has been a major effort in this domain. The project Multi-level Annotation Tools Engineering (MATE) (http://mate.nis.sdu.dk) was launched in March 1998 in response to the need for standards and tools in support of creating, annotating, evaluating and exploiting spoken language resources. The central idea of MATE has been to work on both annotation theory and practice in order to connect the two through a flexible framework which can ensure a common and user-friendly approach across annotation levels. On the tools side, this means that users are able to use level-independent tools and an interface representation which is independent of the internal coding file representation.

This paper presents the MATE markup framework and its use in the MATE Workbench. In the following, Section 2 briefly reviews the MATE approach to annotation and tools standardisation. Section 3 presents the MATE markup framework. Section 4 concludes the paper by reporting on early experiences with the practical use of the markup framework and discussing future work.

2. The MATE Approach

This section first briefly describes the creation of the MATE markup framework and a set of example best practice coding schemes in accordance with the markup framework. Then it describes how a toolbox (the MATE Workbench) has been implemented to support the markup framework by enabling annotation on the basis of any coding scheme expressed according to the framework.

2.1 Theory

The theoretical objectives of MATE were to specify a standard markup framework and to identify or, when necessary, develop a series of best practice coding schemes for implementation in the MATE Workbench.
To these ends, we began by collecting information on a large number of existing annotation schemes for the levels addressed in the project, i.e. prosody, (morpho-)syntax, co-reference, dialogue acts, communication problems, and cross-level issues. Cross-level issues are issues which relate to more than one annotation level. Thus, for instance, prosody may provide clues for a variety of phenomena in semantics and discourse.

The resulting report (Klein et al., 1998) describes more than 60 coding schemes, giving details per scheme on its coding book, the number of annotators who have worked with it, the number of annotated dialogues/segments/utterances, evaluation results, the underlying task, a list of annotated phenomena, and the markup language used. Annotation examples are

