| (80 intermediate revisions by 5 users not shown) |
| Line 1: |
| − | ________________________________________
| + | ==Linguistic Annotation Wiki== |
| − | '''Linguistic Annotation'''
| + | |
| − | ________________________________________
| + | |
| − | This page describes tools and formats for creating and managing ''linguistic annotations''. `Linguistic annotation<nowiki>‘</nowiki> covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, "named entity" identification, co-reference annotation, and so on. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. This page began as a set of links to systems for speech annotation, and the coverage of textual annotation is still inadequate.
| + | |
| − | This page is no longer being actively maintained.
| + | |
| − | '''Related pages:''' [<u>Open Language Archives Community-http://www.language-archives.org/]</u>, [<u>Linguistic Exploration-http://www.ldc.upenn.edu/exploration/]</u>, [<u>Gesture Annotation-http://www.ldc.upenn.edu/annotation/gesture/]</u>, [<u>Italian version of this page by Piero Cosi-http://nts.csrf.pd.cnr.it/biblos/annotazione-linguistica.htm]</u>, [<u>Speech Annotation and Corpus Tools-http://www.ldc.upenn.edu/annotation/specom.html]</u>
| + | |
| − | This page is the home of the [<u>COCOSDA-http://www.atr.co.jp/slt/cocosda/]</u> technical topic domain [<u>Corpus Annotation Tools-http://www.atr.co.jp/slt/cocosda/td_cat.html]</u>.
| + | |
| − | This page has been prepared in conjunction with our research on the logical structure of linguistic annotation, based on [<u>annotation graphs-http://www.ldc.upenn.edu/AG/]</u>.
| + | |
| − | <u>[IRCS Workshop on Linguistic Databases <nowiki>[</nowiki>Dec 2001<nowiki>]</nowiki>-http://www.ldc.upenn.edu/annotation/database/]</u>
| + | |
| − | Recently added links and updated descriptions are marked with a '''<nowiki>*</nowiki>'''.
| + | |
| − | {|
| + | |
| − | |- [<u>Steven Bird, Mark Liberman-mailto:sb@ldc.upenn.edu,myl@ldc.upenn.edu]</u>, [<u>LDC-http://www.ldc.upenn.edu/]</u>
| + | |
| − | |-
| + | |
| − | |}
| + | |
| − | ________________________________________
| + | |
| − | {|
| + | |
| − | |
| + | |
| − | |-
| + | |
| − | |Index
| + | |
| − | |-
| + | |
| − | |<u>[Alembic-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[ATLAS-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[CA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[CES-http://www.ldc.upenn.edu/annotation/]</u>||<u>[CHILDES-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[CLinkA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[COCOSDA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[CSAE-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[CSLU-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[DAISY-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[DAMSL-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Delta-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[DRI-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[EAGLES-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Emu-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Festival-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[FSA<nowiki>‘</nowiki>s-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[GATE-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Gsearch-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[HIAT-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Hyperlex-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Intex-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[ISIP-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[ISLE-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[LACITO-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[LDC-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[LT XML-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[MATE-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[MICASE-http://www.ldc.upenn.edu/annotation/]</u>||<u>[MPEG-http://www.ldc.upenn.edu/annotation/]</u>||<u>[MPI-http://www.ldc.upenn.edu/annotation/]</u>||<u>[Multitext-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[NEGRA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[NITE-http://www.ldc.upenn.edu/annotation/]</u>||<u>[Observer-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Partitur-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Praat-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SABLE-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SAMPA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SGREP-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[SignStream-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SIL-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SLAM-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SMDL-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SNACK-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[SUSANNE-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[TalkBank-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[TASX-http://www.ldc.upenn.edu/annotation/]</u>'''<nowiki>*</nowiki>'''
| + | |
| − | |-
| + | |
| − | |<u>[TEI-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Tipster-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Transcriber-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[TransTool-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Treebank-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[TSNLP-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[TUSNELDA-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[Unicode-http://www.ldc.upenn.edu/annotation/]</u>
| + | |
| − | |-
| + | |
| − | |<u>[Verbmobil-http://www.ldc.upenn.edu/annotation/]</u> ||<u>[vPrism-http://www.ldc.upenn.edu/annotation/]</u> ||||||||||||
| + | |
| − | |-
| + | |
| − | |}
| + | |
| − | ________________________________________
| + | |
| − | {|
| + | |
| − | '''Key'''
| + | |
| − | '''F:'''|a systematically-documented annotation ''format''
| + | |
| − | '''T:'''|an available ''tool'' for creation, display or search
| + | |
| − | '''D:'''|a tool is ''downloadable''
| + | |
| − | '''P:'''|there is a citeable ''paper'' which documents the format/system
| + | |
| − | '''R:'''|other kinds of ''resource'', such as books and associations
| + | |
| − | '''C:'''|methods and standards for transcribing ''content''
| + | |
| − | '''<nowiki>[</nowiki>U/W/M<nowiki>]</nowiki>:'''|tools run on Unix/Windows/Macintosh
| + | |
| | | | |
| − | |}
| + | This wiki describes tools and formats for creating and managing ''linguistic annotations''. "Linguistic annotation" covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, "named entity" identification, co-reference annotation, and so on. The focus is on tools which have been widely used for constructing annotated linguistic databases, and on the formats commonly adopted by such tools and databases. |
| − | {|
| + | |
| − | |'''Linguistic Resources'''
| + | This wiki is based on these webpages: |
| − | |-
| + | *[http://www.ldc.upenn.edu/annotation/ Linguistic Annotation] (Steven Bird and Mark Liberman) |
| − | |'''DT
| + | *[http://www.ldc.upenn.edu/annotation/gesture/ Gesture Annotation] (Craig Martell) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[Alembic Workbench-http://www.mitre.org/technology/alembic-workbench] ([David Day-mailto:day@mitre.org])'''
| + |
| − | Alembic Workbench is an SGML-based annotation system. Apart from the usual kinds of textual annotations, the workbench enables various kinds of specialized annotations including co-reference annotations (cf. the Message Understanding Conference markup rules), various kinds of user-defined inter-tag pointers, and (shortly) general template annotation (aka relations, frames, or events). The Alembic multi-lingual NLP system provides access to taggers for a wide variety of extraction levels, and applications have now been built for several languages. The software has a sophisticated visualisation component. It runs on Sun workstations and is freely distributed.
| + | These pages are no longer maintained; their content is used here with permission. |
| − | |-
| + | |
| − | |'''FTD'''||'''[LACITO Linguistic Data Archiving Project -http://195.83.92.32/presentation/index.html.en]([Boyd Michailovsky, John B. Lowe, Michel Jacobson-mailto:boydm@vjf.cnrs.fr,jblowe@socrates.berkeley.edu,jacobson@idf.ext.jussieu.fr])'''
| + | Related pages: |
| − | Projet Archivage, based at LACITO in Paris, aims to provide tools and formats for linguistic and anthropological field data. An interesting feature is the use of [<u>XML-http://www.w3.org/XML]</u> markup, with a DTD that supports transcriptions, phrasal and word-by-word interlinear translations, and audio references. Some [<u>XSL-http://www.w3.org/Style/XSL/]</u> style sheets are provided that illustrate the potential power of XML markup to support web browsing for material of this type, giving access to text and sound.
| + | *[http://www.language-archives.org/ Open Language Archives Community] |
| − | |-
| + | *[http://www.ldc.upenn.edu/exploration/ Linguistic Exploration] |
| − | |'''FP'''||'''[ATLAS - Architecture and Tools for Linguistic Analysis Systems-http://www.nist.gov/speech/atlas/] ([ Steven Bird, David Day, John Garofolo-mailto:sb@ldc.upenn.edu,day@mitre.org,john.garofolo@nist.gov])'''
| + | |
| − | ATLAS is a joint initiative of [<u>NIST-http://www.nist.gov/]</u>, [<u>MITRE-http://www.mitre.org/]</u> and the [<u>LDC-http://www.ldc.upenn.edu/]</u> to build a general purpose annotation architecture and a data interchange format. The starting point is the [<u>annotation graph-http://www.ldc.upenn.edu/sb/home/publications.html]</u> model, with some significant generalizations. An [<u>LREC paper-http://www.ldc.upenn.edu/sb/home/publications.html]</u> describes the model.
| + | ---- |
| − | |-
| + | For quicker reference, a separate page lists only the transcription and annotation [[Tools]]. |
| − | |'''P'''||'''CA - Conversational Analysis '''
| + | |
| − | This page of [<u>transcriptions-http://www.sscnet.ucla.edu/soc/faculty/schegloff/prosody/]</u> by [<u>Emanuel Schegloff-mailto:schegloff@soc.ucla.edu]</u> exemplifies the style of transcription traditional among researchers working on conversational analysis.
| + | '''A''' |
| − | |-
| + | *[[Alembic Workbench]] (DT/U,W) |
| − | |'''FC'''||'''[CES-http://www.cs.vassar.edu/CES/] ([ Nancy Ide, Greg Priest-Dorman,Patrice Bonhomme-mailto:ide@cs.vassar.edu,priestdo@cs.vassar.edu,bonhomme@loria.fr])'''
| + | *[[Annotation Graph Toolkit (AGTK)]] (TDP) |
| − | The Corpus Encoding Standard (CES) is a part of the [<u>EAGLES Guidelines-http://www.ilc.pi.cnr.it/EAGLES]</u> developed for language engineering research and applications. CES is an SGML-based, [<u>TEI-http://www.ldc.upenn.edu/annotation/]</u>-conformant specification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora. A section of CES on speech annotation (part 6) is under construction. Projects using CES are listed [<u>here-http://www.cs.vassar.edu/CES/CES-P.html]</u>. An XML version of CES called [<u>XCES-http://www.cs.vassar.edu/XCES/]</u> is under development.
| + | *[[ANNIS]] |
| − | |-
| + | *[[annotate]] (TD) |
| − | |'''FTDPRC
| + | *[[Anvil]] (TP) |
| − | <nowiki>[</nowiki>WM<nowiki>]</nowiki>'''||'''[CHILDES-http://childes.psy.cmu.edu/] ([Brian MacWhinney, Steven Gillis-mailto:macw@cmu.edu,gillis@uia.ua.ac.be])'''
| + | *[[ATLAS]] (FP) |
| − | The CHILDES project provides a large database of first and second language acquisition data from over 30 languages in a constant format, called CHAT. There are also programs for Windows and Macintosh that permit analysis of this database as well as alignment of text to speech and video.
| + | '''C''' |
| − | |-
| + | *[[CA]] (P) |
| − | |'''T'''||'''[CLinkA: Coreferential Links Annotator-http://www.wlv.ac.uk/sles/compling/software/CLinkA.html] ([Constantin Orasan-mailto:in6093@wlv.ac.uk])'''
| + | *[[Callisto]] (TD/W,U,M) |
| − | CLinkA is a tool for annotating coreference, written in Java. It is one of several tools for working on anaphora being developed by the [<u>Computational Linguistics Group at Wolverhampton-http://www.wlv.ac.uk/sles/compling/]</u>
| + | *[[C-BAS]] (T/W) |
| − | |-
| + | *[[CES]] (FC) |
| − | |'''R'''||'''[COCOSDA-http://www.atr.co.jp/slt/cocosda/] ([Nick Campbell-mailto:nick@atr.co.jp])'''
| + | *[[CHILDES]] (FTDPRC/W,M) |
| − | The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output, COCOSDA, has been established to encourage and promote international interaction and cooperation in the foundation areas of Spoken Language Processing.
| + | *[[CLinkA]] (T) |
| − | |-
| + | *[[COCOSDA]] (R) |
| − | |'''TDC
| + | *[[CSAE]] (TDC/W) |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[CSAE-http://www.linguistics.ucsb.edu/research/sbcorpus/default.htm] ([John W. DuBois-mailto:dubois@humanitas.ucsb.edu])'''
| + | *[[CSLU]] (TDPRC/W) |
| − | The Corpus of Spoken American English (CSAE) project at UCSB has developed and released [<u>several tools-http://linguistics.ucsb.edu/resources/computing/download/download.htm]</u>, including VoiceWalker, a transcription tool for audio and video, and SoundWriter, which permits aligning portions of transcripts with sound files via SMPTE time codes. They have also developed their own set of [<u>transcription conventions-http://www.linguistics.ucsb.edu/research/sbcorpus/conventions/index.html]</u>.
| + | *[[CWB/CQP]] (TP/U) |
| − | |-
| + | '''D''' |
| − | |'''TDPRC
| + | *[[DAISY]] (FTP/U,W) |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[CSLU-http://cslu.cse.ogi.edu/toolkit] ([ Kal Shobaki, Jacques De Villiers-mailto:shobaki@cse.ogi.edu,jacques@cse.ogi.edu])'''
| + | *[[DAMSL]] (FTRC/U,W) |
| − | The CSLU Speech Toolkit, developed at the Center for Spoken Language Understanding (CSLU) has a complete set of free tools for collecting and transcribing speech. It contains an interactive speech display program (Speech View) which allows the user to align transcripts with the sound files. The toolkit also contains a complete [<u>course-http://cslu.cse.ogi.edu/tutordemos]</u> in spectrogram reading and acoustic phonetics, a speech recognition engine, a speech synthesizer based on the [<u>Festival architecture-http://www.ldc.upenn.edu/annotation/]</u>, a facial animation component, and an integration tool for authoring your own spoken language system. CSLU also has an ascii encoding for phonetic transcriptions ([<u> large postscript file-ftp://speech.cse.ogi.edu/pub/docs/worldbet.ps]</u>).
| + | *[[Delta]] (TP/U,W) |
| − | |-
| + | *[[Dexter]] (T) |
| − | |'''FTP
| + | *[[DRI]] (R) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[DAISY Consortium-http://www.daisy.org/] ([Ingar Beckman Hirschfeldt-mailto:ingar.beckman@tpb.se])'''
| + | '''E''' |
| − | The DAISY Consortium is a worldwide coalition of libraries and institutions serving print disabled persons, developing the open standards, tools, and techniques for the next generation of "digital talking books" (DTB). Standard [<u>2.0-http://www.daisy.org/dtbook/spec/2/final/daisy-201.html]</u> is available. Information about the process can be found in a [<u>CSUN98 presentation-http://www.dinf.org/csun_98/csun98_065.htm]</u>.
| + | *[[EAGLES]] (FR) |
| − | |-
| + | *[[ELAN]] (FTD) |
| − | |'''FTRC
| + | *[[E-MELD]] (R) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[DAMSL-http://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html] / [SWBD-DAMSL-http://stripe.colorado.edu/%7Ejurafsky/manual.august1.html] ([James Allen-mailto:jamesall@linc.cis.upenn.edu], [Dan Jurafsky-mailto:jurafsky@Colorado.edu])'''
| + | *[[Emu]] (FTDP/U,W) |
| − | DAMSL - Dialog Act Markup in Several Layers - `defines a set of primitive communicative actions that can be used to analyze dialogs<nowiki>‘</nowiki>. The structure of DAMSL is the result of work by the Multiparty Discourse Group at the [<u>DRI-http://www.ldc.upenn.edu/annotation/]</u> meetings. SWBD-DAMSL was "designed as an augmentation to the <nowiki>[</nowiki>DAMSL<nowiki>]</nowiki> tag-set", for the purpose of annotating (a portion of) the Switchboard corpus. A page of links to online papers is available [<u>here-http://www.cs.rochester.edu/research/cisd/resources/damsl/papers.html]</u>. Some dialogues in the [<u>TRAINS Project-http://www.cs.rochester.edu/research/trains/]</u> have been marked up using DAMSL annotation - pointers [<u>here-http://www.cs.rochester.edu/research/trains/annotation/]</u>. There is also a dialog annotation tool called [<u>dat-http://www.cs.rochester.edu/research/trains/annotation/]</u>.
| + | *[[EXMARaLDA]] (FTDP/U,W,M) |
| − | |-
| + | '''F''' |
| − | |'''TP
| + | *[[Festival]] (TD/U) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[Delta-http://www.eloq.com/White1297-25.htm] ([Susan Hertz-mailto:hertz@eloq.com])'''
| + | *[[FLEX (Fieldworks Language Explorer)]] |
| − | <u>[Eloquent Technologies-http://www.eloq.com/]</u> has developed a [<u>text-to-speech toolkit-http://www.eloq.com/tts.html]</u> which synthesizes speech by means of a multi-tiered representation of the text called a Delta.
| + | *[[FORM]] (C) |
| − | |-
| + | *[[FSA's]] (TD) |
| − | |'''R'''||'''[DRI-http://www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html] ([Susann Luperfoy-mailto:luperfoy@iet.com])'''
| + | '''G''' |
| − | The Discourse Resource Initiative is an `international grass-roots effort that seeks to share corpora that have been tagged with the core features of interest to the discourse community.<nowiki>‘</nowiki> DRI brings together information on documents, corpora and software associated with discourse research, including a list of links to [<u>annotation tools-http://www.georgetown.edu/luperfoy/Discourse-Treebank/tools-and-resources.html]</u>. DRI includes a [<u>Multiparty Discourse Group-http://www.cs.rochester.edu/research/trains/annotation/]</u>. The [<u>COCONUT-http://www.isp.pitt.edu/%7Eintgen/]</u> project (COoperative, COordinated Natural language UTterances) has adapted the DRI scheme in order to deal with cooperative planning environments, and there is a page of pointers to [<u>online papers-http://www.isp.pitt.edu/%7Eintgen/research-papers.html]</u>.
| + | *[[GATE]] (FTDP/U) |
| − | |-
| + | *[[Gsearch]] (T/U) |
| − | |'''FR'''||'''[EAGLES Spoken Language Working Group-http://coral.lili.uni-bielefeld.de/EAGLES/] ([Dafydd Gibbon-mailto:gibbon@asl.uni-bielefeld.de]) '''
| + | '''H''' |
| − | The Spoken Language Working Group of [<u>EAGLES-http://www.ilc.pi.cnr.it/EAGLES96/home.html]</u> (the European Commission<nowiki>‘</nowiki>s Expert Advisory Group on Language Engineering Standards) has produced a ''Handbook of Standards and Resources for Spoken Language Systems'', "with information and recommendations on established practice in the area of spoken language system development in multilingual environments." The EAGLES project also comprised working groups on working standards for corpora, lexicons, formalisms and evaluation. The EAGLES-II project included further work on semantic annotation and dialogue annotation.
| + | *[[HIAT]] (FTDPRC/W) |
| − | |-
| + | *[[HIAT-DOS]] (T) ([[HIAT-DOS (Review)|Review]]) |
| − | |'''FTDP
| + | *[[Hyperlex]] (TP/U) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[Emu-http://www.shlrc.mq.edu.au/emu/] ([Steve Cassidy, Jonathan Harrington-mailto:steve@srsuna.shlrc.mq.edu.au,jmh@srsuna.shlrc.mq.edu.au])'''
| + | '''I''' |
| − | The Emu system offers consistent access to diverse speech databases, with facilities for easy extraction of statistics, and support for database creation as well. Emu permits complex multi-tiered and hierarchical structures, and these can be built using a combination of manual and automatic annotation. An [<u>Emu script-http://www.shlrc.mq.edu.au/emu/partitur/partitur.html]</u> has been written to import certain [<u>Partitur-http://www.ldc.upenn.edu/annotation/]</u> files, lending support to Steve Cassidy<nowiki>‘</nowiki>s contention that (in terms of annotations proper, leaving out background information about recording, speaker etc.) the Emu framework is able to express the information content of Partitur annotations.
| + | *[[Intex]] (F/U,W,M) |
| − | |-
| + | *[[ISIP]] (TDP/U) |
| − | |'''TD
| + | *[[ISLE]] |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[Festival-http://www.cstr.ed.ac.uk/projects/festival/architecture.html] ([Paul Taylor-mailto:pault@cstr.ed.ac.uk])'''
| + | '''L''' |
| − | The Festival Speech Architecture from CSTR at Edinburgh, which was originally designed for use in speech synthesis, has been generalized and applied to database analysis as well. Festival uses [<u>Heterogeneous Relation Graphs-http://www.cstr.ed.ac.uk/publications/new/draft/Taylor_draft_a.ps]</u> for representing linguistic information.
| + | *[[LACITO]] Linguistic Data Archiving Project (Boyd Michailovsky, John B. Lowe, Michel Jacobson) (FTD) |
| − | |-
| + | *[[LAF]] Linguistic Annotation Framework |
| − | |'''TD'''||'''FSA: Finite State Automata '''
| + | *[[LDC]] (FTDPRC) |
| − | There has been development of concepts and toolkits for n-ary finite automata, which might provide a useful model for expression, creation and search of multidimensional linguistic annotations. We can cite the following toolkits: [<u>LADL-http://www.ladl.jussieu.fr/INTEX/index.html]</u>, [<u>Xerox-http://www.xrce.xerox.com/research/mltt/fst]</u>, [<u>AT&T-http://www.research.att.com/sw/tools/fsm]</u>, [<u>van Noord-http://odur.let.rug.nl/%7Evannoord/fsa/fsa.html]</u> and [<u>Grail-http://www.csd.uwo.ca/research/grail/]</u> though as far as we know, these have not been used for general manipulation of speech annotations, and may not be suitable in detail for this purpose.
| + | *[[LT]] (T/U,W) |
| − | |-
| + | '''M''' |
| − | |'''FTDP
| + | *[[MacShapa]] (TP) |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[GATE-http://www.dcs.shef.ac.uk/research/groups/nlp/gate/] ([Hamish Cunningham, Kevin Humphreys-mailto:gate@dcs.shef.ac.uk])'''
| + | *[[MacVissta]] (TD) |
| − | GATE has an implementation of the [<u>Tipster architecture-http://www.ldc.upenn.edu/annotation/]</u>, plus graphical tools for data visualisation, annotation, evaluation and process control. It is distributed free for research and with Information Extraction software.
| + | *[[MATE]] (FT) |
| − | |-
| + | *[[MediaStreams]] (P) |
| − | |'''T
| + | *[[MediaTagger]] (P) |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[Gsearch-http://www.hcrc.ed.ac.uk/gsearch/] ([Frank Keller-mailto:keller@cogsci.ed.ac.uk])'''
| + | *[[MICASE]] (TDC/W) |
| − | Gsearch is a tool for searching tagged corpora. Queries are formulated in two stages. First, the user specifies a context-free grammar which is used to parse a given corpus and to convert its tags into a standard set. Second, a search expression uses the words, terminals and non-terminals provided by the corpus and grammar. Structured output may be visualized with Ratnaparkhi<nowiki>‘</nowiki>s [<u>Viewtree-http://www.cis.upenn.edu/%7Eadwait/tools/viewtree]</u> program or Calder<nowiki>‘</nowiki>s [<u>Thistle-http://www.ltg.ed.ac.uk/software/thistle/]</u> system. Gsearch currently works with BNC, Brown, [<u>SUSANNE-http://www.ldc.upenn.edu/annotation/]</u>, WSJ, Frankfurter Rundschau, and [<u>NEGRA-http://www.ldc.upenn.edu/annotation/]</u>.
| + | *[[MMAX]] (TD) |
| − | |-
| + | *[[MPEG]] (FPR) |
| − | |'''FTDPRC
| + | *[[MPI]] (FT/U,W,M) |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[HIAT-http://www.daf.uni-muenchen.de/HIAT/HIAT.HTM] ([Konrad Ehlich, Jochen Rehbein-mailto:Ehlich@Daf.Uni-Muenchen.de,Rehbein@rrz.Uni-Hamburg.De]) '''
| + | *[[Multitext]] (F) |
| − | HIAT is a transcription system based on a score notation, developed in the 1970<nowiki>‘</nowiki>s by Ehlich and Rehbein, and widely used in Europe. The acronym stands for Halbinterpretative Arbeitstranskriptionen, or "semi-interpretative working transcription." Dafydd Gibbon is credited with the English name "Heuristic Interpretative Auditory Transcription," which preserves the acronym. The HIAT philosophy includes the notion of ''literary transcription'' (''literarische Umschrift''), which "involves systematic departures from the standard orthographic rendering of an item but in a manner that is meaningful to someone familiar with the orthographic system as a whole." Methods are provided for annotating prosody, non-verbal communication, and so on. PC and Mac software is available. An English-language description can be found in Edwards and Lampert, [<u>Talking Data-http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0805803491]</u>.
| + | '''N''' |
| − | |-
| + | *[[NEGRA]] (FTPC/U) |
| − | |'''TP
| + | *[[NITE]] |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[Hyperlex-http://www.ldc.upenn.edu/hyperlex/] ([Steven Bird-mailto:sb@unagi.cis.upenn.edu])'''
| + | '''O''' |
| − | Steven Bird<nowiki>‘</nowiki>s Hyperlex system, developed in support of a field project, provides HTML-mediated access to a lexicon, speech recordings and paradigmatic catalogues for several languages. Steven plans to produce a portable version that can easily be adapted to new languages and new projects.
| + | *[[Observer]] (T/W) |
| − | |-
| + | '''P''' |
| − | |'''F
| + | *[[Partitur]] (FT) |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[INTEX-http://www.bestweb.net/%7Eintex] ([Max Silberztein-mailto:silberz@ladl.jussieu.fr])'''
| + | *[[PAULA]] (F) |
| − | INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. INTEX can build lemmatized concordances and indices of large texts with respect to all types of Finite State patterns. INTEX is used as an information retrieval system, to analyze literary texts, to quantify language variations, to teach second languages, as a terminological extractor, etc. Large coverage linguistic resources are already available for English, French, German, Greek, Italian, Polish, Portuguese.
| + | *[[Praat]] (TD/U,W,M) |
| − | |-
| + | '''R''' |
| − | |''' '''||'''[ISLE-http://www.ilc.pi.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm] ([Antonio Zampolli-mailto:eagles@ilc.pi.cnr.it])'''
| + | *[[RSTTool]] (TD) |
| − | The ISLE project is funded for two years by the NSF and the EC, as part of a joint program [<u>Multilingual Information Access and Management-http://www.nsf.gov/cgi-bin/getpub?nsf99102]</u>. This program is `intended to further the knowledge required to build information systems that operate in multiple languages; to provide the technologies required for their application in a number of social and organizational contexts; and to demonstrate the validity of the approaches chosen.<nowiki>‘</nowiki> An ISLE workshop was held at LREC-2, on [<u>Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language Resources-http://www.mpi.nl/world/ISLE/]</u>. US-ISLE includes a [<u>Spoken Language Group-http://www.ldc.upenn.edu/sb/isle.html]</u>. <nowiki>[</nowiki>[<u>NSF award-https://www.fastlane.nsf.gov/servlet/showaward?award=9910603]</u><nowiki>]</nowiki>
| + | '''S''' |
| − | |-
| + | *[[SABLE]] (FP) |
| − | |'''TDP
| + | *[[SAMPA]] (C) |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[ISIP-http://www.isip.msstate.edu/] ([Joe Picone-mailto:picone@ee.msstate.edu])'''
| + | *[[SGREP]] (TDP/U,W) |
| − | Joe Picone and others at the Institute for Signal and Information Processing (ISIP) at Mississippi State have produced nice freeware tools initially optimized for segmenting, transcribing and annotating telephone conversations: [<u>Segmenter-http://www.isip.msstate.edu/projects/speech/software/swb_segmenter/index.html]</u>, [<u>Transcriber-http://www.isip.msstate.edu/projects/speech/software/transcriber/index.html]</u>.
| + | *[[SignStream]] (TDP/M) |
| − | |-
| + | *[[SIL]] (TDPF/W,M) |
| − | |'''FTDPRC'''||'''[LDC-http://www.ldc.upenn.edu/] ([David Graff, Chris Cieri, Mark Liberman-mailto:graff@unagi.cis.upenn.edu,ccieri@unagi.cis.upenn.edu,myl@unagi.cis.upenn.edu])'''
| + | *[[SLAM]] (TDP/W) |
| − | The Linguistic Data Consortium has developed a range of (mainly SGML-based) formats for transcripts and other types of annotation that it has published (See below for [<u>NIST<nowiki>‘</nowiki>s UTF format-http://www.ldc.upenn.edu/annotation/]</u>, which provides a combined framework for several of these existing formats). Some online documentation is available for individual corpora authored at different times by different groups, e.g. [<u>Switchboard-http://morph.ldc.upenn.edu/doc/switchboard/manual.doc]</u> at TI in 1991, [<u>Trains-http://www.cs.rochester.edu/research/speech/dialog.html]</u> at Rochester in 1992-3, etc, as well as a [<u>general SGML-http://www.ldc.upenn.edu/kkarins/transpec.html]</u> transcription specification currently used for (orthographic) transcription of telephone conversations and broadcast news recordings. The LDC has also implemented a general data model for searching annotated text and speech corpora online, via [<u>LDC-Online-http://www.ldc.upenn.edu/lol]</u>.
| + | *[[SMDL]] (P) |
| − | |-
| + | *[[SNACK]] (TDP/U,W,M) |
| − | |'''T
| + | *[[SUSANNE]] (CP) |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[LT XML-http://www.ltg.ed.ac.uk/software/xml/] ()'''
| + | *[[SyncWriter]] (T) ([[SyncWriter (Review)|Review]]) |
| − | <u>[sggrep-http://www.ltg.ed.ac.uk/corpora/xmldoc/release/r634.htm]</u> - an XML-aware grep tool.
| + | '''T''' |
| − | |-
| + | *[[TalkBank]] (R) |
| − | |'''FT'''||'''[MATE-http://mate.nis.sdu.dk/] ([Laila Dybkjaer-mailto:laila@nis.sdu.dk])'''
| + | *[[TASX]] (TD/U,W,M) |
| − | The multi-partner EC-funded MATE project (Telematics Project LE4-8370) aims to develop an SGML-based standard for annotating spoken dialogue corpora, and tools to "make the processes of knowledge acquisition and extraction more efficient." [<u>Deliverable D1.1-http://mate.nis.sdu.dk/about/D1.1/]</u> surveys a wide variety of annotation schemes, and includes a useful [<u>page of links-http://www.dfki.de/mate/d11/annex.html]</u>. MATE implements Thompson and McKelvie<nowiki>‘</nowiki>s notion of [<u>standoff markup-http://www.ltg.ed.ac.uk/%7Eht/sgmleu97.html]</u>, which is related to the [<u>CES-http://www.ldc.upenn.edu/annotation/]</u> and [<u>TEI-http://www.ldc.upenn.edu/annotation/]</u> proposals.
| + | *[[TEI]] (F) |
| − | |-
| + | *[[Tipster]] (F) |
| − | |'''TDC
| + | *[[Transcriber]] (TDP/U,W,M) ([[Transcriber (Review)|Review]]) |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[MICASE-http://www.lsa.umich.edu/eli/micase/micase.htm] ([Rita Simpson-mailto:ritacsim@umich.edu])'''
| + | *[[Transana]] (T) ([[Transana (Review)|Review]]) |
| − | The Michigan Corpus of Academic Spoken English (MICASE) is an ongoing project to record and transcribe a diverse range of academic speech, based at the [<u>English Language Institute-http://www.lsa.umich.edu/eli/]</u> of the University of Michigan. To support transcription work, a freeware Windows 95 program has been developed, called [<u>SoundScriber-http://www.lsa.umich.edu/eli/micase/micase.htm]</u>.
| + | *[[Transformer]] (TDP) |
| − | |-
| + | *[[TransTool]] (TD/U,W) |
| − | |'''FPR'''||'''[MPEG-http://drogo.cselt.it/mpeg/] ([Leonardo Chiariglione-mailto:leonardo.chiariglione@cselt.it])'''
| + | *[[Treebank]] (C) |
| − | <u>[MPEG-4-http://drogo.cselt.it/mpeg/standards/mpeg-4/mpeg-4.htm]</u>, established as an ISO/IEC standard in early 1999, provides standards for streaming interactive multimedia. More specifically, it aims to support production, distribution and content access for digital television, interactive graphics, and interactive multimedia. It provides representations for basic "media objects," which might be recorded or synthetic; it describes the composition of compound media objects, or "scenes"; it provides for multiplexing and synchronizing of such data for network transport; and it offers standards for interaction with the audiovisual scenes generated at the receiver. MPEG-4<nowiki>‘</nowiki>s media objects include text, graphics, talking synthetic heads and associated text, synthetic sounds, still images, video and audio elements. It supports complex combination of these elements into time-varying scenes, and also provides for streaming of the underlying data, and interaction with the receiver. Recently, MPEG-4 has been partly [<u>integrated with Apple<nowiki>‘</nowiki>s Quicktime file format-http://www.internetwk.com/news/news0211-15.htm]</u>, using QuickTime as "the starting point for the development of a unified digital media storage format for the MPEG-4 specification." How much of MPEG-4<nowiki>‘</nowiki>s ambitious program has been fully defined or implemented remains unclear.
| + | *[[TSNLP]] (FT) |
| − | <u>[MPEG-7-http://drogo.cselt.it/mpeg/standards/mpeg-7/mpeg-7.zip]</u> is `the content representation standard for multimedia information search, filtering, management and processing, (to be approved July 2001).<nowiki>‘</nowiki> Draper discusses the need for multiple non-commensurable structures for video annotation in his paper [<u>MPEG-7 and IR-http://www.psy.gla.ac.uk/%7Esteve/mpeg7.html]</u>.
| + | *[[TUSNELDA]] |
| − | |-
| + | '''U''' |
| − | |'''FT
| + | *[[Unicode]] (RC) |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[Linguistic Applications at MPI-http://www.mpi.nl/world/tg/lapp/lapp.html] ([Peter Wittenburg-mailto:peter.wittenburg@mpi.nl])'''
| + | '''V''' |
| − | The [<u>MPI Language and Cognition Group-http://www.mpi.nl/world/groups/lcog.html]</u> have produced a variety of tools for working with annotated speech and video data. [<u>EUDICO-http://www.mpi.nl/world/tg/lapp/eudico/eudico.html]</u> (European Distributed Corpora Project) is a universal workbench for corpus linguistics, written in Java and operating with a variety of formats, including [<u>CHAT-http://www.ldc.upenn.edu/annotation/]</u>, [<u>Shoebox-http://www.ldc.upenn.edu/annotation/]</u>, [<u>Tipster-http://www.ldc.upenn.edu/annotation/]</u>, and various relational database formats. [<u>CAVA-http://www.mpi.nl/world/tg/CAVA/CAVA.html]</u> (Computer Assisted Video Analysis) is a suite of programs intended for scientists in the humanities, including a [<u>Transcription Editor-http://www.mpi.nl/world/tg/CAVA/ted/ted.html]</u> for digital transcription of analog video on PCs, and a Macintosh program called [<u>MediaTagger-http://www.mpi.nl/world/tg/CAVA/mt/MTandDB.html]</u>, for creating and searching multi-tier annotation of digital video in QuickTime format. A new multi-platform tool for managing CHAT-like transcriptions and aligned speech data, the [<u>Spoken Childes Tool-http://www.mpi.nl/world/tg/spoken-childes/spoken-childes.html]</u>, is available. Many of these tools are available free to academic sites.
| + | *[[Verbmobil]] (FC) |
| − | |-
| + | *[[VisLab]] (TDP) |
| − | |'''F'''||'''[Multitext Project-http://wheat.uwaterloo.ca/] ([ Gord Cormack, Forbes Burkowski, Charlie Clarke -mailto:cormack@plg.uwaterloo.ca,fjburkow@plg.uwaterloo.ca,clclarke@eecg.toronto.edu])'''
| + | *[[vPrism]] (T/W) |
| − | The MultiText Project is concerned with developing techniques for the indexing and retrieval of massive text corpora. Queries can refer to document structure, even though this is annotated in different ways in different corpora.
| + | |
| − | |-
| + | =Key= |
| − | |'''FTPC
| + | |
| − | <nowiki>[</nowiki>U<nowiki>]</nowiki>'''||'''[NEGRA Corpus-http://www.coli.uni-sb.de/sfb378/negra-corpus/] ([ Thorsten Brants-mailto:thorsten@coli.uni-sb.de])'''
| + | F: a systematically-documented annotation format |
| − | The NEGRA corpus consists of approximately 10,000 sentences of German newspaper text. The corpus is a type of treebank, but with a novel [<u>annotation scheme for discontinuous constituents-http://www.coli.uni-sb.de/%7Ethorsten/publications/Skut-ea-ANLP97.ps.gz]</u>. An example tree showing the visual format, the annotation format, and the [<u>Penn Treebank-http://www.ldc.upenn.edu/annotation/]</u> equivalent, is available [<u>here-http://www.coli.uni-sb.de/sfb378/negra-corpus/sentno3.html]</u>. [<u>Annotate-http://www.coli.uni-sb.de/sfb378/negra-corpus/annotate.html]</u> is a sophisticated tool which supports human-machine collaboration on the construction of syntactic trees.
| + | |
| − | |-
| + | T: an available tool for creation, display or search (W=Windows, U=Unix, M=MacOS) |
| − | |''' '''||'''NITE: Natural Interactivity Tools Engineering ([Laila Dybkjaer-mailto:laila@nis.sdu.dk]''')
| + | |
| − | NITE aims to build a "standard-setting workbench for multi-level, cross-level and cross-modality annotation, retrieval and practical exploitation of multi-party human-human and human-machine natural interactive dialogue data."
| + | D: a tool is downloadable |
| − | |-
| + | |
| − | |'''T
| + | P: there is a citeable paper which documents the format/system |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[The Observer-http://www.noldus.com/products/observer/index.html] ([Lucas Noldus-mailto:l.noldus@noldus.nl])'''
| + | |
| − | `The Observer<nowiki>‘</nowiki> is a commercial tool for classifying and logging events. In the video version of the tool, one can create time-aligned annotations of video recordings, using the [<u>Event Recorder-http://www.noldus.com/products/observer/shorttour/obst_er.html]</u>. The temporal patterning of the observations can be displayed using the [<u>time-event plot-http://www.noldus.com/products/observer/shorttour/obst_sa.html]</u>, and various summary statistics can be generated. The software was developed by [<u>Noldus IT-http://www.noldus.com/]</u> and runs on MS Windows platforms.
| + | R: other kinds of resource, such as books and associations |
| − | |-
| + | |
| − | |'''FT'''||'''[Partitur-http://www.phonetik.uni-muenchen.de/Bas/BasFormatseng.html] ([Florian Schiel, Christoph Draxler-mailto:schiel@phonetik.uni-muenchen.de,draxler@phonetik.uni-muenchen.de])'''
| + | C: methods and standards for transcribing content |
| − | The [<u>Bavarian Archive of Speech Signals-http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html]</u> has created the Partitur format based on their experience with a variety of speech databases. The aim has been to create ``an open (that is extensible), robust format to represent results from many different research labs in a common source.<nowiki>‘</nowiki><nowiki>‘</nowiki>
| + | |
| − | |-
| + | |
| − | |'''TD
| + | |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[Praat-http://www.praat.org/] ([Paul Boersma-mailto:boersma@fon.hum.uva.nl])'''
| + | |
| − | The Praat system offers a variety of nice tools for interacting with speech data, including tools for transcribing and annotating on multiple tiers.
| + | |
| − | |-
| + | |
| − | |'''FP'''||'''[SABLE-http://www.bell-labs.com/project/tts/sable.html] ([Andrew Hunt, Richard Sproat, Paul Taylor-mailto:hunt@east.sun.com,rws@research.bell-labs.com,pault@cstr.ed.ac.uk])'''
| + | |
| − | The SABLE standard for annotation of linguistic properties of speech synthesis input necessarily shares a lot of characteristics with systems for linguistic annotation of naturally produced speech.
| + | |
| − | |-
| + | |
| − | |'''C'''||'''[SAMPA-http://www.phon.ucl.ac.uk/home/sampa/home.htm] ([John Wells-mailto:j.wells@ucl.ac.uk]) '''
| + | |
| − | SAMPA (Speech Assessment Methods Phonetic Alphabet) is a machine-readable ASCII transliteration of the [<u>International Phonetic Alphabet-http://www.arts.gla.ac.uk/IPA/ipa.html]</u>. Originally developed by phoneticians to code six European languages, it is currently being extended to cover many more languages. [<u>SAMPROSA-http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm]</u> is an extension for transcribing prosodic information, and [<u>XSAMPA-http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm]</u> is an extension which covers every symbol on the IPA chart, in principle allowing transcription of all the world<nowiki>‘</nowiki>s languages. See also the [<u>CSLU Worldbet-http://www.ldc.upenn.edu/annotation/]</u>.
| + | |
| − | |-
| + | |
| − | |'''TDP
| + | |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[SGREP-http://www.cs.helsinki.fi/%7Ejjaakkol/sgrep.html] ([Jani Jaakkola, Pekka Kilpeläinen-mailto:jjaakkol@cs.helsinki.fi,Pekka.Kilpelainen@cs.helsinki.fi]) '''
| + | |
| − | SGREP (structured grep) is a tool for searching and indexing text, SGML, XML and HTML files and filtering text streams using structural criteria. The data model of sgrep is based on regions, which are nonempty substrings of text. Regions are typically occurrences of constant strings, SGML-tags, or meaningful text elements, which are recognizable through some delimiting strings or the built-in SGML, XML and HTML parser. Regions can be arbitrarily long, arbitrarily overlapping, and arbitrarily nested. There is also a [<u>paper-http://www.cs.helsinki.fi/TR/C-1996/83/]</u> which would be useful for anyone wishing to use SGREP.
| + | |
| − | |-
| + | |
| − | |'''TDP
| + | |
| − | <nowiki>[</nowiki>M<nowiki>]</nowiki>'''||'''[SignStream-http://web.bu.edu/asllrp/SignStream/] ([Carol Neidle, Dawn MacLaughlin-mailto:carol@louis-xiv.bu.edu,dawn@louis-xiv.bu.edu])'''
| + | |
| − | The SignStream project aims to develop a database tool for transcription and analysis of video-based language data (particularly, signed language data). SignStream allows a user to enter data into any number of user-definable fields, where each datum is associated with a start and end frame of a video. Although a SignStream database is stored in a non-readable, binary format, the program includes a text export feature. However, there is no import feature. The program is currently being distributed at cost to researchers, educators, and students.
| + | |
| − | |-
| + | |
| − | |'''TDPF
| + | |
| − | <nowiki>[</nowiki>WM<nowiki>]</nowiki>'''||'''[SIL-http://www.sil.org/computing/sil_computing.html] ([Larry Hayashi, Gary Simons, Terry Gibbs-mailto:Larry_Hayashi@sil.org,Gary.Simons@sil.org,Terry_Gibbs@sil.org])'''
| + | |
| − | The Summer Institute of Linguistics (SIL) has enormous experience in providing tools and data formats for use in primary linguistic description. [<u>LinguaLinks-http://www.sil.org/lingualinks/]</u> is "an electronic productivity support system for language workers," based on the [<u>CELLAR-http://www.sil.org/cellar/]</u> object-oriented "Computing Environment for Linguistic, Literary and Anthropological Research," and includes [<u>linguistics tools-http://www.sil.org/lingualinks/LingTool.html]</u>. Other SIL software tools are [<u>Speech Analyser and Speech Manager-http://www.sil.org/computing/speechtools/]</u>, Windows programs for labelling speech files and for searching a database of labelled speech files; and [<u>Shoebox-http://www.sil.org/computing/shoebox.html]</u>, for interlinear text annotation, along with its predecessor, [<u>IT-gopher://gopher.sil.org/11/gopher_root/computing/software/linguistics/text_analysis/it/]</u>. SIL also has an SGML-based annotation format named PTEXT ("parsed text"), described in this [<u>paper-http://gamma.sil.org/silewp/1997/008/]</u>.
| + | |
| − | |-
| + | |
| − | |'''TDP
| + | |
| − | <nowiki>[</nowiki>W<nowiki>]</nowiki>'''||'''[SLAM-http://nts.csrf.pd.cnr.it/IFeD/Pages/slam.htm] ([Piero Cosi-mailto:cosi@csrf.pd.cnr.it])'''
| + | |
| − | The `Segmentation and Labelling Automatic Module<nowiki>‘</nowiki> (SLAM) is a tool for semi-automatic segmentation and labelling of speech signals. The tool was developed at the [<u>Institute of Phonetics and Dialectology-http://nts.csrf.pd.cnr.it/IFeD/]</u>, and runs under MS Windows.
| + | |
| − | |-
| + | |
| − | |'''P'''||'''[SMDL-http://www.student.brad.ac.uk/srmounce/smdl.html] ([Stephen R. Mounce-mailto:S.R.Mounce@comp.brad.ac.uk])'''
| + | |
| − | The annotation of instrumental and vocal music has some interesting similarities with the annotation of speech. See the [<u>Music Encoding Standards-http://www.student.brad.ac.uk/srmounce/encoding.html]</u> page for general pointers. There is a proposed standard, the Standard Music Description Language (SMDL), and the [<u>proposal-ftp://ftp.techno.com/pub/SMDL/10743.ps]</u> is available as well.
| + | |
| − | |-
| + | |
| − | |'''TDP
| + | |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[Snack-http://www.speech.kth.se/snack/] ([Kåre Sjölander-mailto:kare@speech.kth.se])'''
| + | |
| − | Snack is a general toolkit to handle acoustic data with an emphasis on speech. It features real-time visualization, supports many file-formats, and is extensible. A speech signal viewer and phonetic label editor is included in the package. There is also a Snack "plug-in" for web browsers. [<u>Wavesurfer-http://www.speech.kth.se/wavesurfer/]</u> is a tool based on Snack.
| + | |
| − | |-
| + | |
| − | |'''CP'''||'''[SUSANNE-http://www.cogs.susx.ac.uk/users/geoffs/RSue.html] ([Geoffrey Sampson-mailto:geoffs@cogs.susx.ac.uk])'''
| + | |
| − | The SUSANNE annotation scheme provides a detailed encoding of the logical and surface grammar of English. The SUSANNE corpus/treebank, which is freely available, contains a subset of the Brown corpus that has been marked up according to this scheme. The [<u>CHRISTINE-http://www.cogs.susx.ac.uk/users/geoffs/RChristine.html]</u> project sets out to expand the SUSANNE analytic scheme and corpus to cover spoken English, and particularly spontaneous, informal spoken English.
| + | |
| − | |-
| + | |
| − | |'''R'''||'''[TalkBank-http://www.talkbank.org/] ([ Brian MacWhinney, Steven Bird, Mark Liberman, Peter Buneman-mailto:macw@cmu.edu,sb@ldc.upenn.edu,myl@unagi.cis.upenn.edu,peter@cis.upenn.edu])'''
| + | |
| − | TalkBank is a new research project whose goal is to foster fundamental research in the study of human and animal communication by providing standards and tools for creating, searching, and publishing primary linguistic materials on the Internet. The project grows out of the experience with [<u>CHILDES-http://www.ldc.upenn.edu/annotation/]</u> and [<u>LDC-http://www.ldc.upenn.edu/annotation/]</u> corpora, and aims to support the methodologies and notations of a broad range of disciplines including: classroom discourse, conversation analysis, discourse, language acquisition, gesture, signed languages, ethology, anthropology, field linguistics, speech analysis. TalkBank is using the [<u>ATLAS-http://www.ldc.upenn.edu/annotation/]</u> model.
| + | |
| − | |-
| + | |
| − | |'''TD
| + | |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[TASX: Time Aligned Signal data eXchange format-http://coli.lili.uni-bielefeld.de/%7Emilde/tasx/] ([Jan-Torsten Milde-mailto:milde@coli.uni-bielefeld.de]''')
| + | |
| − | TASX provides a general framework for creating and managing corpora, including XML-based annotation of the multimodal data, transformation of non-XML annotations, and web-based analysis and dissemination of the data. TASX-annotator is a user-friendly program for multilevel annotation and transcription of (multi-channel) video and audio data. TASX-annotator runs under WinXX (98, XP, 2000), Linux, Solaris and MacOS, and is distributed under an open source license.
| + | |
| − | |-
| + | |
| − | |'''F'''||'''[TEI-http://www.tei-c.org/] ([Lou Burnard-mailto:lou.burnard@oucs.ox.ac.uk])'''
| + | |
| − | The Text Encoding Initiative (TEI) published its first detailed recommendations for the encoding and transcription of all manner of written and spoken materials, using an extensible SGML framework, in 1994. This was widely used by many academic research projects and digitization initiatives ([<u>examples-http://www.tei-c.org/Applications]</u>) and was also very influential in many corpus creation projects (e.g. BNC, Parole, Multex, Silfide) as well as in defining [<u>EAGLES-http://www.ldc.upenn.edu/annotation/]</u> and other standards. A membership consortium was formed in 2000 to continue maintenance and development of the TEI, and the most recent recommendations (P5) are expressed using a modular XML framework, which supports most kinds of linguistic annotation.
| + | |
| − | |-
| + | |
| − | |'''F'''||'''[Tipster-http://www.nist.gov/itl/div894/894.02/related_projects/tipster] ([Ralph Grishman, Robert Gaizauskas, Hamish Cunningham, Remi Zajac-mailto:grishman@grimm.nyu.edu,roberg@dcs.shef.ac.uk,hamish@dcs.shef.ac.uk,rzajac@crl.nmsu.edu])'''
| + | |
| − | This concerns annotation of text rather than speech, but has several interesting properties.
| + | |
| − | |-
| + | |
| − | |'''TDP
| + | |
| − | <nowiki>[</nowiki>UWM<nowiki>]</nowiki>'''||'''[Transcriber-http://www.ldc.upenn.edu/mirror/Transcriber] ([Claude Barras, Edouard Geoffrois-mailto:barras@etca.fr,Edouard.Geoffrois@etca.fr])'''
| + | |
| − | Transcriber is free software for transcribing and annotating digital audio, aimed initially at transcription of broadcast news data. Its user interface is written in Tcl/Tk. It uses the same transcription formats as the LDC<nowiki>‘</nowiki>s Broadcast News data, and has also been adapted for XML I/O. It was developed by Claude Barras and Edouard Geoffrois, at DGA in Paris, in collaboration with LDC. A new version of Transcriber will be based on the [<u>annotation graph-http://www.ldc.upenn.edu/sb/home/publications.html]</u> model, and the plans are described in an [<u>LREC paper-http://www.ldc.upenn.edu/sb/home/publications.html]</u>.
| + | |
| − | |- | + | |
| − | |'''TD
| + | |
| − | <nowiki>[</nowiki>UW<nowiki>]</nowiki>'''||'''[TransTool-http://www.ling.gu.se/SLSA/SLcorpus.html] ([ Jens Allwood, Elisabeth Ahlsén, Joakim Nivre-mailto:jens@ling.gu.se,elisa@ling.gu.se,nivre@ling.gu.se])'''
| + | |
| − | The Swedish Spoken Language Corpus, developed at the [<u>Department of Linguistics-http://www.ling.gu.se/]</u>, Göteborg University, has several interesting [<u>tools-http://www.ling.gu.se/SLSA/tools.html]</u>: Transtool, to aid in transcription; Synchtool, for synchronizing transcriptions with audio and video files; TRASA, a tool for automatic analysis of the corpus; and [<u>TRACTOR-http://www.ling.gu.se/%7Esl/tractor.html]</u>, a tool to support coding.
| + | |
| − | |-
| + | |
| − | |'''C'''||'''[Treebank-http://www.cis.upenn.edu/%7Etreebank/home.html] ([Mitch Marcus, Ann Taylor-mailto:mitch@linc.cis.upenn.edu,ataylor@linc.cis.upenn.edu])'''
| + | |
| − | The Penn Treebank Project has produced semantic and syntactic annotations of naturally occurring text for the Wall Street Journal, Brown, ATIS and Switchboard Corpora. The annotations produced by the Treebank project were published by [<u>LDC-http://www.ldc.upenn.edu/annotation/]</u>. Treebank has two query languages: [<u>tgrep (at LDC-Online)-http://www.ldc.upenn.edu/ldc/online/treebank/]</u> and [<u>CorpusSearch-http://www.ling.upenn.edu/%7Edringe/CorpStuff/Manual/Contents.html]</u>. The principal advantage of tgrep is its speed, and that of CorpusSearch is its ability to pipeline queries together. [<u>Chris Brew-mailto:Chris.Brew@edinburgh.ac.uk]</u> has recently developed an extensible visualisation tool to aid treebank exploration, called [<u>TreeStyle-http://www.ltg.ed.ac.uk/%7Echrisbr/styling-trees.ps]</u>. See also the [<u>NEGRA Corpus-http://www.ldc.upenn.edu/annotation/]</u>. Douglas Rohde has developed a more powerful version of tgrep called [<u>tgrep2-http://www.cs.cmu.edu/%7Edr/Tgrep2/]</u>. Treebanks for other languages are in development, including: [<u>German-http://www.ims.uni-stuttgart.de/projekte/TIGER/]</u>, [<u>Turkish-http://www.ii.metu.edu.tr/%7Ecorpus/treebank/index.html]</u>, [<u>Polish-http://www.ipipan.waw.pl/%7Eagn/CRIT.htm]</u>, [<u>Czech-http://shadow.ms.mff.cuni.cz/pdt/index.html]</u>, [<u>Portuguese-http://cgi.portugues.mct.pt/treebank/PaginaFloresta.html]</u>, [<u>Bulgarian-http://www.bultreebank.org/]</u>, [<u>Chinese-http://www.ldc.upenn.edu/ctb/]</u>, ...
| + | |
| − | |-
| + | |
| − | |'''FT'''||'''[TSNLP-http://cl-www.dfki.uni-sb.de/tsnlp/] ([ Klaus Netter, Doug Arnold, Stephan Oepen-mailto:netter@cl.dfki.uni-sb.de,doug@essex.ac.uk,oe@coli.uni-sb.de])'''
| + | |
| − | `Test Suites for Natural Language Processing<nowiki>‘</nowiki> is a European consortium providing NLP test suite technology and test suite fragments for German, French and English. For a description of the TSNLP annotation scheme, see their [<u>WP2.2-http://cl-www.dfki.uni-sb.de/tsnlp/publications.html]</u>. See section 5 for examples. The annotation scheme includes sentential information (such as a well-formedness score) and also word and sub-string annotation which is stored in tabular form (for analytical information including syntactic constituents and error descriptions). There is a [<u>web interface-http://tsnlp.dfki.uni-sb.de/tsnlp/tsdb/tsdb.cgi]</u> to the test suites. The database schema is described in the [<u>user manual-http://cl-www.dfki.uni-sb.de/tsnlp/manual.html]</u>, volume 2, section 4.
| + | |
| − | |-
| + | |
| − | |''' '''||'''[TUSNELDA: Tübingen Collection of Reusable, Empirical, Linguistic Data Structures -http://www.sfb441.uni-tuebingen.de/tusnelda-engl.html]([Andreas Wagner-mailto:wagner@sfs.nphil.uni-tuebingen.de])'''
| + | |
| − | <u>[SFB-441-http://www.sfb441.uni-tuebingen.de/index-engl.html]</u>, a new research program at the University of Tübingen, is collecting and developing data and annotation resources called TUSNELDA. A corpus annotation standard for TUSNELDA is under development.
| + | |
| − | |-
| + | |
| − | |'''RC'''||'''[Unicode-http://www.unicode.org/] ([info@unicode.org-mailto:info@unicode.org])'''
| + | |
| − | The Unicode Consortium brings together software corporations and researchers at the leading edge of standardizing international character encoding. The outcome of this cooperation is [<u>The Unicode Standard-http://www.unicode.org/unicode/standard/standard.html]</u>, which provides the foundation for internationalization and localization of software. Unicode has a [<u>conference series-http://www.unicode.org/unicode/conference/about-conf.html]</u> and an [<u>FAQ-http://www.unicode.org/unicode/faq/]</u>. There are [<u>character charts-http://charts.unicode.org/]</u>, including one for [<u>IPA extensions-http://charts.unicode.org/Web/U0250.html]</u>.
| + | |
| − | |-
| + | |
| − | |'''FC'''||'''[Verbmobil-http://verbmobil.dfki.de/overview-us.html] ([Wolfgang Wahlster, Reinhard Karger-mailto:wahlster@dfki.de,karger@dfki.de])'''
| + | |
| − | Verbmobil is a large German speech-to-speech translation project for the domains of appointment negotiation, travel planning and hotel reservation. The [<u>Verbmobil annotation project-http://coral.lili.uni-bielefeld.de/%7Evmobil/vm-anno/vm-annotations.html]</u> includes orthography, segmental annotation (with BAS [<u>Partitur-http://www.ldc.upenn.edu/annotation/]</u>) prosody (German ToBI), morphological and POS tagging, semantic annotation and dialog act annotation. [<u>Dafydd Gibbon-mailto:gibbon@spectrum.uni-bielefeld.de]</u> has developed a model for sharing lexical databases in Verbmobil, called [<u>HyprLex-http://coral.lili.uni-bielefeld.de/HyprLex/]</u>.
| + | |
| − | |-
| + | |
| − | |'''T
| + | |
| − | <nowiki>[</nowiki>M<nowiki>]</nowiki>'''||'''[vPrism-http://www.lessonlab.com/vprism/] ([vPrism@lessonlab.com-mailto:vPrism@lessonlab.com])'''
| + | |
| − | vPrism is commercial Macintosh software for time-aligned annotation and coding of video, intended for use in educational and behavioral research.
| + | |
| − | |-
| + | |
| − | |''' '''
| + | |
| − | |-
| + | |
| − | |}
| + | |
| − | ''Last update: 5 December 2001
| + | |
| − | [Steven Bird-http://www.ldc.upenn.edu/sb], [sb@ldc.upenn.edu-mailto:sb@ldc.upenn.edu] ''
| + | |
These are no longer maintained. Used with permission.