MASC-1.03   2010-09-19
----------------------

Full documentation for the MASC data and processing tools is
available at http://www.anc.org/MASC.

DATA DESCRIPTION
----------------
See http://www.anc.org/MASC/mascI_contents.html

CONTENTS
--------

data - 
   spoken - data and annotations for spoken data
   written - data and annotations for written data

   See http://www.anc.org/MASC/MASC_Structure.html for 
   details on the organization of MASC data and annotation files

MASC-corpus-header.xml - 
   Information about the corpus, contents, 
   organization, domain codes, naming conventions, etc.

original-annotations -
   Contains annotations contributed to MASC in their original 
   format, as prepared by the specific annotation project. These
   annotations are transduced to GrAF format for inclusion in
   MASC. In general, errors in the originals are left uncorrected.

 
RELEASE NOTES
-------------
Nature of the changes: Minor

Release 1.03 changes:

 - fixes additional misaligned Penn Treebank tokens

 - adds missing references to associated annotation files in the 
   document headers for the ICIC data

Known Problems
--------------

Penn Treebank Tokens

A few problems remain with the Penn Treebank tokenizations, primarily
because the Treebank project corrected the texts in the course of
generating the original annotations. In some cases, we have adjusted
the tokenization to refer to the faulty text segment (e.g., "o" for
"off", where ligatures were removed during transduction from the
original QuarkXPress files) and added an annotation named "corrected"
whose value provides the corrected text.
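
As a rough illustration of how a consumer might use the "corrected"
annotation, the following Python sketch prefers the corrected value
when one is present and otherwise falls back to the text span the
token refers to. The function name, the character offsets, and the
plain-dictionary representation of a token's annotations are all
hypothetical and chosen only for this example; they are not the MASC
file format or part of the MASC processing tools.

```python
def surface_form(text, start, end, features):
    """Return the string for a token spanning text[start:end].

    `features` is a hypothetical dict of annotation name -> value.
    If it carries a "corrected" entry, that value is preferred over
    the (possibly faulty) text segment the token points at.
    """
    raw = text[start:end]
    return features.get("corrected", raw)


# Illustrative text in which "off" lost its ligature and became "o".
text = "He ran o the road."

# Token over the faulty segment, with a "corrected" annotation.
print(surface_form(text, 7, 8, {"corrected": "off"}))

# Token with no correction: the raw text segment is returned.
print(surface_form(text, 3, 6, {}))
```

Keeping the raw segment as the fallback means tokens without a
"corrected" annotation are read exactly as they appear in the text.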

VOL15_3 Penn POS

Quote marks are neither tokenized nor marked with POS in the current version of
the Penn part-of-speech tags. This will be fixed in the next release.

Event co-reference

The original GATE annotations produced at Carnegie Mellon University used the 
annotation set name internal to GATE to group events that co-refer. This information 
is lost when the annotations are rendered in any format other than GATE-readable 
output. The event co-reference annotations included in MASC are therefore a
subset of those in the originals and do not include the grouping information contained
in the annotation set names. The differences can be seen by loading both the MASC event 
annotations and the original annotations into GATE. 


CONTACT
-------

MASC is a product of the ANC project. 

Email: anc@anc.org

Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 
USA


