CESS-ESP and CESS-CAT TREEBANK

The Universitat de Barcelona (CLiC-UB), the Universidad de Alicante
(UA), the Universitat Politècnica de Catalunya (UPC), and the Euskal
Herriko Unibertsitatea (EHU-UPV) are the sole and exclusive owners of
the CESS-Esp treebank.  Information about this corpus can be found at:
http://www.lsi.upc.edu/~mbertran/cess-ece2/

The goal of the CESS-ECE project is to create three corpora: one for
Spanish (CESS-Esp), one for Catalan (CESS-Cat), and one for Basque
(CESS-Eus), with 500,000 words each for CESS-Esp and CESS-Cat and
350,000 words for CESS-Eus. These corpora will be tagged in two ways:
syntactically (with constituents and functions for CESS-Esp and
CESS-Cat, and with dependencies for CESS-Eus) and semantically (with
WordNet synsets). The project builds on resources from the 3LB Project
(FIT 150500-2002-244), in which 100,000 words per language were
annotated in the same way.

The versions distributed with NLTK are syntactic treebanks (with
constituents and functions) consisting of 1377 files for Catalan and
610 for Spanish. Each token also carries a POS tag and a lemma; the
fairly complete tagset is documented on the project's website.

A sample tree for Spanish:

(
 (S
   (snp-SUJ
     (espec.ms
       (da0ms0 El el))
     (grup.nom.ms
       (ncms000 púgil púgil)
       (s.a.ms
         (grup.a.ms
           (aq0cs0 estadounidense estadounidense)))
       (snp
         (grup.nom.ms
           (np0000p Will_"Steel"_Grigsby Will_"Steel"_Grigsby)))))
   (grup.verb
     (vmis3s0 conquistó conquistar))
   (sn-CCT
     (espec.fs
       (dd0fs0 esta este))
     (grup.nom.fs
       (ncfs000 tarde tarde)))
....
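
The leaf format in the sample above is (tag word lemma). As a minimal
sketch, assuming only that format, the following standalone Python
code reads a complete bracketed fragment and lists its (word, tag,
lemma) triples. The names parse_bracketed and leaf_triples are
illustrative and are not part of NLTK or of the CESS-ECE tools; the
two fragments are copied from the sample tree.

```python
import re


def parse_bracketed(text):
    """Parse one complete bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # The outermost wrapper in the sample has no label of its own.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def leaf_triples(tree):
    """Yield (word, tag, lemma) for every leaf of the form (tag word lemma)."""
    label, children = tree
    if len(children) == 2 and all(isinstance(c, str) for c in children):
        yield (children[0], label, children[1])
        return
    for child in children:
        if not isinstance(child, str):
            yield from leaf_triples(child)


# Two complete fragments taken from the sample tree.
verb_group = "(grup.verb (vmis3s0 conquistó conquistar))"
time_phrase = """
(sn-CCT
  (espec.fs
    (dd0fs0 esta este))
  (grup.nom.fs
    (ncfs000 tarde tarde)))
"""

for fragment in (verb_group, time_phrase):
    for word, tag, lemma in leaf_triples(parse_bracketed(fragment)):
        print(word, tag, lemma)
```

The sketch expects balanced parentheses, so a truncated tree such as
the sample as printed here would need its remaining lines first.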

In the Spanish corpus, files whose names begin with a letter come from
the original 3LB sample, and that letter indicates the genre:

       A       press: articulistas          PRESS: OPINION
       E       press: ensayo                PRESS: ESSAY
       C       press: suplementosCiencia    PRESS: SCIENCE SUPPLEMENT
       D       press: prensa deportiva      PRESS: SPORTS
       N       press: noticias              PRESS: NEWS
       R       press: semanarios            PRESS: WEEKLIES
       T       fiction: narrativa           FICTION: NARRATIVE
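
The genre table can be applied to file names with a small lookup. This
is only a sketch: the helper genre_of and the two example file names
are hypothetical, and the lookup ignores letter case on the assumption
that the initial may appear in either case.

```python
# Genre codes from the table above, keyed by the initial letter.
GENRES = {
    "A": "PRESS: OPINION",
    "E": "PRESS: ESSAY",
    "C": "PRESS: SCIENCE SUPPLEMENT",
    "D": "PRESS: SPORTS",
    "N": "PRESS: NEWS",
    "R": "PRESS: WEEKLIES",
    "T": "FICTION: NARRATIVE",
}


def genre_of(file_name):
    """Return the genre for a file name, or None if its initial is not a listed letter."""
    return GENRES.get(file_name[:1].upper())


# Hypothetical file names, used only for illustration.
print(genre_of("t1-3.tbf"))
print(genre_of("10017_20000413.tbf"))
```

Files whose names do not start with one of the listed letters are
outside the 3LB sample and get no genre from this lookup.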

The CESS-ECE corpora should allow both grammar inference for syntactic
parsing and the training of machine learning systems for word sense
disambiguation.

If you use these corpora for research, please cite them as follows:
CESS-Cat project (M. Antonia Martí, Mariona Taulé, Lluís Márquez,
Manuel Bertran (2007) "CESS-ECE: A Multilingual and Multilevel
Annotated Corpus" in
http://www.lsi.upc.edu/~mbertran/cess-ece/publications).
