Resource: The CINTIL Corpus – International Corpus of Portuguese
|Reference||The CINTIL Corpus – International Corpus of Portuguese|
|Date of Submission||Jan. 24, 2014, 4:31 p.m.|
|Resource Type||Primary Text|
|Format/MIME Type||Plain text|
CINTIL (Corpus Internacional do Português) is a linguistically interpreted written and spoken corpus of European Portuguese. It comprises one million annotated tokens, each of which has been verified by human expert annotators. The annotation covers part of speech, lemma and inflection for open-class words, multi-word expressions belonging to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).
The corpus is built from raw textual materials of several types, 30% of which are spoken. The spoken subcorpus covers several registers (ranging from formal to informal) and several communicative situations (e.g. phone calls, media broadcasts, conversations, monologues and formal expositions). The CINTIL corpus includes the transcriptions of the spoken texts but not the sound files of the recorded interviews. The remaining subcorpus consists of written texts from several genres: newspapers, books, magazines, journals and miscellaneous materials (proceedings, dissertations, pamphlets, etc.). A detailed overview of the corpus composition is presented below:
• Written = 689,124 tokens:
The annotation manual is provided together with the corpus.
The corpus can be browsed online: http://cintil.ul.pt/