Resource: ROCO Romanian journalistic corpus

Reference ROCO Romanian journalistic corpus
Date of Submission Nov. 30, 2015, 5:13 p.m.
Status accepted
ISLRN 312-617-089-348-7
Resource Type Primary Text
Media Type Text
Source
Language Romanian
Description

ROCO is a Romanian journalistic corpus of approximately 7.1 million tokens and 231,626 types. It is rich in proper names, numerals and named entities.

The corpus contains morphosyntactic information (MSD annotations), assigned automatically by the TTL tagger, which implements the tiered tagging methodology and has an estimated accuracy of 98%. About 20% of the MSD annotations have been manually checked, validated and, where necessary, corrected. The MSDs follow the Multext-East specifications. For Romanian there are 614 different MSDs. The tagset has been slightly modified by adding new tags for named entities.
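Multext-East MSDs are positional strings: the first character gives the part-of-speech category and each following character encodes one attribute, with "-" marking an attribute that does not apply. The sketch below illustrates how such a tag could be split. The function name split_msd and the category table are illustrative assumptions; the table does not cover the named-entity tags added for ROCO, and the meaning of each attribute position must be looked up in the Multext-East specification for Romanian.

```python
# Illustrative subset of Multext-East category codes (first MSD character).
# Assumption: the named-entity tags added for ROCO are NOT covered here.
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "T": "Article", "R": "Adverb", "S": "Adposition",
    "C": "Conjunction", "M": "Numeral", "Q": "Particle",
    "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}


def split_msd(msd):
    """Split an MSD string into (category name, [(position, code), ...]).

    Positions are 1-based and count the characters after the category
    letter; positions holding '-' (not applicable) are skipped.
    """
    if not msd:
        raise ValueError("empty MSD")
    category = CATEGORIES.get(msd[0], "Unknown")
    attributes = [
        (position, code)
        for position, code in enumerate(msd[1:], start=1)
        if code != "-"
    ]
    return category, attributes


category, attributes = split_msd("Ncfsrn")
print(category, attributes)
```

Keeping the attribute codes as (position, code) pairs rather than decoding them avoids hard-coding a value table that may differ from the modified tagset used in the corpus.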

The corpus was first segmented, then PoS-annotated and lemmatized with the TTL processing chain. It is encoded in XML, and each file includes metadata (cesHeader).
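As a rough illustration of how token-level annotations in an XML-encoded corpus of this kind might be read, the sketch below parses a small invented sample with Python's standard library. The element names (text, s, w), the attribute names (lemma, ana) and the sample sentence are assumptions made for this example only; the actual ROCO markup should be checked against the distributed files and their cesHeader.

```python
import xml.etree.ElementTree as ET

# Invented sample. The element names (text, s, w) and attribute names
# (lemma, ana) are assumptions, not taken from the ROCO distribution.
SAMPLE = """<text>
  <s id="s1">
    <w lemma="casă" ana="Ncfsry">Casa</w>
    <w lemma="fi" ana="Vmip3s">este</w>
    <w lemma="mare" ana="Afpfsrn">mare</w>
  </s>
</text>"""


def read_tokens(xml_string):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_string)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


tokens = read_tokens(SAMPLE)
# Token count versus type count (distinct lowercased word forms).
n_tokens = len(tokens)
n_types = len({form.lower() for form, _lemma, _msd in tokens})
print(n_tokens, n_types)
```

For files of realistic size, an incremental parser such as ET.iterparse would likely be preferable to loading each document into memory at once.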

Version 1.0
Distributor ELRA