ISLRN

Danish Gigaword Corpus

Full Official Name: Danish Gigaword Corpus

Submission date: Jan. 28, 2022, 3:16 p.m.

The Danish Gigaword Project (DAGW) maintains a corpus of Danish text containing over a billion words. The general goals are to create a dataset that is:

1. representative;
2. accessible;
3. a suitable common starting point for Danish NLP models.

The present version, 1.0, was collected from various websites. Domains are distributed as follows:

- Legal: 308.8 million words
- Social Media: 261.4 million words
- Subtitles: 130.1 million words
- Debates: 108.4 million words
- Conversations: 0.7 million words
- Web: 101.02 million words
- Encyclopedia: 55.6 million words
- Literature: 31.3 million words
- Manuals: 2.6 million words
- Books: 2.1 million words
- Religion: 0.6 million words
- News: 40 million words
- Other: 1.2 million words

Data is presented in plaintext, UTF-8 encoded, one file per document. Accompanying metadata gives information about, among others, the author, the time or location of the document's creation, and an API hook for re-retrieval of the document.
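As a rough cross-check of the "over a billion words" figure, the per-domain counts given in the description can be summed. The sketch below is not part of the record: the names `DOMAIN_MILLION_WORDS` and `total_million_words` are illustrative, and the numbers are copied from the description, with the 600k Religion figure written as 0.6 million.

```python
# Per-domain word counts in millions, copied from the description.
# The dictionary and function names are illustrative assumptions.
DOMAIN_MILLION_WORDS = {
    "Legal": 308.8,
    "Social Media": 261.4,
    "Subtitles": 130.1,
    "Debates": 108.4,
    "Conversations": 0.7,
    "Web": 101.02,
    "Encyclopedia": 55.6,
    "Literature": 31.3,
    "Manuals": 2.6,
    "Books": 2.1,
    "Religion": 0.6,  # listed as 600k words
    "News": 40.0,
    "Other": 1.2,
}


def total_million_words(counts: dict[str, float]) -> float:
    """Return the sum of all domain counts, in millions of words."""
    return sum(counts.values())


if __name__ == "__main__":
    total = total_million_words(DOMAIN_MILLION_WORDS)
    print(f"Total: {total:.2f} million words")
    # Share of each domain, largest first.
    for name, count in sorted(
        DOMAIN_MILLION_WORDS.items(), key=lambda item: item[1], reverse=True
    ):
        print(f"{name:>14}: {100 * count / total:5.1f}%")
```

The listed domains add up to roughly 1,044 million words, which is consistent with the stated size of over a billion.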

Creator(s)

Distributor(s)

ELRA

Right Holder(s)

Status : Accepted

ISLRN :

024-504-318-388-3

Version

1.0

Source

http://catalog.elra.info/en-us/repository/browse/ELRA-W0318

Resource Type

Primary Text

Media Type

Text

Language(s)

Access Medium