Resource: MAURDOR Evaluation Package

Reference MAURDOR Evaluation Package
Date of Submission Feb. 26, 2015, 5:52 p.m.
Status accepted
ISLRN 364-018-517-901-2
Resource Type Primary Text
Media Type Image
Source
Language Arabic, English, French
Size 27 Gb
Description

The MAURDOR project evaluates systems for the automatic processing of written documents. The collected documents are scans of printed, typewritten or handwritten originals.

To obtain images for the evaluation of automatic analysis systems, 10,000 original documents were collected and annotated (5,000 in French, 2,500 in English and 2,500 in Arabic). This package contains 8,129 of the 10,000 documents originally collected.

Each of the 8,129 documents belongs to one of the following five categories:
C1: Printed form (completed by hand)
C2: Commercial, private or professional document, printed or photocopied
C3: Handwritten private correspondence
C4: Typewritten private or professional correspondence
C5: Others

Once collected, the documents were annotated manually. This human analysis serves as the reference, known as ground truth, for the training and evaluation of automatic processing systems.

The annotations aim to answer the following questions:
1. How is the document structured (text zones, images...)?
2. What writing is present, and what are its type (handwritten/typewritten) and language (French, English, Arabic, other)?
3. What is the main information in the document (author, recipient, subject, date...)?

The MAURDOR evaluation campaign provides a common framework for reporting the current performance of systems for the automatic processing of digital documents. This package contains the material provided to the campaign participants:
- Consistent development and test data corresponding to the application concerned;
- Tools for the automatic measurement of system performance;
- A common assessment protocol applicable to each processing stage, along with a complete automatic processing chain for written documents.

The documents are provided in TIFF format and the annotations are provided in XML format.
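As a starting point for working with the package, the image and annotation files can be matched up programmatically. The following is only a minimal sketch: the function name `pair_documents` is hypothetical, and it assumes that each TIFF image and its XML annotation share the same base filename, which this description does not state. The actual directory layout and naming scheme should be checked against the documentation shipped with the package.

```python
from pathlib import Path

# Assumption: an image and its annotation share a file stem
# (e.g. "doc1.tif" and "doc1.xml"). This is NOT confirmed by the
# package description; adjust to the real layout.
TIFF_SUFFIXES = {".tif", ".tiff"}


def pair_documents(root):
    """Return (pairs, unmatched) for the TIFF and XML files found under `root`.

    pairs     -- list of (image_path, annotation_path) tuples
    unmatched -- list of image paths with no same-named XML file
    """
    root = Path(root)
    # Index every XML file by its stem for constant-time lookup.
    annotations = {p.stem: p for p in root.rglob("*.xml")}

    pairs, unmatched = [], []
    for image in sorted(root.rglob("*")):
        # Skip directories and anything that is not a TIFF file.
        if not image.is_file() or image.suffix.lower() not in TIFF_SUFFIXES:
            continue
        annotation = annotations.get(image.stem)
        if annotation is None:
            unmatched.append(image)
        else:
            pairs.append((image, annotation))
    return pairs, unmatched
```

Calling `pair_documents` on the directory where the package was unpacked returns the matched pairs together with any images that lack an annotation file, which gives a quick sanity check before running an evaluation.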

The aim of this evaluation package is to enable external parties to evaluate their own systems and compare their results with those obtained during the campaign itself.

Version 1.0
Distributor ELRA