ISLRN

MADCAT Phase 1-3 Composite Evaluation Set

Full Official Name: MADCAT Phase 1-3 Composite Evaluation Set

Submission date: May 5, 2026, 6:36 p.m.

**Introduction**

MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phases 1-3 Composite Evaluation Set (LDC2026T05) contains the evaluation data created by the Linguistic Data Consortium (LDC) to support Phases 1-3 of the DARPA MADCAT Program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations, with content and annotation layers integrated in a single MADCAT XML output.

The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents.

**Data**

Arabic source documents were collected by LDC in three genres: newswire, weblog and newsgroup text. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some source documents being separated into multiple pages for handwriting. Each resulting handwritten page was assigned to up to three independent scribes using different writing conditions.

The handwritten, transcribed documents were checked for quality and completeness; then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.

In the final step, a unified data format was produced consisting of the source text, tokenization and sentence segmentation; an image layer of bounding boxes; a scribe demographic layer containing scribe ID and partition (train/test); and a document metadata layer.

This release includes 1,643 scanned images in TIFF format and their corresponding annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and .madcat.xml). GEDI XML files contain ground truth annotations.

**Sponsorship**

This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program No. HR0011-08-1-004 and GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

**Updates**

No updates at this time.

**Copyright**

Portions © 2007-2008 Agence France Presse, Al-Ahram, Al Hayat, Al Quds-Al Arabi, An Nahar, Asharq Al-Awsat, Assabah, Xinhua News Agency, © 2007-2013, 2026 Trustees of the University of Pennsylvania

Creator(s)

David Lee

Safa Ismael

Dave Doermann

Stephanie Strassel

Song Chen

Stephen Grimes

Distributor(s)

Linguistic Data Consortium

Right Holder(s)

Linguistic Data Consortium

Status : Accepted

ISLRN :

604-223-719-294-7

Version

1.0

Source

https://catalog.ldc.upenn.edu/LDC2026T05

Resource Type

Primary Text

Media Type

Image

Text

Language(s)

Arabic

Access Medium

Web Download