Launched at the end of 1995, the AMARYLLIS project aimed at evaluating information retrieval software for French text corpora in order to provide a methodology for the evaluation of other similar tools. AMARYLLIS was organised by the Institut de l'Information Scientifique et Technique (INIST) with the support of the Agence francophone pour l'enseignement supérieur et la recherche (AUPELF-UREF) and the French Ministère de l'Education Nationale, de la Recherche et de la Technologie (MERT).
More specifically, the objective was to create corpora of documents, questions and answers within the framework of the Action de Recherche Concertée (ARC A1, later renamed "Amaryllis: Access to text information in French"), in order to carry out work similar to the United States TREC project.
All corpora are structured as SGML files with ISO Latin character encoding.
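As a minimal sketch of how such a file might be read, the Python function below decodes raw bytes into text. It assumes that "ISO Latin" means ISO 8859-1 (Latin-1), which the description does not state explicitly, and the function name is hypothetical.

```python
def decode_corpus_bytes(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode the raw bytes of one SGML corpus file.

    Assumption: the corpus uses ISO 8859-1 (Latin-1). Change `encoding`
    if the files turn out to use another ISO Latin variant.
    """
    return raw.decode(encoding)


# Accented French characters are single bytes in Latin-1.
sample = b"\xe9t\xe9"  # the bytes for "été" in Latin-1
text = decode_corpus_bytes(sample)
```

Decoding explicitly, rather than relying on the platform default, avoids garbled accented characters on systems whose default encoding is UTF-8.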
The available corpora were provided by:
- INIST (Institut de l'Information Scientifique et Technique)
- OFIL (Observatoire Français et International des Industries de la Langue)
- ELRA (European Language Resources Association)
Each provider supplied three types of corpora: text documents, search topics, and answers to these topics in the corresponding text corpora (with frames of reference for the answers).
1- Text documents in French
The text documents in French comprise:
- Articles (titles and texts) extracted from the newspaper "Le Monde", provided by OFIL; each batch contains three months of documents (01-01-93/31-03-93, 01-04-93/30-06-93),
- Titles and summaries of scientific articles covering every domain, taken from the Pascal (1984 to 1995) and Francis (1992 to 1995) bibliographic databases, provided by INIST.
The tagging of the documents conforms to a simplified version of a DTD from the TEI, which makes it possible to handle the logical structure of the documents.
2- Multilingual text documents
The multilingual text documents were provided by ELRA and comprise documents in 6 languages (French, English, Italian, Spanish, German and Portuguese), extracted from the parallel corpus MLCC, which contains documents translated into the official European languages (from 1992 to 1994). The corpus is divided into two sub-corpora: written questions (10 million words) and debates of the European Parliament (5 to 8 million words per language).
3- Search topics
The topics derive from questions asked by end users, and should contain all the information necessary to understand the issue they deal with and to assess relevance. They comprise the following items:
- A domain: the field of knowledge the topic belongs to,
- A topic: a title defining the subject,
- A question: the question the user may ask,
- Complementary information: details on further documents that should be selected from the corpus,
- Concepts: a set of descriptors used to set the limits of the search.
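The five items listed above can be pictured as a simple record. The Python sketch below is only an illustration: the class name, the field names and the sample values are hypothetical and are not taken from the SGML tagging of the actual topic files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchTopic:
    """One search topic; all field names are hypothetical."""
    number: int                      # topic number, as referenced in the answer files
    domain: str                      # field of knowledge
    topic: str                       # title defining the subject
    question: str                    # question the user may ask
    complementary_information: str   # details on further documents to select
    concepts: List[str] = field(default_factory=list)  # descriptors limiting the search


# Invented example values, for illustration only.
example = SearchTopic(
    number=1,
    domain="Economy",
    topic="Example title",
    question="Example question?",
    complementary_information="Example details.",
    concepts=["descriptor one", "descriptor two"],
)
```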
The topics were built by OFIL, by documentalists working for Le Monde who used requests from journalists, and by engineers responsible for documentation at INIST (experts in their domain) who used requests from end users. These topics were intended to cover numerous application fields and to yield a large number of relevant results in each corpus. The topics were tested on the corpora to check their relevance; where necessary, the query was modified or further details were added.
4- Frames of reference for the answers
The answer files contain, for each numbered topic, the numbers of all relevant documents. Some frames of reference for the answers were established before the participants proceeded to the tests. The answers were selected by the providers (OFIL and INIST) with the appropriate methodology and adequate tools (initial frames of reference): they made as broad a pre-selection of documents as possible, based not only on titles and summaries but also on the keywords and classification codes used in the Pascal and Francis databases. These keywords and classification codes cannot be accessed by the participants. The results (a set of documents) were then sorted manually, so as to keep the documents that best match the query.
The initial frames of reference were checked manually by the providers (INIST and OFIL), using the answers given by the participants. These answers were collected when the tests were finished. This made it possible to review and correct the frames of reference for the answers and to make their content more detailed. The illustration below shows how the review was performed.
Each of the 4 CDs contains a corpus for the two phases of the two campaigns which took place.
TrecEval is also provided.
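To show how a frame of reference of this kind can be used, the Python sketch below compares the documents returned by a system for one topic with the relevant documents listed for that topic. It computes standard set-based precision and recall. The function name, the dictionary layout and the document numbers are hypothetical, and the sketch does not describe the actual answer file format or any particular evaluation tool.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one topic.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical frame of reference: topic number -> numbers of relevant documents.
frame_of_reference = {1: {"d1", "d2", "d3", "d4"}}

# Hypothetical system output for topic 1.
system_answers = ["d1", "d5", "d2"]

p, r = precision_recall(system_answers, frame_of_reference[1])
# Two of the three retrieved documents are relevant, and two of the four
# relevant documents were found.
```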