MHATLex is a new, enhanced lexical resource for the automatic processing of written and spoken French. It is derived from BDLex (see ELRA-S0004).
It contains three levels of representation:
- Syntactic level: S
- Phonological word level: W
- Phonetic level: P
At the W level, a word has two representations:
- input representation (W representation) where words are simply imported from the lexicon,
- output representation (W' or phonotypical) where words have the phonotypical representation imposed by their context in the sentence.
The lexicons contain inflected forms, which include the canonical forms.
Lexicon sizes:
- MHATLexSt (and BDLex), MHATLexW: about 50,000 canonical entries and 440,000 inflected entries
- MHATLexW': about 81,000 canonical entries and 854,000 inflected entries
Words are represented with their orthography, pronunciation, morpho-syntactic features, and frequency indicator.
Only the pronunciation-related part differs from one lexicon to another (unless the user generates a custom lexicon by skipping some features).
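As an illustration of the entry structure described above, the following minimal Python sketch models an entry with the four kinds of information listed (orthography, pronunciation, morpho-syntactic features, frequency indicator) and shows how a custom lexicon might drop some of them. The class name `LexEntry`, the function `skip_fields`, the field names, and the sample values are all hypothetical; they do not reflect the actual MHATLex file format or notation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LexEntry:
    """Hypothetical in-memory view of one lexicon entry (not the real format)."""
    orthography: str
    pronunciation: str               # the only part that varies between lexicons
    features: Optional[str] = None   # morpho-syntactic features
    frequency: Optional[str] = None  # frequency indicator


# Fields that a user-generated lexicon may leave out (an assumption).
OPTIONAL_FIELDS = {"features", "frequency"}


def skip_fields(entry: LexEntry, *names: str) -> LexEntry:
    """Return a copy of `entry` with the named optional fields blanked."""
    unknown = set(names) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"cannot skip fields: {sorted(unknown)}")
    return replace(entry, **{name: None for name in names})


# Placeholder values, invented for illustration only.
entry = LexEntry("chats", "<pronunciation>", "<features>", "<frequency>")
slim = skip_fields(entry, "frequency")
```

Keeping the entry immutable and returning a copy means the full entry stays available while reduced lexicons are derived from it.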
Four lexicons can be generated from MHATLex:
- MHATLexW: the central lexical resource, from which the other lexicons are generated.
- MHATLexW' (or MHATLexPht): gives the word representations for each relevant context.
- MHATLexSt: uses a standard, simplified format for the pronunciation.
- BDLex (or BDLex50): already distributed by ELDA (ELRA-S0003 and S0004). The current BDLex, derived from MHATLexW, contains some updates.
The MHATLex package includes BDLex (S0004: http://www.elda.org/catalogue/en/speech/S0004.html).
Integrity checks were carried out, and the lexicon was parsed with the nsgmls SGML parser.
For more information: http://www.irit.fr/ACTIVITES/EQ_IHMPT/ress_ling/accueil01.php