EnToFrNE - a Parallel English-French Lexicon of Named Entities

Full Official Name: EnToFrNE - a Parallel English-French Lexicon of Named Entities
Submission date: Sept. 10, 2019, 2:25 p.m.

In any text document, some terms refer to specific entities; they are more informative than ordinary words and have a unique context. These terms are known as named entities: they represent real-world objects such as people, places and organizations. They are usually denoted by proper names and can be abstract or have a physical existence. Examples include United States of America, Paris, Google, Mercedes Benz and Microsoft Windows, as well as anything else that can be named. Certain natural terms, such as biological species and substances, are sometimes considered named entities but are not included in the lexicon. The lexicon consists of 1,167,263 parallel named entities in English and French.

Classification

Named entities in the lexicon are tagged. The tags used are: PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. Each named entity belongs to at least one of these classes. The classes comprise:

PERSON: humans, gods, saints, fictional characters
ORGANIZATION: political organizations, companies, schools, rock bands, sport teams
LOCATION: geographical terms, fictional places, cosmic terms
PRODUCT: industrial products, software products, weapons, art works, documents, concepts, standards, laws, formats, anthems, algorithms, journals, coats of arms, platforms, websites
MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, battles, competitions, diseases, breeds, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks

Each of the 1,167,263 entries in the lexicon is assigned at least one tag.
The distribution of tags is as follows:

PERSON: 387,676
ORGANIZATION: 107,865
LOCATION: 309,533
PRODUCT: 149,137
MISC: 247,655

The total number of tags, 1,201,866, is slightly higher than the number of entries because some named entities belong to more than one class. For example, Tom Sawyer is tagged as both PRODUCT (the title of the novel) and PERSON (the character from the novel).

Evaluation

The tagging was evaluated with two metrics common in information retrieval: precision and recall. Precision is the percentage of assigned tags that are correct. Recall is the percentage of all relevant tags that the algorithm assigned correctly. The two can be combined into a single performance measure, the F1-score, which is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall).

To evaluate the tagging, a random sample of 1,000 entries was extracted from the lexicon. The sample entries were tagged manually and then compared with the tags assigned by the algorithm. Precision ranges from 0.94 for ORGANIZATION to 0.99 for PERSON. Recall is slightly lower, ranging from 0.83 for PRODUCT and MISC to 0.97 for PERSON. The higher precision values show that the tagging algorithm was adjusted to tag named entities correctly rather than to extract more named entities for the lexicon.

Formats

The lexicon comes in two formats: csv and xml. The first row of the csv file is a header row, and tab is used as the field separator. The column titles are: en, fr, PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. The following rows contain the data: the English name, the French name and five digits, each 0 or 1, indicating which classes the named entity belongs to. The structure of the xml file is similar: the column names from the csv file serve as element names.
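
The tab-separated layout described for the csv file can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the function names read_lexicon and tag_distribution are hypothetical, the two sample rows are made up for illustration in the described column layout (they are not copied from the lexicon), and quoting is assumed to be absent from the file.

```python
import csv
import io
from collections import Counter

# Tag columns, in the order given for the csv header row.
TAGS = ["PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "MISC"]


def read_lexicon(handle):
    """Yield (english, french, tags) for each data row of a tab-separated file.

    Assumption: fields are not quoted, so quote characters are read literally.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # A tag applies when its column holds the digit 1.
        tags = [tag for tag in TAGS if row[tag] == "1"]
        yield row["en"], row["fr"], tags


def tag_distribution(entries):
    """Count how many entries carry each tag."""
    counts = Counter()
    for _english, _french, tags in entries:
        counts.update(tags)
    return counts


# Illustrative in-memory sample; the rows are invented for this sketch.
SAMPLE = (
    "en\tfr\tPERSON\tORGANIZATION\tLOCATION\tPRODUCT\tMISC\n"
    "Tom Sawyer\tTom Sawyer\t1\t0\t0\t1\t0\n"
    "United States of America\tÉtats-Unis d'Amérique\t0\t0\t1\t0\t0\n"
)

entries = list(read_lexicon(io.StringIO(SAMPLE)))
distribution = tag_distribution(entries)
print(entries[0])
print(distribution)
```

To process the actual file, the same function can be given a file object opened in text mode with newline="" as the csv module recommends; the file's encoding is not stated in this description, so UTF-8 would be an assumption to verify. Because an entry may carry several tags, the sum of the per-tag counts can exceed the number of rows, which matches the difference between 1,201,866 tags and 1,167,263 entries reported above.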

Right Holder(s)