Wolverhampton Business English Corpus

Full Official Name: Wolverhampton Business English Corpus
Submission date: Jan. 24, 2014, 4:33 p.m.

The WBE was created by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335). A survey of electronic language resources in the business domain carried out at Wolverhampton revealed that very few business corpora exist, and almost none of them are widely accessible. There is significant demand for a business corpus from both the NLP community and the pedagogic community (teachers and students of language, business communication, and linguistics).

The Wolverhampton Corpus of Written Business English is:
- A synchronic corpus: it includes only texts available on the web during a 6-month period in 1999-2000.
- A monolingual English corpus: it comprises only texts written in English, but no restriction was applied as regards the variety of English used. On the contrary, the WBE deliberately tried to capture a wide range of varieties of English by including documents from websites in Britain, USA, Pakistan, Netherlands, Belgium, Switzerland, Hong Kong, etc.
- A written corpus: it contains only written materials. However, a few of the documents are transcripts of speeches.
- A business corpus: the texts were selected manually, and care was taken to ensure that all the texts were from the business domain.

The corpus consists of 10,186,259 words from 23 different websites. The data can contribute to a wide range of NLP tasks, including information retrieval, information extraction, summarisation, etc.

The WBE was built using materials solely from the Web. However, this does not mean that the corpus gives access only to a restricted range of categories of texts. On the contrary, the amount of information available online allowed us to select from a wide variety of categories.
These range from product descriptions, company press releases, and annual financial reports to business journalism, academic research papers, political speeches and government reports. The texts have been grouped according to the source site.

The corpus is distributed in three formats:
- The original encoding of the text. The majority of the texts are in HTML or plain text format; a few are in PDF or Microsoft Word DOC format.
- Plain text. Files that were not already in plain text format were converted automatically and then checked manually.
- SGML encoded files, using the Corpus Encoding Standard (http://www.cs.vassar.edu/CES/). The header of each file provides information about the title of the file, length in words, etc. Paragraph and sentence boundaries and part-of-speech tags for each word are marked using SGML tags.

All the available files were converted to 8-bit ASCII format using ISO 8859-1. Characters with ASCII codes from 127 to 255 (also known as Extended ASCII) were manually checked to ensure that they were represented correctly. The corpus was checked for spelling errors, but special care was taken to ensure that variant spellings specific to the business domain were not wrongly corrected.

Validation was carried out by an external validator. It consisted of checking text files, tools, tagging and documentation.
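
The character check described above can be approximated in code. The following is a minimal sketch, not the procedure used by the corpus builders: it assumes the text is available as raw ISO 8859-1 bytes, and the function name find_extended_chars and the sample sentence are hypothetical, introduced only for illustration. It lists every character whose code falls in the 127 to 255 range so that it can be reviewed by hand.

```python
from collections import Counter


def find_extended_chars(data: bytes, low: int = 127, high: int = 255) -> Counter:
    """Count bytes in [low, high], keyed by their ISO 8859-1 character.

    Hypothetical helper: the range defaults follow the 127 to 255
    interval mentioned in the corpus description.
    """
    counts = Counter()
    for byte in data:
        if low <= byte <= high:
            # ISO 8859-1 maps each byte value to the code point of the same number.
            counts[bytes([byte]).decode("latin-1")] += 1
    return counts


# Illustrative sample only; it is not taken from the corpus.
sample = "The café sold 20 shares at £5 each.".encode("latin-1")

for char, n in sorted(find_extended_chars(sample).items()):
    print(f"{char!r} (code {ord(char)}): {n} occurrence(s)")
```

To run the same check on a corpus file, the bytes could be read with open(path, "rb").read() and passed to the function; the file path is left to the user because the distribution layout is not described here.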

Right Holder(s)