Corpus of Spontaneous Japanese (CSJ)

Full Official Name: Corpus of Spontaneous Japanese (CSJ)
Submission date: Oct. 2, 2023, 3 p.m.

The "Corpus of Spontaneous Japanese" (CSJ) is a database containing a large collection of Japanese spoken-language data and accompanying information for use in linguistic research. It was jointly developed by NINJAL, NICT and the Tokyo Institute of Technology, and is world-class in both the quantity and quality of the available data. The corpus has been used for a wide variety of research purposes, including spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.

The whole CSJ contains about 650 hours of spontaneous speech, corresponding to about 7 million words. All speech material was recorded using head-worn close-talking microphones and DAT, then down-sampled to 16 kHz at 16-bit resolution. The speech material is transcribed using a two-way transcription scheme designed especially for CSJ. In addition, POS (part-of-speech) analysis based on two different kinds of 'word' is applied to the whole corpus.

Recorded speech is transcribed in two different ways, orthographic and phonetic:
- In "orthographic" transcription, speech is transcribed using Kanji (Chinese logographs) and Kana (Japanese syllabary), just like ordinary Japanese text. Unlike ordinary Japanese writing, however, the orthographic transcription follows rigorous rules about the usage of Kanji and Kana. In ordinary text, for example, there are more than five ways of writing the phonemic string /hanasiai/ ("meeting") using Kanji and Kana, but in the CSJ orthographic transcription only one is allowed.
- "Phonetic" transcription is written exclusively in Kana so that the phonetic details of the utterance being transcribed can be traced.

CSJ has a proper subset, called the Core, which contains about 500k words or 45 hours of speech. The Core is the part of CSJ on which annotation effort is concentrated. In addition to the two-way transcription and two-way POS analysis, segment labels, intonation labels, and other miscellaneous annotations are provided for the Core.
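
The size figures quoted in the description can be turned into rough planning numbers. The following is a minimal sketch, not part of the CSJ distribution: it derives an approximate speaking rate, the Core's share of the corpus, and the uncompressed PCM size implied by 16 kHz, 16-bit audio. The function names are hypothetical, the channel count (mono) is an assumption not stated in the description, and all inputs are the approximate values given above.

```python
# Approximate figures taken from the corpus description.
TOTAL_HOURS = 650
TOTAL_WORDS = 7_000_000
CORE_HOURS = 45
CORE_WORDS = 500_000

# Audio format from the description: 16 kHz, 16-bit samples.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1  # ASSUMPTION: mono; the description does not state this.


def words_per_hour(words: int, hours: float) -> float:
    """Average number of words per hour of speech."""
    return words / hours


def pcm_bytes(hours: float) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def core_share() -> tuple[float, float]:
    """Fraction of the whole corpus covered by the Core: (by hours, by words)."""
    return CORE_HOURS / TOTAL_HOURS, CORE_WORDS / TOTAL_WORDS


if __name__ == "__main__":
    by_hours, by_words = core_share()
    print(f"Whole corpus: {words_per_hour(TOTAL_WORDS, TOTAL_HOURS):.0f} words/hour")
    print(f"Core:         {words_per_hour(CORE_WORDS, CORE_HOURS):.0f} words/hour")
    print(f"Core share:   {by_hours:.1%} of hours, {by_words:.1%} of words")
    print(f"PCM size:     {pcm_bytes(TOTAL_HOURS) / 1e9:.2f} GB (whole), "
          f"{pcm_bytes(CORE_HOURS) / 1e9:.2f} GB (Core)")
```

Under the mono assumption, 650 hours of audio comes to roughly 75 GB of raw PCM. The actual distribution size may differ depending on file format, channel layout, and the accompanying annotation files.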

Creator(s)
Distributor(s)
Right Holder(s)