Wenzhou Spoken Corpus
Department of Linguistics, University of Alberta
Jingxia Lin and John Newman
ABOUT THE CORPUS
The Wenzhou Spoken Corpus (WSC) has been developed by Jingxia Lin and John Newman in
the Department of Linguistics, University of Alberta, with
technical support from the Text Analysis for Research Portal (TAPoR) team.
WSC is an online, searchable corpus of transcribed spoken Wenzhou data, consisting of six sub-corpora:
Face to Face Conversation, Phone Call, Wenzhou News Commentary, Internet Chat, Story and Wenzhou Song.
Most of the conversational data was collected in downtown Wenzhou and Yueqing city, from 2004 to the present.
Spoken forms that lack a conventional character representation
have been transcribed phonetically. The files are marked up in XML.
The current corpus (Version 1.0) consists of about 150,000 words and is continually being expanded.
The addition of a statistical package is planned.