Wenzhou Spoken Corpus


Department of Linguistics, University of Alberta

Jingxia Lin and John Newman


The Wenzhou Spoken Corpus (WSC) has been developed by Jingxia Lin and John Newman in the Department of Linguistics, University of Alberta, with technical support from the Text Analysis Portal for Research (TAPoR) team. WSC is an online, searchable corpus of transcribed spoken Wenzhou data. It consists of six subcorpora: Face to Face Conversation, Phone Call, Wenzhou News Commentary, Internet Chat, Story, and Wenzhou Song. Most of the conversational data has been collected in downtown Wenzhou and in the city of Yueqing since 2004. Spoken forms that have no conventional character representation are given in phonetic transcription. The files are marked up in XML. The current corpus (Version 1.0) contains about 150,000 words and is continually being expanded. A statistical package is planned as a future addition.
