
CMU Wilderness Multilingual Speech Dataset (Paper)

This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset of over 700 different languages providing audio, aligned text, and word pronunciations. On average, each language provides around 20 hours of sentence-length transcriptions. We describe our multi-pass alignment techniques and evaluate the results by building speech synthesizers on the aligned data. Most of the resulting synthesizers are good enough for deployment and use. The tools to do this work are released as open source, and instructions on how to apply such alignment to novel languages are given.

