Kiswahili TTS Dataset

The dataset contains Kiswahili text and audio files. The dataset contains 7,108 text files and audio files. The Kiswahili dataset was created from an open-source non-copyrighted material: Kiswahili audio Bible. The authors permit use for non-profit, educational, and public benefit purposes. The downloaded audio files length was more than 12.5s. Therefore, the audio files were programmatically split into short audio clips based on silence. They were then combined based on a random length such that each eventual audio file lies between 1 to 12.5s. This was done using python 3. The audio files were saved as a single channel,16 PCM WAVE file with a sampling rate of 22.05 kHz The dataset contains approximately 106,000 Kiswahili words. The words were then transcribed into mean words of 14.96 per text file and saved in CSV format. Each text file was divided into three parts: unique ID, transcribed words, and normalized words. A unique ID is a number assigned to each text file. The transcribed words are the text spoken by a reader. Normalized texts are the expansion of abbreviations and numbers into full words. An audio file split was assigned a unique ID, the same as the text file.

Link

LANGUAGES

TASKS

text to speech