This speech dataset contains both read and spontaneous speech, recorded in Kenya with native Swahili speakers. In total it comprises 27 hours 31 minutes 50 seconds of speech from 26 speakers (19 female and 7 male). Of this, 26 hours 32 minutes 37 seconds are read speech and 59 minutes 13 seconds are spontaneous speech. The recordings are in the following audio format: .wav, 16-bit, 16 kHz, mono, little-endian.

Each audio file has a corresponding transcript. For example, the audio file tweet_5701.wav in the audios folder corresponds to the transcript file tweet_5701.txt in the transcripts folder.

The dataset also includes a phonelist file, kencorpus.phone, which lists all the Swahili phones used by KenCorpus. The phonelist was used to create the KenCorpus Swahili lexicon-phone dictionary, kencorpus.dic, which maps every word in the KenCorpus transcripts to its pronunciation in terms of those phones. The lexicon-phone dictionary contains over 30,000 words.

Acknowledgement of data curators: Dorcas Awino, Dr. Benard Okal, Khalid Kitito, Owiny Japheth Otieno
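The pairing of audio files and transcripts by shared file name, and the stated audio format, can be checked with a short script. The following is a minimal sketch using only the Python standard library, not an official KenCorpus tool. It assumes the audios and transcripts folders sit side by side under one dataset root; the function names are hypothetical and chosen for illustration.

```python
import wave
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Return (wav_path, txt_path) pairs that share a file stem.

    Assumes <root>/audios/*.wav and <root>/transcripts/*.txt,
    e.g. audios/tweet_5701.wav <-> transcripts/tweet_5701.txt.
    """
    root = Path(root)
    pairs = []
    for wav_path in sorted((root / "audios").glob("*.wav")):
        txt_path = root / "transcripts" / (wav_path.stem + ".txt")
        if txt_path.exists():
            pairs.append((wav_path, txt_path))
    return pairs


def matches_described_format(wav_path):
    """Check for 16-bit, 16 kHz, mono audio as stated in the description."""
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getsampwidth() == 2         # 2 bytes per sample = 16 bits
            and wav.getframerate() == 16000  # 16 kHz sampling rate
            and wav.getnchannels() == 1      # mono
        )


def duration_seconds(wav_path):
    """Return the length of a recording in seconds."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Summing duration_seconds over all paired files gives a total that can be compared against the durations stated in the description. Audio files without a matching transcript are skipped by pair_audio_with_transcripts rather than reported, which may or may not suit a given use.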