Kencorpus: Kenyan Languages Corpus for Machine Learning and Natural Language Processing

This project collected text and speech corpora for three Kenyan languages: Kiswahili, Dholuo, and Luhya, the last represented by three dialects (Lumarachi, Lulogooli, and Lubukusu). Primary data were collected from the respective language communities and included Indigenous stories and narratives from student compositions, native-language media stations, and publishers, so that the corpus covers genres representative of everyday language use in those communities. A total of 4,442 texts were collected: 2,909 for Kiswahili, 546 for Dholuo, 483 for Lumarachi, 135 for Lubukusu, and 359 for Lulogooli. A total of 1,152 files of spontaneous speech were collected, amounting to 176 hours, 29 minutes, and 46 seconds: 104 files (19 hours, 10 minutes, 57 seconds) for Kiswahili, 512 files (99 hours, 3 minutes, 8 seconds) for Dholuo, 138 files (15 hours, 37 minutes, 46 seconds) for Lumarachi, 354 files (30 hours, 11 minutes) for Lubukusu, and 44 annotated files (12 hours, 26 minutes, 55 seconds) for Lulogooli.
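
The speech totals reported above can be checked by summing the per-language figures. The following is a minimal sketch in Python using only the numbers stated in the paragraph; the dictionary layout and the `summarize` helper are illustrative assumptions and are not part of the Kencorpus release.

```python
from datetime import timedelta

# Per-language speech figures copied from the paragraph above:
# (number of files, hours, minutes, seconds).
speech = {
    "Kiswahili": (104, 19, 10, 57),
    "Dholuo": (512, 99, 3, 8),
    "Lumarachi": (138, 15, 37, 46),
    "Lubukusu": (354, 30, 11, 0),
    "Lulogooli": (44, 12, 26, 55),
}


def summarize(figures):
    """Return (total files, (hours, minutes, seconds)) for the given figures.

    Hypothetical helper written for this illustration only.
    """
    total_files = sum(files for files, *_ in figures.values())
    total_time = sum(
        (timedelta(hours=h, minutes=m, seconds=s) for _, h, m, s in figures.values()),
        timedelta(),
    )
    # Convert the summed duration back to whole hours, minutes, and seconds.
    hours, remainder = divmod(int(total_time.total_seconds()), 3600)
    minutes, seconds = divmod(remainder, 60)
    return total_files, (hours, minutes, seconds)


total_files, duration = summarize(speech)
print(total_files, duration)  # 1152 (176, 29, 46)
```

Run as written, the sketch reproduces the reported 1,152 files and 176 hours, 29 minutes, and 46 seconds of speech.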