African Institutions Powering African NLP
Introduction
Across the African continent, dozens of universities, labs, and institutions are working to close the gap between African languages and modern language technologies. Most African languages are still considered “low-resource” in NLP: the text and speech data needed to train models is scarce, which limits the development of speech-to-text systems, machine translation tools, and other AI-powered language applications. This is now changing. From East Africa to Southern Africa, local institutions are creating datasets, benchmarks, and tools that make African languages visible in global AI research.
Purpose of the guide
This guide highlights key institutions across Africa that are actively contributing to NLP initiatives. It serves as a roadmap for anyone interested in African language technologies, showing where to find resources, which universities and labs to engage with, and the contributions they are making.
Highlights of Some Institutions Doing Amazing NLP Projects
East Africa
Makerere University (Uganda) – Through its AI & Data Science Lab, Makerere has built text and speech datasets for Luganda, Kiswahili, Acholi, and Lumasaaba. Its collaboration with Mozilla Common Voice and Masakhane greatly expanded Luganda and Swahili voice data.
Maseno University (Kenya) – A collaboration between Maseno University, the University of Nairobi, and Africa Nazarene University produced the “KenCorpus” dataset, including parallel corpora for Swahili and Kenyan indigenous languages like Kidaw’ida, Kalenjin, and Dholuo.
University of Rwanda – TAIRI Lab – Builds capacity in responsible, inclusive AI. Current work spans AI-powered healthcare tools, climate-resilient agriculture, and air pollution forecasting.
University of Dodoma (Tanzania) – Hosts the AfriAI Lab, fostering capacity development and responsible AI research for African societal problems.
West Africa
University of Ibadan & Afe Babalola University (Nigeria) – Created the first spoken corpus for the Igbo language (IgboSynCorp), with labelled and unlabelled data for machine translation, speech-to-text, and PoS tagging.
Ashesi University (Ghana) – In partnership with Nokwary Technologies, Ashesi developed a financial inclusion speech dataset for Ghanaian languages such as Akan (Akuapem Twi, Asante Twi, Fante) and Ga.
KNUST – Responsible AI Lab (Ghana) – Works on ethical AI aligned with the Sustainable Development Goals (SDGs), bridging digital skills gaps and influencing AI policy.
RobotsMali (Mali) – A centre democratising access to robotics and AI. Co-organised IndabaX Mali in 2025 with over 400 attendees.
University of Dakar (Senegal) – Partnered with Jokalante on the KALLAAMA Project to produce audio transcriptions for Wolof, Pular, and Sérèr.
IFRA Nigeria – Leads corpus-based studies on Nigerian languages, including NaijaSynCor (Naija Pidgin), Naija Archives, and CORPAFROAS Afroasiatic Spoken Corpus.
Central Africa
University of Zambia (Zambia) – Contributed to the Bemba Image Grounded Conversations (BIG-C) dataset, containing multi-turn dialogues in the Bemba language.
Southern Africa
North-West University – Centre for Text Technology (CTexT, South Africa) – Leads the Autshumato Project, which develops parallel corpora for machine translation. Also produced spelling checkers for 10 official SA languages, MT systems, OCR, PoS tools, and other core technologies in partnership with SADiLaR.
University of Pretoria (AI4D African Languages Lab) – Focuses on bias, inclusivity, data quality, and capacity building for African-language AI solutions. Works closely with the Data Science for Social Impact Lab.
North Africa
Al Akhawayn University (Morocco) – Released the Moroccan Darija Offensive Language Detection Dataset, a 20,402-sentence, human-labelled corpus (in Latin and Arabic scripts) drawn from Twitter and YouTube comments, providing a much-needed resource for offensive-language and sentiment analysis research in Moroccan Darija.
Institutions and Their Key Resources
| Region | Institution | Key Resources | Languages |
| --- | --- | --- | --- |
| East Africa | Makerere University (Uganda) | Speech & text datasets; Mozilla Common Voice contributions | Luganda, Kiswahili, Acholi, Lumasaaba |
| East Africa | Maseno University (Kenya) + University of Nairobi + Africa Nazarene University | KenCorpus – parallel corpora for Kenyan indigenous languages | Swahili, Kidaw’ida, Kalenjin, Dholuo |
| East Africa | University of Rwanda – TAIRI Lab | Capacity building & responsible AI projects in healthcare, agriculture, and air quality | Multiple |
| East Africa | University of Dodoma (Tanzania) – AfriAI Lab | Capacity development and responsible AI research for African societal problems | Multiple |
| West Africa | University of Ibadan & Afe Babalola University (Nigeria) | IgboSynCorp – first spoken corpus for Igbo | Igbo |
| West Africa | Ashesi University (Ghana) + Nokwary Technologies | Financial inclusion speech dataset | Akan (Akuapem Twi, Asante Twi, Fante), Ga |
| West Africa | KNUST – Responsible AI Lab (Ghana) | Ethical AI aligned with SDGs; bridging the digital skills gap | Multiple |
| West Africa | RobotsMali (Mali) | IndabaX Mali + democratising robotics & AI | Multiple |
| West Africa | University of Dakar (Senegal) + Jokalante | KALLAAMA Project – audio transcriptions | Wolof, Pular, Sérèr |
| West Africa | IFRA Nigeria | Corpus-based studies: NaijaSynCor (Naija Pidgin), Naija Archives, CORPAFROAS (Afroasiatic Spoken Corpus) | Naija Pidgin, Afroasiatic languages |
| Central Africa | University of Zambia (Zambia) | BIG-C – Bemba Image Grounded Conversations dataset | Bemba |
| Southern Africa | North-West University – CTexT (South Africa) | Autshumato Project – parallel corpora, MT systems, OCR, PoS, spelling checkers | 10 official SA languages |
| Southern Africa | University of Pretoria – AI4D African Languages Lab | Bias, inclusivity & data quality research; collaboration with Data Science for Social Impact Lab | Multiple |
| North Africa | Al Akhawayn University (Morocco) | Moroccan Darija Offensive Language Detection Dataset – 20,402 labelled sentences from Twitter & YouTube | Moroccan Darija |