African Institutions Powering African NLP

Introduction

Across the African continent, dozens of universities, labs, and institutions are working to close the gap between African languages and modern language technologies. Most African languages are considered “low-resource” in NLP: little text or speech data is available for them, which limits the development of speech-to-text systems, machine translation tools, and other AI-powered language applications. This is now changing. From East Africa to Southern Africa, local institutions are creating datasets, benchmarks, and tools that make African languages visible in global AI research.

Purpose of the guide

This guide highlights key institutions across Africa that are actively contributing to NLP initiatives. It serves as a roadmap for anyone interested in African language technologies, showing where to find resources, which universities and labs to engage with, and the contributions they are making.

Highlights of Some Institutions Doing Amazing NLP Projects

East Africa

Makerere University (Uganda) – Through its AI & Data Science Lab, Makerere has built text and speech datasets for Luganda, Kiswahili, Acholi, and Lumasaaba. Its collaboration with Mozilla Common Voice and Masakhane greatly expanded Luganda and Swahili voice data.
Maseno University (Kenya) – A collaboration between Maseno University, the University of Nairobi, and Africa Nazarene University produced the “KenCorpus” dataset, including parallel corpora for Swahili and Kenyan indigenous languages like Kidaw’ida, Kalenjin, and Dholuo.
University of Rwanda – TAIRI Lab – Builds capacity in responsible, inclusive AI. Current work spans AI-powered healthcare tools, climate-resilient agriculture, and air pollution forecasting.
University of Dodoma (Tanzania) – Hosts the AfriAI Lab, fostering capacity development and responsible AI research for African societal problems.

West Africa

University of Ibadan & Afe Babalola University (Nigeria) – Created the first spoken corpus for the Igbo language (IgboSynCorp), with labelled and unlabelled data for machine translation, speech-to-text, and PoS tagging.
Ashesi University (Ghana) – In partnership with Nokwary Technologies, Ashesi developed a financial inclusion speech dataset for Ghanaian languages such as Akan (Akuapem Twi, Asante Twi, Fante) and Ga.
KNUST – Responsible AI Lab (Ghana) – Works on ethical AI aligned with the Sustainable Development Goals (SDGs), bridging digital skills gaps and influencing AI policy.
RobotsMali (Mali) – A centre democratising access to robotics and AI. Co-organised IndabaX Mali in 2025 with over 400 attendees.
University of Dakar (Senegal) – Partnered with Jokalante on the KALLAAMA Project to produce audio transcriptions for Wolof, Pular, and Sérèr.
IFRA Nigeria – Leads corpus-based studies on Nigerian languages, including NaijaSynCor (Naija Pidgin), Naija Archives, and CORPAFROAS Afroasiatic Spoken Corpus.

Central Africa

University of Zambia (Zambia) – Contributed to the Bemba Image Grounded Conversations (BIG-C) dataset, containing multi-turn dialogues in the Bemba language.

Southern Africa

North-West University – Centre for Text Technology (CTexT, South Africa) – Leads the Autshumato Project, which develops parallel corpora for machine translation. Also produced spelling checkers for 10 official SA languages, MT systems, OCR, PoS tools, and other core technologies in partnership with SADiLaR.
University of Pretoria (AI4D African Languages Lab) – Focuses on bias, inclusivity, data quality, and capacity building for African-language AI solutions. Works closely with the Data Science for Social Impact Lab.

North Africa

Al Akhawayn University (Morocco) – Released the Moroccan Darija Offensive Language Detection Dataset, a 20,402-sentence, human-labelled corpus (in Latin and Arabic scripts) drawn from Twitter and YouTube comments, providing a much-needed resource for offensive-language and sentiment analysis research in Moroccan Darija.

Institutions and Their Key Resources

| Region | Institution / Initiative | Key Resource / Project | Languages Covered |
|---|---|---|---|
| East Africa | Makerere University (Uganda) | Speech & text datasets; Mozilla Common Voice contributions | Luganda, Kiswahili, Acholi, Lumasaaba |
| East Africa | Maseno University (Kenya) + University of Nairobi + Africa Nazarene University | KenCorpus – parallel corpora for Kenyan indigenous languages | Swahili, Kidaw’ida, Kalenjin, Dholuo |
| East Africa | University of Rwanda – TAIRI Lab | Capacity building & responsible AI projects in healthcare, agriculture, and air quality | Multiple |
| East Africa | University of Dodoma (Tanzania) – AfriAI Lab | Capacity development and responsible AI research for African societal problems | Multiple |
| West Africa | University of Ibadan & Afe Babalola University (Nigeria) | IgboSynCorp – first spoken corpus for Igbo | Igbo |
| West Africa | Ashesi University (Ghana) + Nokwary Technologies | Financial inclusion speech dataset | Akan (Akuapem Twi, Asante Twi, Fante), Ga |
| West Africa | KNUST – Responsible AI Lab (Ghana) | Ethical AI aligned with SDGs; bridging the digital skills gap | Multiple |
| West Africa | RobotsMali (Mali) | IndabaX Mali + democratising robotics & AI | Multiple |
| West Africa | University of Dakar (Senegal) + Jokalante | KALLAAMA Project – audio transcriptions | Wolof, Pular, Sérèr |
| West Africa | IFRA Nigeria | Corpus-based studies: NaijaSynCor (Naija Pidgin), Naija Archives, CORPAFROAS (Afroasiatic Spoken Corpus) | Naija Pidgin, Afroasiatic languages |
| Central Africa | University of Zambia (Zambia) | BIG-C – Bemba Image Grounded Conversations dataset | Bemba |
| Southern Africa | North-West University – CTexT (South Africa) | Autshumato Project – parallel corpora, MT systems, OCR, PoS, spelling checkers | 10 official SA languages |
| Southern Africa | University of Pretoria – AI4D African Languages Lab | Bias, inclusivity & data quality research; collaboration with Data Science for Social Impact Lab | Multiple |
| North Africa | Al Akhawayn University (Morocco) | Moroccan Darija Offensive Language Detection Dataset – 20,402 labelled sentences from Twitter & YouTube | Moroccan Darija |

About Lanfrica Labs

About Lanfrica Labs | Lanfrica Docs (lanfrica-labs.gitbook.io)
