Some African NLP Datasets That You Can Use To Build African AI

Imagine you're in a hospital in Lagos, and a medical AI chatbot mistranslates a prescription from English to Yoruba, changing "take twice daily" to "take twenty daily." Or picture a farmer in rural Kenya trying to get weather updates through a voice assistant that simply doesn't understand their accented English. These aren't hypothetical scenarios; they're the daily reality of AI's massive blind spot when it comes to African languages. Most AI systems were trained primarily on high-resource languages like English, French, and Mandarin, leaving Africa's thousands of languages and dialects in the digital shadows.

However, that narrative is now swiftly changing, driven by a powerful grassroots movement focusing on the very foundation of AI data. Across the continent, researchers, communities, and innovators are meticulously building the linguistic raw material necessary to train AI systems, fundamentally reshaping how African voices are represented and understood in our digital world. These efforts are powering everything from accurate machine translation to life-saving medical diagnostics.

At Lanfrica, we are dedicated to connecting, organizing, and archiving these vital efforts, ensuring they are easily discoverable. Below, we highlight a selection of these initiatives. For the comprehensive list, we invite you to explore the Lanfrica discovery platform: https://lanfrica.com/en

The Masakhane Initiative: A Grassroots Movement Ensuring African Voices Build AI

When we talk about African NLP, the name Masakhane immediately comes to mind. It's a vibrant, community-driven, African-led, open-source movement dedicated to ensuring Africans are the architects of AI, not merely its consumers. Masakhane houses an eclectic collection of African language repositories.

  • MasakhaNER 2.0: Named Entity Recognition (NER) is crucial for teaching AI to identify and categorise important terms like names, places, or medical conditions. MasakhaNER 2.0 is the largest human-annotated dataset for NER across 20 African languages, providing between 4,800 and 11,000 parallel sentences per language for training and evaluation. Languages covered span West, Central, East, and Southern Africa, including Bambara, Ghomala, Ewe, Fon, Hausa, Igbo, Kinyarwanda, Luganda, Dholuo, Mossi, Chichewa, Nigerian Pidgin, chiShona, Setswana, Swahili, Twi, Wolof, isiXhosa, Yorùbá, and isiZulu.

  • It powers models such as PuoBERTa-NER for Setswana and the FonMTL multi-task learning models for Fon.

  • The Hausa VoA NER and isiXhosa NER Corpus are also integrated into the Inkuba Instruct dataset.

  • MasakhaPOS is the largest publicly available, high-quality dataset for part-of-speech (POS) tagging across 20 typologically diverse African languages. It helps models understand the grammatical structure of these languages.

  • It is utilised by models such as FonMTL and integrated into the Inkuba Instruct dataset. While valuable, its current limitation to news text means broader applicability to other domains remains an area for further development.

  • MasakhaNEWS is the largest publicly available dataset for news topic classification across 16 widely spoken African languages. It enables models to categorise news articles into topics like business, health, politics, religion, sports, entertainment, and technology, and includes languages such as Amharic, English, French, Hausa, Igbo, Luganda, Kiswahili, and Yorùbá.

  • It has proven particularly valuable for transfer learning, as demonstrated by the davidshulte/ESM masakhane masakhanews model, which uses it as an intermediate training step to improve overall performance.

  • It is also integrated into the Inkuba Instruct dataset.
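Under the hood, NER corpora like MasakhaNER 2.0 are conventionally distributed as token-level annotations with BIO tags (B- marks the beginning of an entity, I- its continuation, O everything else). As a rough illustration of how such annotations are consumed, here is a minimal Python sketch; the sample tokens and tags are invented for illustration, not drawn from the corpus:

```python
# Minimal sketch of grouping BIO-tagged tokens (as used in
# MasakhaNER-style corpora) into entity spans. The example sentence
# and labels below are illustrative, not taken from the dataset.

def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)         # continue the open entity
        else:                             # "O" closes any open span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Adesua", "lo", "si", "Eko", "lana"]
tags   = ["B-PER", "O", "O", "B-LOC", "O"]
print(extract_entities(tokens, tags))  # [('Adesua', 'PER'), ('Eko', 'LOC')]
```

The same span-grouping logic applies regardless of language, which is what makes a shared annotation scheme across 20 languages so useful for multilingual training.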

Conversations in Context: Empowering AI to Ask and Answer in African Languages

  • Asking Questions, Getting Answers with AfriQA: Imagine a search engine working seamlessly in Yoruba or isiZulu, understanding cultural context rather than just literal translations. AfriQA, also known as AfriQA Gold Passages, is the first cross-lingual question answering (QA) dataset for African languages, with over 12,000 cross-lingual open-retrieval (XOR) QA examples spanning 10 languages. This resource is crucial for developing more equitable and robust QA technologies.

  • It is already integrated into the Inkuba Instruct dataset for question answering in languages like Hausa, Yoruba, Swahili, isiZulu, and isiXhosa.

  • The Swahili Question Answering Dataset for Horticulture was created to contribute to Swahili language resources for natural language processing tasks. It is specific to the horticulture domain and can be used for machine reading comprehension tasks.
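Extractive QA datasets of this kind typically store, for each example, a question, an evidence passage, and a gold answer located inside that passage. The sketch below illustrates that shape; the field names and the English toy passage are assumptions for illustration, not AfriQA's exact schema (in the cross-lingual setting, the question would be in an African language while the passage may be in a pivot language):

```python
# Illustrative shape of an extractive QA example. Field names are
# assumptions, not the dataset's exact schema.
example = {
    "question": "Which is the largest city in Nigeria?",  # in practice, asked in e.g. Yorùbá
    "context": "Lagos is the largest city in Nigeria.",
    "answer": "Lagos",
}

def answer_span(context, answer):
    """Return (start, end) character offsets of the gold answer in the passage."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer not found in context")
    return start, start + len(answer)

print(answer_span(example["context"], example["answer"]))  # (0, 5)
```

Character offsets like these are what span-prediction QA models are trained to recover.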

Giving AI African Voices: The Speech Revolution

Text is only half the story; speech technology is vital, especially on a continent where mobile phones are prevalent and oral traditions remain strong. These datasets are directly addressing AI's historical biases.

  • Fixing Accent Bias with AfriSpeech-200: Most commercial voice recognition systems significantly struggle with African-accented English, primarily because they were trained on Western voices. AfriSpeech-200 is the first and most diverse open-source, Pan-African accented English speech corpus, comprising over 200 hours of speech: 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries.

  • It includes both clinical and general domain content, meticulously embedding African-centric proper nouns, tropical diseases, and medications by integrating biomedical datasets like PubMed and NCBI Disease.

  • When models like Whisper and wav2vec 2.0 were fine-tuned on AfriSpeech-200, they showed dramatic improvements in accuracy, with the XLSR-53 model achieving a 49.1% relative improvement and Whisper-medium a 34.9% improvement. This directly reduces racial bias in commercial ASR systems and has profound implications for healthcare and education. However, all audio samples are read speech, so models trained on it may underperform on spontaneous conversational speech, and the corpus currently excludes North African accents.

  • Intron Health uses the AfriSpeech-200 dataset to train AI models for automatic speech recognition (ASR) in clinical settings, enabling its Transcribe app to accurately convert dictated medical notes from over 200 African accents into text. This technology reduces the paperwork burden on healthcare professionals in Africa, increasing efficiency and accessibility for medical documentation.

  • Community Power with Mozilla Common Voice: If AfriSpeech-200 is a research milestone, Mozilla Common Voice is a genuine grassroots revolution. This open-source, crowd-sourced platform lets anyone contribute by recording their voice or validating others' recordings. This collective effort has brought languages like Kinyarwanda, Kiswahili, Luganda, Hausa, Xhosa, Twi, Kalenjin, Setswana, Dholuo, isiNdebele (Southern Ndebele), and Southern Sotho into the dataset. For instance, the Kinyarwanda speech recognition dataset from the Common Voice platform has 2,002 hours of validated speech. The impact is immense:

  • In Rwanda, Digital Umuganda used Kinyarwanda Common Voice data to build the Mbaza AI Chatbot for COVID-19 information, deployed on government healthcare hotlines.

  • In Kenya, Ujuzi Craft/ChamaChat developed a Kiswahili Chama management system with voice capabilities, integrating with the M-Pesa API for financial services.

  • Tech Innovators Network Ltd also created Paza Sauti, a voice-enabled chatbot for business registration and credit access awareness.

  • Strathmore University developed Imarika, a conversational climate advisory chatbot in English and Swahili for smallholder farmers.

  • In Tanzania, Sustain Earth's Environment Africa (SEE Africa) created "Kiazi Bora" (Quality Potatoes), a voice-enabled agricultural app providing farming tips to rural women in Kiswahili.

  • Duniacom Group LLC built the Kiswahili Text and Voice Recognition Platform (KTVRP) for agricultural and financial services.

  • The University of Westminster/Moi University (Kenya/UK) developed Wezesha na Kabambe, a mobile-enabled Swahili audio chatbot for agricultural information that works on feature phones without internet connectivity.

  • Core23Lab (Democratic Republic of Congo) developed "Haki des femmes", a voice technology platform to provide legal information for women's land rights.

  • Bridging Francophone Africa with African Accented French: This corpus consists of approximately 22 hours of speech recordings with transcripts, collected from Yaoundé (Cameroon), Libreville (Gabon), and Niamey (Niger). It addresses the nuanced phonetic characteristics distinguishing African French from European varieties, which is essential for effective voice technology in vital services like healthcare and education.

  • It has been integrated into training datasets exceeding 2,500 hours for advanced French speech-to-text models, including distilled Whisper variants like bofenghuang/whisper-large-v3-distil-fr-v0.2, bradenkmurray/faster-whisper-large-v3-french-distil-dec16, and aTrain-core/whisper-large-v3-french-distil-dec16, as well as the fine-tuned NVIDIA French FastConformer (linagora/linto-stt-fr-fastconformer).
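A note on the "relative improvement" figures quoted for fine-tuned ASR models above: they are conventionally computed from word error rates (WER) as a percentage reduction relative to the baseline. A quick sketch with made-up WER values (not AfriSpeech-200's actual measurements):

```python
# How a "relative improvement" in ASR is computed from word error
# rates. The example WERs below are invented to illustrate the
# formula; they are not the reported AfriSpeech-200 results.

def relative_improvement(baseline_wer, finetuned_wer):
    """Relative WER reduction, as a percentage of the baseline WER."""
    return 100 * (baseline_wer - finetuned_wer) / baseline_wer

# e.g. a model whose WER drops from 40% to 20.36% improves by 49.1% relatively
print(round(relative_improvement(0.40, 0.2036), 1))  # 49.1
```

This is why a "49.1% relative improvement" does not mean accuracy rose by 49 percentage points; it means the error rate was roughly halved.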

Domain Specific Innovation: Where Precision Truly Matters

Some datasets home in on specific, high-stakes domains where accuracy is paramount.

  • Global Ambitions, Local Challenges: NLLB and FLORES-200: Meta AI's No Language Left Behind (NLLB) project is an ambitious effort to enhance machine translation for low-resource languages, supporting over 40 African languages and achieving a 40% improvement in BLEU scores compared to previous models. FLORES-200 serves as a pivotal multilingual evaluation dataset, covering over 40,000 translation directions, including many African languages.

  • NLLB-Seed is a dataset for training low-resource MT models, containing human-translated African medical text. Despite this breadth, the NLLB-200 600M model exhibited critical failures in healthcare translations, such as mistranslating Swahili medical dosage instructions, which poses risks of unsafe medication use. Furthermore, its automatic toxicity detection mechanism disproportionately flagged African language translations as unsafe, revealing deep-seated cultural and linguistic biases.

  • The isiZulu AfriMMLU translate-test and AfriMGSM translate-test datasets are evaluation sets whose translations from 16 African languages into English were produced using NLLB.

  • The "Correcting FLORES Evaluation Dataset for Four African Languages" project also specifically focuses on improving the accuracy and reliability of machine translation evaluation for Hausa, Northern Sotho, Xitsonga, and isiZulu by correcting inconsistencies in the FLORES dataset.

  • Infant Health Diagnostics with Ubenwa Dataset: The Ubenwa dataset is a unique audio-based, clinically validated resource for infant cry analysis, used by Nigerian startup Ubenwa AI to detect birth asphyxia, a leading cause of infant mortality.

  • The Ubenwa AI model achieved impressive diagnostic accuracy with 85% sensitivity and 89% specificity. However, its performance significantly declined when analysing cries in Nigerian Pidgin and Hausa contexts due to English-centric NLP training. The system's opaque decision-making logic also led to rejection by clinicians in some Nigerian hospitals, highlighting the vital need for explainable AI in medical contexts.

  • WURA: This document level dataset covers 16 African languages and four high-resource languages (English, French, Arabic, Portuguese). It was created by carefully auditing existing corpora like mC4 and crawling verified news sources to address quality issues in web-crawled data for low-resource languages.

  • WURA played a foundational role in training AfriTeVa V2, with the castorini/afriteva_v2_base model showing improved performance across text classification, machine translation, summarisation, and cross-lingual question answering.

  • It is also leveraged for improved tokenisation for the AmhT5 tokenizer for Amharic and English.

  • Inkuba Instruct: Developed by Lelapa AI, this dataset represents a new approach to African NLP by compiling downstream datasets for five African languages across multiple NLP tasks, serving as a vital resource for instruction fine-tuning of language models. It integrates data from various public repositories for tasks like Machine Translation (WMT22 African, MAFAND-MT, MENYO-20k), Named Entity Recognition (MasakhaNER 2.0, Hausa VoA NER, isiXhosa NER Corpus), Part-of-Speech tagging (MasakhaPOS), Question Answering (AfriQA), Topic Classification (SIB-200, MasakhaNEWS, Hausa News Classification), and Sentiment Analysis (AfriSenti, NaijaSenti, Swahili Tweet Sentiment).

  • Pula-8B, a significant model from the BOTS-LM suite, was fine-tuned on Inkuba Instruct to perform tasks like Setswana–English translation, writing, QA, NER, and POS tagging.

  • Similarly, Dineochiloane/gemma-3-4b-isizulu-inkuba-v2, an improved isiZulu-to-English translation model, was trained using examples from the isiZulu train split of Inkuba Instruct.
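For context on the diagnostic figures quoted for the Ubenwa model: sensitivity and specificity are derived from a confusion matrix of true/false positives and negatives. The counts in the sketch below are illustrative, chosen only to reproduce the quoted 85% and 89%; they are not Ubenwa's actual evaluation numbers:

```python
# Sensitivity and specificity from a diagnostic confusion matrix.
# The counts below are made up to reproduce the quoted figures; they
# are not Ubenwa's actual evaluation data.

def sensitivity_specificity(tp, fn, tn, fp):
    """Compute the two standard diagnostic rates from confusion counts."""
    sensitivity = tp / (tp + fn)  # true-positive rate: affected cases caught
    specificity = tn / (tn + fp)  # true-negative rate: healthy cases cleared
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=85, fn=15, tn=89, fp=11)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")  # sensitivity=85%, specificity=89%
```

High sensitivity matters most for a screening tool like infant-cry analysis, since a missed case of birth asphyxia is far costlier than a false alarm.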

Notable Datasets and Resources

Beyond these major initiatives, several other datasets contribute to the evolving African NLP landscape:

  • Parallel corpora for MT across 38+ African languages.

  • High-quality parallel sentences in 16 African languages (news domain).

  • Multi-domain English–Yoruba MT corpus (20k+ pairs).

  • Parallel MT dataset (~23k Ewe, ~53k Fongbe sentences).

  • 10k Yoruba–English sentence pairs from news, books, proverbs, etc.

  • 15k English–Luganda parallel corpus (Makerere University).

  • MT benchmark dataset for Horn of Africa languages.

  • Bible translations covering 100+ languages, including many African languages.

  • English texts translated into Acholi, Runyankore, Luganda, Lumasaba, and Swahili.

  • 117k+ Fon–French parallel sentences for MT.

  • Largest human-annotated NER dataset (20 African languages).

  • POS tagging dataset for 20 African languages.

  • Universal Dependencies treebank for 11 African languages.

  • NLI benchmark: translations of XNLI into 16 African languages.

  • Reading comprehension dataset for Nigerian languages.

  • Cross-lingual QA dataset with 12k+ examples in 10 African languages.

  • 200+ hours of accented African English speech for ASR.

  • Speech dataset for French with African accents.

  • 148 hours of speech in Ghanaian languages (Akan, Ga) for financial services.

  • 40 hours of spoken Igbo across dialects.

  • Community-contributed speech datasets in multiple African languages.

  • Bible readings in 700+ languages, including African.

  • Google's open dataset for speech translation (includes African languages).

  • Commercial service creating/transcribing African speech datasets.

  • Community-driven speech dataset (Swahili, Luganda, Kinyarwanda, Runyankore, Nigerian accents).

  • 1,183 hours of Kinyarwanda speech collected via Common Voice.

  • Speech datasets in Wolof, Pulaar, Sereer (agriculture domain).

  • First Tunisian Arabizi sentiment dataset.

  • 110k+ annotated tweets across 14 African languages.

  • Multilingual hate speech and abusive language dataset.

  • Stance detection dataset in Zulu.

  • News classification in 16 African languages.

  • 31k Swahili news articles across multiple categories.

  • News articles from Malawian publishers.

  • Headlines in Setswana and Sepedi.

  • South African government speeches + newspaper data in 11 official languages.

  • Global topic classification benchmark (includes African languages).

  • Digital Luganda dictionary (archived).

  • List of stop words for Swahili text processing.

  • Swahili slang → proper word mappings.

  • Common typos and corrections in Swahili.

  • Alternative stopword list for Swahili.

  • Large-scale web-crawled dataset for West African languages.

  • Information retrieval benchmark for African languages.

  • COVID-19 translation dataset (35 languages incl. African).

  • Corrected FLORES evaluation dataset for four African languages.

  • Dialogue datasets (6 African languages) adapted from MultiWOZ 2.2.

  • Multilingual parallel corpus for low-resource languages.

About Lanfrica

Lanfrica is an open-access, centralised catalogue dedicated to African language technology resources, linking papers, datasets, benchmarks, models, and research projects. We make it significantly easier for researchers, academics, technologists, and policy makers to discover and reuse these resources: instead of searching across multiple scattered repositories, users can browse our platform to find African-language models, corpora, and evaluation sets, accelerating collaboration and innovation across the continent.
