Some African NLP Repositories That Are Improving African Language Resource Availability

Introduction

Many African languages are classified as low-resource in today’s digital and technological landscape, which leaves significant gaps in the development and deployment of language technologies on the continent. For instance, Nigerian Pidgin, spoken by over 100 million people, has less than 21 minutes of available speech data, a glaring example of underrepresentation. As advances in natural language processing (NLP) accelerate globally, African technologists and researchers have taken on the challenge of creating the resources, tools, models, and solutions needed for African languages to gain meaningful representation.

This effort is essential to serve the more than 1.4 billion Africans who speak and use these languages daily. Although international repositories host resources for fewer than 30 African languages, significant progress is being made locally. African technologists, academics, and communities are building rich, nuanced repositories that reflect the context and complexity of these languages, improving accessibility and enabling faster innovation. Consequently, more African languages are gaining visibility in the digital world, driving the development of contextually aware and inclusive NLP solutions.

Within the continent, resource availability varies widely. Niger-Congo languages such as Kinyarwanda, Swahili, Yoruba, and Igbo, along with Hausa (an Afroasiatic language), benefit from thousands of hours of speech data. Conversely, languages such as Sango and Oromo have far less speech data, underscoring the unevenness that still needs to be addressed.

Purpose of the Guide

This guide highlights prominent African repositories and projects that support the creation, curation, and distribution of African language resources and help secure their place in the digital realm. African-focused repositories provide localized knowledge, broader language coverage, and deeply engaged communities that support the resources they host. Continued support for these languages and the expansion of their digital resources are critical to delivering more effective and relevant technological solutions that address Africa’s unique needs. Knowing where to contribute resources, what has already been developed, and where the gaps are is only possible with clear visibility into these repositories. This guide provides that visibility, fostering collaboration and accelerating the advancement of African language NLP.

Highlight of Repositories

Repository: A centralised digital hub that hosts datasets, models, tools, research papers, and other resources, providing a structured space where they are stored, organised, and shared for long-term access and reuse.

Most of these repositories have their resources linked on Lanfrica.

AfricArXiv

AfricArXiv is a community-led digital archive and open science platform for African research resources, including manuscripts, reports, datasets, code, visualisations, and presentations. In 2023, AfricArXiv introduced persistent identifiers and licensing options aligned with international standards to safeguard copyrights while promoting open access.

Target: African researchers, scientists, policymakers, and collaborators inside and outside Africa working on African-relevant research.

Platform Language support: Akan, Twi, Swahili, Zulu, English, French, Portuguese.

Lacuna Fund Language Domain Resources

Lacuna Fund is a funding and support initiative that helps create and organise labelled data for AI and data science projects in low- and middle-income contexts, including countries across Africa. It focuses on the domains of Agriculture, Language, Health, and Climate.

It hosts openly accessible datasets and speech corpora for low-resource African languages.

Some interesting resources in the repository:

Machine Translation benchmark datasets for Horn of Africa languages
Nigerian Twitter Sentiment Corpus (multilingual sentiment analysis)
BIG-C multimodal dataset for Bemba
Hate and offensive language detection datasets

Target: Researchers and developers seeking to develop ethical and impactful AI solutions for African contexts.

SADiLaR

SADiLaR is a repository and platform dedicated to electronic text, speech data, multilingual resources, tools, and applications for the official South African languages and other African languages. Supported by the South African Department of Science, Technology and Innovation (DSTI), it promotes research and development of NLP technologies such as machine translation.

Coverage: All official South African languages alongside other African languages.

Target: Researchers, linguists, technologists, and academics.

Andrew2017 GitHub repo

A curated repository of African NLP datasets, benchmarking corpora, and models covering tasks such as:

Sentiment analysis in Tunisian Arabic
Amharic summarisation benchmarks
ASR datasets for Fon and Sepedi languages
Mboshi-French parallel corpus for speech

South African Journal of African Languages

A peer-reviewed research journal dedicated to the advancement of Bantu and Khoisan languages and literature. It publishes academic papers, book reviews, and polemic contributions in linguistics and literature.

Acceptance rate: Approximately 70%.

Access: Hybrid open access to boost discoverability and highlight the intellectualisation of African languages.

Target audience: Researchers, students, and educators of African languages and literature.

Ghana NLP

An open-source, community-led initiative focused on Natural Language Processing for Ghanaian languages and practical applications addressing local challenges. The community has developed APIs for translation, transcription, and text-to-speech for African languages.

Some interesting resources:

Kasa – English to Twi translation system
ABENA – Training scripts for language models like BERT, mBERT, and DistilBERT
Ghanaian NLP Dataset Models – Datasets and models for Akan, Dagbani, and Ga languages

Target audience: Researchers and developers working on Ghanaian language NLP.

Masakhane

Masakhane is a grassroots NLP community led by Africans, for African languages. Their GitHub repository houses a rich collection of tools, datasets, and language models.

Notable contributions:

Masakhane POS dataset – Part-of-speech tagging for 20 typologically diverse African languages (e.g., Bambara, Ewe, Fon, Luganda, Wolof, Zulu).
LM Evaluation Framework – A framework for few-shot evaluation of large language models on more than 60 academic benchmarks.
AfriQA – Cross-lingual open-retrieval question answering dataset with 12,000+ examples across 10 African languages, including Hausa, Kinyarwanda, Yoruba, and Bemba.
MasakhaNER – Named entity recognition data for 20 African languages, such as Ghomala, Chichewa, Amharic, Nigerian Pidgin, and Swahili.

Additional resource:

A GitHub repository by @dadelani documenting work with LSV and Masakhane, including MENYO-20k, a multi-domain English-Yoruba corpus, and pre-trained models such as AfriMBART and AfriByT5.

The table below summarises the repositories covered in this guide.

| Repository | Scope | Highlights | Resource Types | Update Frequency | Audience |
| --- | --- | --- | --- | --- | --- |
| AfricArXiv | African research resources | Open science platform, persistent IDs, licensing | Papers, posters, projects | Actively maintained by community | Researchers, policymakers |
| Lacuna Fund Language Domain | AI data for low-resource contexts | Diverse social impact datasets, funding support | Datasets, projects, media, models | Ongoing, community-supported | Researchers, developers |
| SADiLaR | South African official languages | Government backing, broad language coverage | Papers, projects, models, datasets, software | Government-supported, regularly updated | Researchers, linguists, technologists |
| Andrew2017 GitHub repo | African NLP datasets and benchmarks | Large, diverse language sample datasets | Software, datasets, models, projects | Periodic community contributions | Researchers, developers |
| South African Journal of African Languages | African languages literature | Papers | Papers, posters | Periodic publishing | Researchers, students, educators |
| Ghana NLP | NLP for Ghanaian languages | Datasets, models, software, projects | Datasets, software, models, projects | Active community-led updates | Researchers, developers |
| Masakhane | NLP community for Africa | Datasets, models, software, projects, media | Tools, datasets, models, projects | Very active community, ongoing updates | Researchers, developers |

Conclusion

We have made remarkable progress in African language NLP, with communities like Masakhane achieving outcomes that were thought to be far off just a few years ago. Despite these strides, much work remains to be done. At Lanfrica, our mission is to improve the discoverability of African language resources, guiding communities, researchers, and academics to focus their efforts on under-resourced languages so that Africa’s 2,500+ languages are more evenly represented.

Many African repositories are actively maintained by passionate NLP communities across Africa, which continuously call for new resources and contributions. Staying engaged with these communities and responding to their needs can accelerate efforts to build robust multilingual support for African AI solutions. While some African datasets exist in global repositories, they often lack the dedicated African community backing necessary for sustained growth and contextual relevance. Strengthening and supporting African-led repositories is key to advancing inclusive, effective language technologies for the continent.

About Lanfrica Labs

About Lanfrica Labs | Lanfrica Docs (lanfrica-labs.gitbook.io)
