Motivation
Our digital world is a rich tapestry of ideas, languages, cultures, and knowledge. However, our access to and understanding of these resources is skewed: some gain significant visibility, while others remain under-represented and obscure, even when they are available on the web. Our understanding is largely defined by what’s findable. In today’s fast-paced digital age, online discoverability is essential: if information cannot be found, it is often perceived as nonexistent and consequently under-utilized.
Despite Africa’s linguistic richness, language technologies largely exclude African languages, with AI tools supporting only a fraction; popular voice technologies (Siri, Alexa, and Google Home) do not support a single African language. The marginalization of African languages in language technologies is the symptom; the root of the problem is the lack of discoverability of African (and marginalized) resources. That is, the African language technology resources needed to create these technologies (e.g. keyboards, dictionaries, datasets, and linguistic resources) are not findable by those who need them.
The Problem
Empirical, quantitative analysis of African language datasets is almost non-existent, leading to a lack of public understanding. For example, we still cannot say with confidence how many hours of Chichewa (or Gikuyu, or any African language you care about) speech exist, or how much Yoruba, Dagaare, or Pulaar text data is available. This lack of empirical analysis stalls innovation because it leads to reinventing the wheel (if we don’t know what’s out there, how can we be sure we are not repeating past efforts?). It is also driving an alarming trend in which a few, often popular, African languages receive all the attention, dataset creation efforts, and funding, while the majority, which we call “tail-end” African languages, receive less, and some are completely ignored.
A quantitative empirical analysis of available African language datasets will improve public understanding, guide policymakers, and address a key barrier to technology development we’ve identified: having to spend excessive time and resources just to identify which datasets exist for a specific use case.
Embarking on a bold solution together in 2025
Our mission at Lanfrica is to accelerate public understanding and utilization of African resources (including African language datasets). We do this by finding, organizing, and “linking” African resources on the platform. With our community, we are on a mission to “link” all African language resources, one at a time. Our platform has indexed more than 80,000 African resources, and that number keeps growing.
Ever since we launched v1 of Lanfrica, the vision has been to provide something indispensable to the public that no one has yet been able to offer: quantitative, evidence-based insights about the African datasets out there. Because of our unique approach to tackling the discoverability of African resources, we are best positioned to aggregate the statistics and finally answer questions like “how many speech datasets are there for {insert_your_African_language_of_choice}?”. We did something like that with our language highlight initiative back in 2022. Beyond that, by organizing and structuring the linked records, we will be able to provide insights on the quantity of available African language datasets.
In 2025, we are embarking on an ambitious goal: a thorough empirical analysis of African language datasets (with a focus on speech and text), culminating in a first-of-its-kind report on the state of African language datasets. This project will involve multiple stakeholders and community members. The overall goal is to improve public understanding of the state of African language datasets (focusing on speech and text in this iteration).
How do we hope to achieve this?
The work package, at a high level, can be divided as follows:
Finding African language datasets: we plan to intensify our efforts at finding and connecting African language datasets out there. This is a community effort, and we need help from the community. The more datasets we are able to link, the richer the empirical analysis.
Annotation: once the resources are identified, a key part of the project involves closely examining each one and annotating it with detailed information. This granular data will serve two main purposes: 1) it will form the foundation for the analysis and insights presented in the report, and 2) it will be displayed on the Lanfrica platform, giving users a richer understanding of the scope and coverage of African datasets.
To optimize this process, we will build a lightweight annotation platform within Lanfrica. This tool will streamline the annotation workflow, making it easier for contributors and collaborators to tag, verify, and enrich dataset entries efficiently.
Analysis: the next part involves compiling the annotations, finding the interesting stories they tell, and figuring out how to share them with the world.
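To make the annotation and analysis steps concrete, here is a minimal sketch of how annotated records could be rolled up into per-language statistics. Everything in it is an assumption made for illustration: the `DatasetAnnotation` fields, the `summarize` helper, the size units, and the sample records are hypothetical and do not describe Lanfrica's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetAnnotation:
    """One annotated dataset record (hypothetical schema, for illustration only)."""
    url: str       # where the dataset is hosted online
    language: str  # e.g. "Yoruba"
    modality: str  # "speech" or "text"
    size: float    # assumed units: hours for speech, tokens for text
    license: str   # e.g. "CC-BY-4.0", "restricted", "unknown"


def summarize(annotations):
    """Count datasets and total their size per (language, modality) pair."""
    summary = defaultdict(lambda: {"datasets": 0, "total_size": 0.0})
    for a in annotations:
        key = (a.language, a.modality)
        summary[key]["datasets"] += 1
        summary[key]["total_size"] += a.size
    return dict(summary)


# Made-up records, not real Lanfrica data.
records = [
    DatasetAnnotation("https://example.org/a", "Yoruba", "speech", 12.5, "CC-BY-4.0"),
    DatasetAnnotation("https://example.org/b", "Yoruba", "speech", 7.5, "restricted"),
    DatasetAnnotation("https://example.org/c", "Chichewa", "text", 250000, "unknown"),
]

stats = summarize(records)
print(stats[("Yoruba", "speech")])  # {'datasets': 2, 'total_size': 20.0}
```

Grouping by language and modality is what would let a question like "how many speech datasets are there for Yoruba?" be answered with a single lookup; a real analysis would likely also break results down by domain and license.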
Reporting: the final stage involves consolidating all previous efforts into a tangible output. A key consideration here is that we won’t be able to capture every existing African dataset in this initial version—some will inevitably be missed. So, we must design the output in a way that allows for ongoing updates as new data becomes available.
We are planning two versions of the report to serve different audiences:
- A research paper (similar to this paper by the Data Provenance Initiative) that details the project and its methodology, and includes rich narratives and insights from our analysis.
- A high-level report tailored for policymakers, government officials, and decision-makers, highlighting key findings and actionable takeaways.
The future
Just as Ethnologue has become a reference about the languages of the world, and Wikipedia for general knowledge, Lanfrica aspires to be the trusted reference for understanding African language datasets (and African resources in general).
To achieve this, we intend to continue this effort, and publish annual reports on the state of African language datasets, starting with 2025, then 2026, 2027, and so on. This launch is only the beginning: it marks the first iteration of an evolving project. As the work progresses, we’ll continue to update and refine the analysis, incorporating newly discovered or previously missed datasets, allowing the insights to grow deeper and more comprehensive over time.
Important points about the project
Some important things to note about this kickoff project (we’ll be adding more as we decide on them):
- For this iteration, we are focusing on datasets that are hosted somewhere on the internet; that is, each dataset must have a URL where it can be found online. Future iterations will consider un-digitized resources, as part of our growth strategy.
- We are including datasets of all types—whether open access, closed, restricted, or anything in between. At Lanfrica, we believe that a dataset doesn’t have to be open access or meet “open-source” standards to deserve visibility. In the Global South, data sharing involves complex realities, and making these resources discoverable can bring meaningful benefits to marginalized communities.
How can I contribute?
Lanfrica has always been community-powered. Whether you have time, expertise, or resources to share, there is a place for you in this project.
Link the datasets you know
Have you come across an African language dataset not yet on Lanfrica? Link it. Be sure to search the platform first to confirm it isn’t already listed. Every single new record, no matter how small, strengthens the accuracy of the 2025 report and makes that resource discoverable to everyone who needs it.
Fund the project
Building the lightweight annotation platform, compensating annotators, and running the statistical analyses all require funding. Every contribution goes directly to these work streams and is transparently reported. If your organization can sponsor a specific language, domain, or task, we’ll gladly earmark the funds and acknowledge your support in the final report.
Volunteer your time or expertise
- Annotators & reviewers – Help tag datasets with key metadata (language, size, domain, license, etc.).
- Analysts & storytellers – Turn raw numbers into insights, visualizations, and narratives.
- Tool builders & testers – Shape the annotation platform and ensure it stays user-friendly and robust.
- Any other way you see fit.
Let us know which work package excites you, and we’ll plug you in.
Spread the word
Share this initiative with researchers, developers, linguists, and policymakers in your network. The broader our reach, the richer and more representative the final picture will be.
Connect with Us
For this project, you can connect with us in the following ways:
- Join our Discord community. This is our primary communication channel for this project. We will be holding conversations, planning sessions, and meetings about this project on Discord.
- We also have a Lanfrica community on WhatsApp and Telegram, for colleagues more comfortable in those spaces. We still encourage you to join the Discord server, if you can, as that is the primary communication platform for this project. With that said, what’s most important to us is that people are in communication spaces that they feel comfortable in, so we are accommodating.
- If you prefer, reach out to us via email at [email protected] or our phone number (on WhatsApp too) +1 250 609 4068.