{"id":5193,"date":"2026-03-06T19:57:00","date_gmt":"2026-03-06T19:57:00","guid":{"rendered":"https:\/\/lanfrica.com\/blog\/?post_type=insight&#038;p=5193"},"modified":"2026-06-19T23:13:48","modified_gmt":"2026-06-19T23:13:48","slug":"understanding-the-african-next-voices-datasets","status":"publish","type":"insight","link":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/","title":{"rendered":"Understanding the African Next Voices Datasets"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>We provide the first holistic, evidence-driven analysis of the single largest African data set creation effort in history: 7 countries, 24 languages, 18,000+ hours!<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the early 2020s, publicly available speech datasets of at least 500 hours for many African languages were non-existent. What existed were small-to-medium, high-quality datasets built through community participatory approaches, pioneered by researchers, labs, and grassroots efforts. These were good enough for research benchmarking and moderate finetuning, but never sufficient to build robust AI applications for African languages that could be deployed in the real world.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And so, the closest thing we had to &#8220;large-scale&#8221; was the web. 
But even that was not particularly useful for real-world AI; the quality was often subpar[<a href=\"https:\/\/arxiv.org\/abs\/2103.12028\">1<\/a>,<a href=\"https:\/\/aclanthology.org\/2025.acl-long.370\/\">2<\/a>], and it failed to represent the real contexts and environments of the communities that speak these languages.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A few efforts to create large datasets[<a href=\"https:\/\/naijavoices.com\/\">3<\/a>,<a href=\"https:\/\/lanfrica.com\/en\/record\/kencorpus-kenyan-languages-corpus-for-machine-learning-and-natural-language-processing\">4<\/a>,<a href=\"https:\/\/lanfrica.com\/en\/record\/kinyarwanda-commonvoice-speech-dataset\">5<\/a>,<a href=\"https:\/\/lanfrica.com\/en\/record\/zambezi-voice\">6<\/a>,<a href=\"https:\/\/lanfrica.com\/en\/record\/sautidb-nigerian-accent-dataset-collection\">7<\/a>,<a href=\"https:\/\/lanfrica.com\/en\/record\/the-makerere-radio-speech-corpus-a-luganda-radio-corpus-for-automatic-speech-recognition-1\">8<\/a>] slowly sprouted over the years, challenging the narrative that African communities could only create small data samples. Some yielded success for selected African languages, but none did so across multiple languages at once (think 10, 20 African languages).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">All that has changed now.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The African Next Voices Project<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The African Next Voices Project, active roughly between 2024 and 2025, was a large-scale, multi-stakeholder, ambitious initiative to create sufficiently large volumes (&gt; 500 hours) of speech with aligned transcripts for African languages across the continent, with the goal of accelerating the development of practical AI applications and use-cases to address African problems. 
It was indeed multi-stakeholder: involving different dataset creators, universities, experts, and consultants across the continent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Today, the project is largely complete. Yet many of us have not fully grasped what this means.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We have entered a new era: a meaningful percentage of African languages now have more than 500 hours of speech-text data, and in some cases, up to thousands of hours. This enables unprecedented speech-text applications that go beyond research labs into the real world; into startups building products, public services for citizens and governments. It is a prime time to be working on speech text for African languages!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"#but\"><\/a>But\u2026<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine you are building speech\u2013text applications for African languages. You want to leverage the ANV datasets for your work. Where do you even start? Where can you find the Setswana dataset? Did you know the ANV catalogue includes Wolaytta, or that Bambara now has over 600 hours of speech data you can use? Do these datasets adequately represent female voices? What dialects are (not) present? What biases or harms could stem from using these datasets?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The African AI ecosystem is vast, fragmented, and full of moving parts, so we tend to see only the individual bits and pieces. When you focus on one project here and another dataset there, it is hard to grasp what is really happening. The ANV datasets are no exception: they were created by different organisations, using different methodologies for different languages, and released under different procedures and timelines. 
Without a unifying view that connects these efforts and presents them as a coherent whole, it becomes difficult to see the full picture of the African Next Voices project\u2014to understand its true scale, its geographic and linguistic reach, and its impact.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That is the gap this insight aims to fill.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"#where-are-the-anv-datasets\"><\/a>Where are the ANV datasets?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At Lanfrica Insights, we started by mapping all the datasets released under the ANV project; linking each one to its accompanying paper, relevant articles, and all hosting sources, and enriching each entry with key metadata like the country, license, and languages covered. In doing so, we have made it trivial for anyone \u2013 from AI developers to civil society \u2013 to find all the ANV datasets with one click:<a href=\"https:\/\/lanfrica.com\/en\/discover?tag=African+Next+Voices\"> https:\/\/lanfrica.com\/en\/discover?tag=African+Next+Voices<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"#how-we-performed-the-analysis\"><\/a><strong>How We Performed the Analysis<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Our principle for this analysis was simple: whenever possible, we calculated the statistics ourselves rather than relying solely on the values reported. This meant downloading the dataset metadata and, in many cases, the datasets themselves \u2014 sometimes hundreds of gigabytes of audio \u2014 in order to compute the values directly from the data. This necessitated efficient data processing strategies to accommodate storage space and time constraints (<em>when life gives you lemons&#8230;<\/em>). 
We relied on reported values only in cases where the statistics could not reasonably be computed from the released data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our approach allowed us to ensure consistency, reproducibility, and reliability across the analysis. By computing the statistics ourselves, we were able to base the analysis on what is actually present in the datasets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We compared our calculated values with the numbers reported by the dataset creators. In some cases, the values were consistent. Where differences appeared, we indicate them in our visualizations. For example, some dataset creators intentionally withheld portions of their evaluation data: the South African and Kenyan datasets withheld approximately 5% of the test data from public release, so we could not analyse that portion.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where needed, we reached out directly to dataset creators to seek clarification or request additional metadata. This step allowed us to connect with the teams who built the datasets and ground our analysis in the realities of how the data was collected and documented.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Below are the results of our investigation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"#understanding-the-african-next-voices-datasets\"><\/a>Understanding the African Next Voices Datasets<\/h3>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1.jpg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Geographic coverage of the African Next Voices Datasets. Here we present the reported number of speech hours.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The geographic atlas above illustrates the countries where dataset collection took place for the African Next Voices project, including the languages covered and the total reported speech hours in each country. 
The ANV project is made up of 7 datasets that together cover 24 African languages from speakers across 7 countries spanning the Western, Eastern and Southern regions of the continent. The combined total number of hours is about 18,000!<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<h3 class=\"wp-block-heading\">7 countries, 24 languages, 18,000+ hours!<\/h3>\n<\/blockquote>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"#who-are-the-stakeholders-behind-the-african-next-voices-project\"><\/a>Who are the stakeholders behind the African Next Voices Project?<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Like everything done well on the African continent, it takes a village. Drawing from the organizations listed on the dataset cards, we map the stakeholders and partners involved in the African Next Voices project, from the funders collective to the dataset-creator institutions.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/stakeholders-updated-map.svg\" alt=\"\"\/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"#how-much-speech-data-is-there\"><\/a>How much speech data is there?<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Across the 24 African languages, the dataset volume varies.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/2.svg\" alt=\"\" class=\"wp-image-5197\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><br>The distribution shows a few very large datasets at the top, with Kinyarwanda and Swahili reaching several thousand hours. But the more important story lies in the middle.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most ANV languages cluster between roughly 400 and 700 hours, and the median language has over 500 hours of speech. 
In other words, half of the languages in the collection already exceed the 500-hour mark. Just a few years ago, reaching 500 hours for a single African language was rare; today, this level of scale represents the midpoint of the ANV ecosystem.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How the recordings were created<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When creating speech datasets, there are two main approaches to obtaining the audio from speakers: read (\/red\/) speech and spontaneous speech. In read speech, a text is prepared beforehand, and the speaker is asked to read the text aloud exactly as written. The speech therefore follows the structure and wording of the written text. In spontaneous speech, however, there is no text for the speaker to read from. Instead, the speaker is prompted to talk naturally. These prompts can take different forms; for example, speakers may be shown an image, asked a question, or given a topic to talk about. The key difference is that the speech is not constrained by a written script. Speakers express themselves freely, using their own words and natural speech patterns. Because there is no predefined text, the speech must be transcribed afterward: annotators listen to the audio and write down what the speaker said.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Within the sometimes divided African speech data creation ecosystem, one point is unanimously agreed on: spontaneous speech is the preferred method for truly encoding the natural, complex speech patterns of African languages.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/3.svg\" alt=\"\" class=\"wp-image-5198\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We analyzed the ANV recordings along this dimension. 
The visualization above shows the proportion of read speech versus spontaneous speech across the languages in the ANV datasets, with the percentage of spontaneous speech labeled for each language.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Across all languages, spontaneous speech clearly dominates. The lowest share of spontaneous speech in the datasets is 71%, while some languages, such as Swahili and Bambara, consist of 100% spontaneous speech. This pattern suggests that the dataset creators and partners involved in the African Next Voices project deliberately prioritized spontaneous speech when collecting recordings. In many ways, this reflects an important design choice; capturing spontaneous speech allows datasets to better reflect how these languages are actually spoken in everyday contexts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At the same time, spontaneous speech is not without its challenges. Compared to read speech, it is significantly more difficult and expensive to transcribe, especially in African languages. A number of factors contribute to this difficulty, including low literacy levels, the fact that many African languages remain predominantly oral, and the absence or limited standardization of writing conventions for many African languages. <a href=\"https:\/\/arxiv.org\/pdf\/2510.12781\">For example, the RobotsMali team&#8217;s dataset paper details the complicated hurdles and expense of transcribing Bambara in Mali<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Our analysis did not evaluate the nature, properties, or quality of the transcripts themselves. 
However, it remains an important aspect to think about while using these datasets.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Behind the Hours: Different Speaker Communities<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/8.svg\" alt=\"\" class=\"wp-image-5201\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Different language datasets arrive at their volume through very different community engagement patterns. Take language datasets that fall within the same hour band of, say, 500 to 700 hours. You might expect their speaker size to be similar. But that is not what we see. Take Bambara, with 612 hours featuring 512 speakers, and contrast it with Tigrinya, with 601.09 hours (about 11 hours less) and 758 speakers (but a whopping 246 speakers more!). They are in the same range of hours, yet the number of unique speakers differs enormously.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This matters because speaker size variation affects the resulting AI model\u2019s generalization across different voices, accents, and speaking styles within a language. Our interpretation is that different dataset creators used different speaker-to-hour ratios; how many hours each unique speaker was allowed to record. This likely reflected the geographic and logistical realities in each country.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Gender representation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">A prevailing pattern in datasets, especially on the African continent, is that you get less female participation. This matters because the resulting AI models can (and do) inherit the biases of the data they are trained on. 
If there is significant gender imbalance in the data and it is not recognized or mitigated, the AI will inevitably reproduce that imbalance.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/7.svg\" alt=\"\" class=\"wp-image-5200\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">At the macro level, the 24 ANV language datasets are split down the middle in terms of their gender representation: almost half of the language datasets have more male speakers than female speakers, and the rest have more female speakers than male speakers. The size of the bubble in the figure captures the degree of this male-female difference. Bambara is the language dataset with the largest male majority. The Bambara dataset creators, RobotsMali, documented this <a href=\"https:\/\/arxiv.org\/abs\/2511.18557\">in their paper<\/a>, explaining that social norms in Mali around women&#8217;s participation in public activities made female recruitment very difficult.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"1080\" src=\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/5.svg\" alt=\"\" class=\"wp-image-5199\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">At the micro level, when you look at the male-female speaker split across all 24 languages ranked by female speaker proportion, you get a clear picture of where gender gaps are largest and smallest. 
This finding enables the general public, including policymakers and funders, to understand the gender dimensions of the largest concentrated African speech dataset effort to date.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#the-adults-are-abundant-the-elderly-are-rare\"><\/a>The Adults Are Abundant, the Elderly Are Rare<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">We set out to investigate how useful the ANV datasets could be for someone building applications for a specific age demographic. Think, for instance, of a speech-based application designed to assist elderly patients dealing with dementia or Alzheimer&#8217;s.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/9.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For this analysis, we created two bins: adults (18 to 49 years) and elderly (50 years and above). 
We found that the ANV recordings are largely contributed by young and middle-aged adults, with very little participation from people aged 50 and above: as little as 2.4% of the total hours <em>(that is barely 300 hours out of 18,000 hours)<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/11.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Zooming into the micro-level details in the figure below, we see that the very limited elderly participation comes from just a handful of languages, with the most coming from Somali.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#domains-use-cases-and-impact\"><\/a>Domains, Use Cases, and Impact<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we looked at the domains covered by the datasets. Some of the released datasets documented the domains their recordings covered, and we used that information to get a picture of the domain coverage across all ANV datasets, giving us insight into potential use cases of the datasets.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/12.svg\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Important domains like agriculture, health, finance, and everyday conversation are well represented in the datasets. There are also other more specific and interesting domains in the mix, like fashion and art. 
Therefore, the ANV datasets can be used to build practical AI solutions in Africa, cutting across sectors from healthcare and agriculture to government services.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What\u2019s more \u2013 all these large volumes of data are licensed under the CC BY 4.0 license, enabling both research and commercial applications.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, there are restrictions on how the dataset can be used. For example, the Swivuriso: ZA-African Next Voices dataset<a href=\"https:\/\/huggingface.co\/datasets\/dsfsi-anv\/za-african-next-voices#use-restriction\"> explicitly prohibits<\/a> the use of its data for text-to-speech (TTS), voice cloning, voice synthesis, or any technology or activity intended to replicate, mimic, or generate human voices.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#research-readiness-profiles-of-the-anv-datasets\"><\/a>Research Readiness Profiles of the ANV Datasets<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">When you want to use a dataset for research, one factor that enables it is the presence of a clear train\/dev\/test split.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">These splits make research workflows much easier; you can train or fine-tune your model on the training portion, use the development set for validation and model selection, and reserve the test set for final benchmarking. 
This separation helps ensure that performance results reflect real generalization rather than overfitting to the data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the table below, we document our findings on the availability and characteristics of train\/dev\/test splits across the ANV datasets to provide a profile of their research readiness.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Dataset Name<\/th><th>Train<\/th><th>Dev<\/th><th>Test<\/th><th>Comments<\/th><\/tr><\/thead><tbody><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>As the test set, the creators release a <code>dev_test<\/code> split for the usual benchmarking. Additionally 5% of the data has been withheld from public release*.<\/td><\/tr><tr><td>Mali African Next Voices<\/td><td>\u2705<\/td><td>\u274c\ufe0f<\/td><td>\u2705<\/td><td><\/td><\/tr><tr><td>Rwanda African Next Voices<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>This dataset was split based on the domains covered.<\/td><\/tr><tr><td>African Voices: Multilingual Speech Dataset for Low-Resource African Languages (Nigeria)<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>This dataset is not yet publicly available, so we could not analyse the split.<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>5% of the data has been withheld from public release*.<\/td><\/tr><tr><td>African Next Voices: Kenya and Tanzania<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>The dataset was split based on the domains covered.<\/td><\/tr><tr><td>African Next Voices: Ethiopia<\/td><td>\u2705<\/td><td>\u2705<\/td><td>\u2705<\/td><td>The dataset was split to ensure no speaker overlap: speakers in the training set do not appear in the validation or test sets.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>The decision to 
withhold the data from public release could serve important purposes. First, it helps prevent data contamination in the current AI landscape, where large models are increasingly trained on indiscriminately scraped web data. If the full dataset were openly accessible, there is a high likelihood that it would be unintentionally absorbed into large-scale pretraining corpora. Furthermore, within the research community, withholding a portion of the data reduces the temptation to overfit on the test set.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">On the Multilingual Mosaic of African Expression: Dialects and Code-switching<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Dialects are intricately linked to African languages and the whole discourse about how we express ourselves and communicate in African languages; the idea of a single static linguistic identity does not really exist here; it is always fluid, always lived, always mixing with other forms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Nigeria alone is home to over 500 native languages, comprising many dialects and variants shaped by local histories and interactions<sup>[<\/sup><a href=\"https:\/\/en.wikipedia.org\/wiki\/Languages_of_Nigeria\"><sup>source<\/sup><\/a><sup>]<\/sup>. In Kenya, there are about 68 living languages, spread across regions and ethnic communities; many speakers grow up learning at least one indigenous language alongside others in their communities<sup>[<\/sup><a href=\"https:\/\/statskenya.co.ke\/at-stats-kenya\/about\/how-many-languages-are-spoken-in-kenya-languages-and-dialects\/123\/\"><sup>source<\/sup><\/a><sup>]<\/sup>. 
In South Africa, there are twelve official languages, and most people speak several of these languages and their internal dialects as part of everyday communication, work, schooling, and cultural life<sup>[<\/sup><a href=\"https:\/\/en.wikipedia.org\/wiki\/Demographics_of_South_Africa\"><sup>source<\/sup><\/a><sup>]<\/sup>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We therefore investigate if and how the ANV datasets capture this complexity; through the lenses of dialects and code switching.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Dataset Name<\/th><th>Language<\/th><th>Dialects<\/th><th>Note on code-switching<\/th><\/tr><\/thead><tbody><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>isiZulu<\/td><td>No dialectal information reported<\/td><td>Code-switching is represented. It is indicated with <code>[cs]<\/code> in the transcript column.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>isiXhosa<\/td><td>No dialectal information reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>Sesotho<\/td><td>No dialectal information reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>Xitsonga<\/td><td>No dialectal information reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>Setswana<\/td><td>No dialectal information reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>isiNdebele<\/td><td>No dialectal information reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Swivuriso: ZA-African Next Voices<\/td><td>Tshivenda<\/td><td>No dialectal information 
reported<\/td><td>Code-switching is marked with <code>[cs]<\/code> in the transcript.<\/td><\/tr><tr><td>Afrivoice-Bambara<\/td><td>Bambara<\/td><td>Not reported<\/td><td><a href=\"https:\/\/arxiv.org\/pdf\/2511.18557\">According to the authors<\/a>, code-switching was limited to occasional French, leaving the broader code-switching patterns of the language community largely unrepresented.<\/td><\/tr><tr><td>Afrivoice_Kinyarwanda<\/td><td>Kinyarwanda<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>Dholuo<\/td><td>Nyandwat, Milambo<\/td><td>Not reported; but deeper transcript inspection reveals <code>[cs]<\/code> code-switching tags, primarily in unscripted speech.<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>Kikuyu<\/td><td>G\u0129-Kabete, Ki-Mathira, Ki-Muranga, Ki-Ndia &amp; G\u0129-Gichugu<\/td><td>Not reported; but deeper transcript inspection reveals <code>[cs]<\/code> code-switching tags, primarily in unscripted speech.<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>Somali<\/td><td>Maxatire<\/td><td>Not reported; but deeper transcript inspection reveals <code>[cs]<\/code> code-switching tags, primarily in unscripted speech.<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>Maasai<\/td><td>Kimasaai &amp; Kisamburu<\/td><td>Not reported; but deeper transcript inspection reveals <code>[cs]<\/code> code-switching tags, primarily in unscripted speech.<\/td><\/tr><tr><td>African Next Voices: Pilot Data Collection in Kenya<\/td><td>Kalenjin<\/td><td>Nandi &amp; Kipsigis<\/td><td>Not reported; but deeper transcript inspection reveals <code>[cs]<\/code> code-switching tags, primarily in unscripted speech.<\/td><\/tr><tr><td>Afrivoice_Swahili<\/td><td>Swahili<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><tr><td>Afrivoice_Ethiopia<\/td><td>Amharic<\/td><td>Not reported<\/td><td>Not 
reported<\/td><\/tr><tr><td>Afrivoice_Ethiopia<\/td><td>Afaan Oromo<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><tr><td>Afrivoice_Ethiopia<\/td><td>Sidama<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><tr><td>Afrivoice_Ethiopia<\/td><td>Tigrinya<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><tr><td>Afrivoice_Ethiopia<\/td><td>Wolaytta<\/td><td>Not reported<\/td><td>Not reported<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">From the table, two patterns become immediately visible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, dialect information is unevenly documented. A few datasets, particularly the Kenyan pilot datasets, explicitly list the dialects represented in the recordings, giving us some visibility into the internal linguistic variation of those languages. But for most of the other datasets, dialect coverage is either not reported or simply absent from the metadata.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, code-switching is moderately captured in the datasets, often marked with <code>[cs]<\/code> tags in the transcripts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#limitations\"><\/a>Limitations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While this analysis provides a broad overview of the African Next Voices datasets, there are important aspects that we do not evaluate directly.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Error analysis is missing in this work: we did not assess the quality of the audio recordings. Factors such as recording clarity, microphone quality, background noise, and recording consistency can significantly affect the usability of a speech dataset, but evaluating these characteristics was outside the scope of this work.<\/li>\n\n\n\n<li>We do not evaluate the quality of the transcripts. 
Exploring the linguistic authenticity of the written texts is crucial for downstream tasks such as speech recognition and language modeling, but requires linguistic expertise.<\/li>\n\n\n\n<li>This study excludes a harmscape analysis: an examination of the potential use cases, especially the harmful ones, as well as the potential biases and harms that could come from the datasets. From voice cloning being used for political manipulation and inciting gender-based violence, to potential de-anonymization of speakers in order to cause harm, these issues are real, important, and deserve their own dedicated analysis.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#metadata-gaps\"><\/a>Metadata Gaps<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Our work also reveals crucial metadata gaps across datasets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While many datasets capture basic speaker characteristics such as age and gender, other characteristics are often missing. For one, a few datasets did not have a duration column, which made our hour-based analysis difficult.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, dataset documentation practices were not standardized. Many dataset cards adopt different templates and provide little to no information about how the data was collected, processed, and maintained; some datasets have published papers that document this, and others have papers in progress. 
Some good dataset documentation templates include <a href=\"https:\/\/arxiv.org\/abs\/1803.09010\">Datasheets for Datasets<\/a>, <a href=\"https:\/\/huggingface.co\/docs\/hub\/datasets-cards\">Hugging Face dataset cards<\/a>, and the nutrition labels from <a href=\"https:\/\/datanutrition.org\/\">the Data Nutrition Project<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By highlighting the gaps, we hope to help dataset creators recognize that proper documentation (metadata, dataset statistics, field descriptions, and collection processes) plays a crucial role in making datasets usable by the wider community. Without this documentation layer, datasets that required enormous effort to create can remain difficult to understand and to use <em>responsibly<\/em>. Through this work, we seek to bridge this often-overlooked reproducibility layer between data creation and real-world impact.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#about-lanfrica-insights\"><\/a>About Lanfrica Insights<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The Lanfrica Insights team aims to improve public awareness and understanding of AI in Africa. We do this through analysis and evidence-based insights that examine the forces shaping the AI ecosystem on the continent. Through this work, we investigate emerging trends, identify gaps and bottlenecks, and develop scientifically grounded analyses that help make the state of AI in Africa more visible, understandable, and actionable. 
Learn more <a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#reach-out\"><\/a>Reach out<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We are accessible via email: <a href=\"mailto:insights@lanfrica.com\">insights@lanfrica.com<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can reach out for any of the following reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>There is some wrong information in this article that needs to be corrected. This article is published as a live document, allowing us to make edits where necessary to make it a reliable reference. The version info at the bottom informs readers when the article was last edited, ensuring transparency.<\/li>\n\n\n\n<li>For collaboration requests or other opportunities<\/li>\n\n\n\n<li>For general feedback on anything else<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Additionally, consider connecting with our broader work at Lanfrica. Learn more <a href=\"https:\/\/docs.lanfrica.com\/support\/help-and-support\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/app.gitbook.com\/o\/j7Po8PnTfZADy2AZOtaC\/s\/uuJVCtpGJDjNDuqgjiZa\/understanding-the-african-next-voices-datasets#how-to-cite-this-work\"><\/a>How to cite this work<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">All visualizations presented in this report are licensed under the CC-BY license, ensuring they are freely available for any use. If you use this analysis or any of the visualizations from this report, please cite it as:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Lanfrica Insights (2026). 
Understanding the African Next Voices Datasets.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">BibTeX<\/p>\n\n\n\n<div class=\"lf-bibtex\"><div class=\"lf-bibtex__head\"><span class=\"lf-bibtex__label\">BibTeX<\/span><button type=\"button\" class=\"lf-bibtex__copy\" aria-label=\"Copy BibTeX citation\"><svg width=\"14\" height=\"14\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><span class=\"lf-bibtex__txt\">Copy<\/span><\/button><\/div><pre class=\"lf-bibtex__code\"><code>@techreport{lanfrica2026anv,\n  title       = {Understanding the African Next Voices Datasets},\n  author      = {{Lanfrica Insights}},\n  year        = {2026},\n  institution = {Lanfrica Labs},\n  license     = {CC-BY}\n}<\/code><\/pre><\/div>\n","protected":false},"excerpt":{"rendered":"<p>The first holistic, evidence-driven analysis of the largest African speech-dataset effort in history \u2014 African Next Voices: 7 countries, 24 languages, 18,000+ 
hours.<\/p>\n","protected":false},"author":1,"featured_media":5206,"template":"","meta":{"inline_featured_image":false,"footnotes":""},"insight_topic":[],"ppma_author":[119],"class_list":["post-5193","insight","type-insight","status-publish","has-post-thumbnail","hentry"],"ppma_authors":[{"display_name":"Lanfrica","slug":"lanfrica","avatar":"https:\/\/secure.gravatar.com\/avatar\/83f4a38b43593dcd229a29dc1c23bbc0ca0dd5cc875c1dca530aae3be531325e?s=96&d=mm&r=g","has_avatar":false,"url":"https:\/\/blog-origin.lanfrica.com\/blog\/author\/lanfrica\/","is_guest":false}],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Understanding the African Next Voices Datasets - Lanfrica Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lanfrica.com\/en\/blog\/insight\/understanding-the-african-next-voices-datasets\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Understanding the African Next Voices Datasets - Lanfrica Blog\" \/>\n<meta property=\"og:description\" content=\"The first holistic, evidence-driven analysis of the largest African speech-dataset effort in history \u2014 African Next Voices: 7 countries, 24 languages, 18,000+ hours.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lanfrica.com\/en\/blog\/insight\/understanding-the-african-next-voices-datasets\" \/>\n<meta property=\"og:site_name\" content=\"Lanfrica Blog\" \/>\n<meta property=\"article:modified_time\" content=\"2026-06-19T23:13:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1440\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@lanfrica\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"19 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/\",\"url\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/\",\"name\":\"Understanding the African Next Voices Datasets - Lanfrica Blog\",\"isPartOf\":{\"@id\":\"https:\/\/lanfrica.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg\",\"datePublished\":\"2026-03-06T19:57:00+00:00\",\"dateModified\":\"2026-06-19T23:13:48+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage\",\"url\":\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg\",\"contentUrl\":\"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/up
loads\/2026\/06\/african-map-1-scaled.jpg\",\"width\":2560,\"height\":1440},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lanfrica.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Understanding the African Next Voices Datasets\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lanfrica.com\/blog\/#website\",\"url\":\"https:\/\/lanfrica.com\/blog\/\",\"name\":\"Lanfrica Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/lanfrica.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lanfrica.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lanfrica.com\/blog\/#organization\",\"name\":\"Lanfrica\",\"url\":\"https:\/\/lanfrica.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lanfrica.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ww2.lanfrica.com\/wp-content\/uploads\/2022\/05\/cropped-favicon-1.png\",\"contentUrl\":\"https:\/\/ww2.lanfrica.com\/wp-content\/uploads\/2022\/05\/cropped-favicon-1.png\",\"width\":512,\"height\":512,\"caption\":\"Lanfrica\"},\"image\":{\"@id\":\"https:\/\/lanfrica.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/lanfrica\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Understanding the African Next Voices Datasets - Lanfrica Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/lanfrica.com\/en\/blog\/insight\/understanding-the-african-next-voices-datasets","og_locale":"en_US","og_type":"article","og_title":"Understanding the African Next Voices Datasets - Lanfrica Blog","og_description":"The first holistic, evidence-driven analysis of the largest African speech-dataset effort in history \u2014 African Next Voices: 7 countries, 24 languages, 18,000+ hours.","og_url":"https:\/\/lanfrica.com\/en\/blog\/insight\/understanding-the-african-next-voices-datasets","og_site_name":"Lanfrica Blog","article_modified_time":"2026-06-19T23:13:48+00:00","og_image":[{"width":2560,"height":1440,"url":"https:\/\/lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_site":"@lanfrica","twitter_misc":{"Est. 
reading time":"19 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/","url":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/","name":"Understanding the African Next Voices Datasets - Lanfrica Blog","isPartOf":{"@id":"https:\/\/lanfrica.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage"},"image":{"@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage"},"thumbnailUrl":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg","datePublished":"2026-03-06T19:57:00+00:00","dateModified":"2026-06-19T23:13:48+00:00","breadcrumb":{"@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#primaryimage","url":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg","contentUrl":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-content\/uploads\/2026\/06\/african-map-1-scaled.jpg","width":2560,"height":1440},{"@type":"BreadcrumbList","@id":"https:\/\/blog-origin.lanfrica.com\/blog\/insight\/understanding-the-african-next-voices-datasets\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/lanfrica.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Understanding the African Next Voices 
Datasets"}]},{"@type":"WebSite","@id":"https:\/\/lanfrica.com\/blog\/#website","url":"https:\/\/lanfrica.com\/blog\/","name":"Lanfrica Blog","description":"","publisher":{"@id":"https:\/\/lanfrica.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/lanfrica.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/lanfrica.com\/blog\/#organization","name":"Lanfrica","url":"https:\/\/lanfrica.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lanfrica.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/ww2.lanfrica.com\/wp-content\/uploads\/2022\/05\/cropped-favicon-1.png","contentUrl":"https:\/\/ww2.lanfrica.com\/wp-content\/uploads\/2022\/05\/cropped-favicon-1.png","width":512,"height":512,"caption":"Lanfrica"},"image":{"@id":"https:\/\/lanfrica.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/lanfrica"]}]}},"_links":{"self":[{"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/insight\/5193","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/insight"}],"about":[{"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/types\/insight"}],"author":[{"embeddable":true,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"version-history":[{"count":5,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/insight\/5193\/revisions"}],"predecessor-version":[{"id":5213,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/insight\/5193\/revisions\/5213"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/media\/5206"}],"wp:attachment":[{"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/media?pa
rent=5193"}],"wp:term":[{"taxonomy":"insight_topic","embeddable":true,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/insight_topic?post=5193"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog-origin.lanfrica.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=5193"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}