Join our mailing list to get updates on our events, news, and the latest from the world of African language resources.

Your email is safe with us. We promise not to spam!
Please, consider giving your feedback on using Lanfrica so that we can know how best to serve you. To get started, .
X
Filter

Filter Records

Languages

Loading...

Tasks

Loading...

Record Types

Loading...

Tags

Loading...

The African Storybook (ASb) is a multilingual literacy initiative that works with educators and children to publish openly licensed picture storybooks for early reading in the languages of Africa. An initiative of Saide, the ASb has an interactive website that enab...

Expand Abstract

Palatalization of coronals and stridents is well-known and widespread, and is most commonly associated with front vowels or glides as triggers. In some dialect(s) of Setswana, a much different type of palatalization occurs: alveolar stridents /s ts tsʰ/ become pre-...

Expand Abstract

Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represente...

Expand Abstract

We introduce a speech corpus containing multilingual code-switching compiled from South African soap operas. The corpus contains English, isiZulu, isiXhosa, Setswana and Sesotho speech, paired into four language-balanced subcorpora containing English-isiZulu, Engli...

Expand Abstract

This website contains information about African Languages, and other African Language related resources. Currently mostly only the South African languages are covered, as well as Kiswahili and Cilubà. It is estimated that there are between 2000 and 3000 languages ...

Expand Abstract

A first set of African Language Embeddings Word Embeddings Document Embeddings Currently Supporting Sepedi (Northern Sotho), nso Setswana, tsn

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural ...

Expand Abstract

AfroLID is a powerful neural toolkit for African languages identification which covers 517 African languages....

Expand Abstract

Multilingual information access is stipulated in the South African constitution. In practise, this is hampered by a lack of resources and capacity to perform the large volumes of translation work required to realise multilingual information access. One of the aims ...

Expand Abstract

Unlike major Western languages, most African languages are very low-resourced. Furthermore, the resources that do exist are often scattered and difficult to obtain and discover. As a result, the data and code for existing research has rarely been shared, meaning re...

Expand Abstract

This version of the Bloom Library data is developed specifically for the language modeling task. It includes data from nearly 400 languages across 35 language families, with many of the languages represented being extremely low resourced languages. Note: If you sp...

Expand Abstract

This corpus is an attempt to recreate the dataset used for training XLM-R. This corpus comprises of monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). This was constructed using the urls and paragraph indices pr...

Expand Abstract

CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web-documents, and ensuring corresponding language codes were corresponding...

Expand Abstract

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average preci...

Expand Abstract

Setswana language is one of the Bantu languages written disjunctively. Some of its parts of speech such as qualificatives and some adverbs are made up of multiple words. That is, the part of speech is made up of a group of words. The disjunctive style of writing po...

Expand Abstract