This repository contains the code for the paper "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages", which appeared at the first Workshop on Multilingual Representation Learning at EMNLP 2021. AfriBE...

The advancement of speech technologies has been remarkable, yet its integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digi...

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural ...

AfroLID is a powerful neural toolkit for African language identification, covering 517 African languages....

Code for the EMNLP 2021 Paper AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages.

Reproducible benchmarks are crucial in driving progress of machine translation research. However, existing machine translation benchmarks have been mostly limited to high-resource or well-represented languages. Despite an increasing interest in low-resource machine...

English (translated): Discourse function has often been noticed to be a strong factor in conditioning Bantu word order (Van der Wal 2015a; Downing & Hyman 2016; Downing & Marten 2019). Core concepts of discourse function are topic, defined as what the sentence is a...

A dataset covering over 700 languages, providing audio, aligned text, and word pronunciations. On average, each language provides around 20 hours of sentence-length transcriptions. The data is mined from New Testament readings at http://www.bible.is/

This paper describes the CMU Wilderness Multilingual Speech Dataset, a dataset covering over 700 languages that provides audio, aligned text, and word pronunciations. On average, each language provides around 20 hours of sentence-length transcriptions. We describe ...

This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

We present CrossSum, a large-scale dataset comprising 1.65 million cross-lingual article-summary samples in 1500+ language-pairs constituting 45 languages. We use the multilingual XL-Sum dataset and align identical articles written in different languages via cross-...

Kirundi belongs to the Bantu language family and is spoken by more than 30 million people, mainly in Burundi, but also in Rwanda, Tanzania, parts of the DR Congo, and of course in the Burundian diaspora in Germany and all over the world. DEKI can...

Founded in 1988, the Folio Group has grown from a tiny start-up into the major-league language service provider that it is today. This is largely driven by our reputation for reliability, technical expertise, fast turnaround and meticulous accuracy. Folio is recogn...

Gboard is a virtual keyboard app developed by Google for Android and iOS devices.