Join our mailing list to get updates on our events, news, and the latest from the world of African language resources.

Your email is safe with us. We promise not to spam!
Please, consider giving your feedback on using Lanfrica so that we can know how best to serve you. To get started, .
X
Filter

Filter Records

Languages

Loading...

Tasks

Loading...

Record Types

Loading...

Tags

Loading...

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wik...

Expand Abstract

AfriCLIRMatrix is a test collection for cross-lingual information retrieval research in 15 diverse African languages. This resource comprises English queries with query–document relevance judgments in 15 African languages automatically mined from Wikipedia...

Expand Abstract

Language diversity in NLP is critical in enabling the development of tools for a wide range of users.However, there are limited resources for building such tools for many languages, particularly those spoken in Africa.For search, most existing datasets feature few ...

Expand Abstract

This repository contains code to reproduce Better Quality Pre-training Data and T5 Models for African Languages which appears in the 2023 conference on Empirical Methods in Natural Language Processing (EMNLP). AfriTeVa V2 was trained on 20 languages (16 African La...

Expand Abstract

In this paper we describe a system developed to identify a set of four regional Arabic dialects (Egyptian, Gulf, Levantine, North African) and Modern Standard Arabic (MSA) in a transcribed speech corpus. We competed under the team name MAZA in the Arabic Dialect Id...

Expand Abstract

This paper presents the systems submitted by the MAZA team to the Arabic Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign 2017. The goal of the task is to evaluate computational models to identify the dialect of Arabic utterances using bo...

Expand Abstract

Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide...

Expand Abstract

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evalua...

Expand Abstract

In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new mu...

Expand Abstract

Introduction BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Egyptian Arabic discussion forum (DF), SMS/Chat and conversation...

Expand Abstract

Introduction BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and consists of propbank annotation ...

Expand Abstract

Introduction BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving ...

Expand Abstract

Introduction BOLT Egyptian Arabic SMS/Chat Parallel Training Data was developed by the Linguistic Data Consortium (LDC) and consists of approximately 723,000 tokens of Egyptian Arabic SMS/Chat data collected for the DARPA BOLT program along with their correspondin...

Expand Abstract

Introduction BOLT Egyptian Arabic Treebank - Conversational Telephone Speech was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic conversational telephone speech data with part-of-speech annotation, morphology, gloss and syntactic ...

Expand Abstract

Introduction BOLT Egyptian Arabic Treebank -- Discussion Forum was developed by the Linguistic Data Consortium (LDC) and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation. The DAR...

Expand Abstract