An Exploration of Vocabulary Size and Transfer Effects in Multilingual Language Models for African Languages

Multilingual pretrained language models have been shown to work well on many languages, even those they were not originally pretrained on. Despite their empirical success in downstream tasks, there is still a gap in understanding "what makes them tick". In this paper, we try to understand the effects of sharing a vocabulary space on the cross-lingual abilities of a multilingual model. We train multiple monolingual and multilingual models and compare their effectiveness on downstream tasks. In monolingual models, a single language occupies the entire vocabulary space, limiting possible cross-lingual transfer. In a multilingual setting, by contrast, the model benefits from cross-lingual transfer, with the trade-off of having to split the vocabulary space between multiple languages. We present a comprehensive study of the effects of a shared vocabulary space, cross-script pretraining, and high-resource transfer on the cross-lingual abilities of multilingual models in zero- and few-shot settings. From our study, we observe that scaling the number of languages is beneficial for cross-lingual transfer in low-resource multilingual models up to a point, after which transfer effects saturate. We find that there is not much benefit from pretraining low-resource multilingual models with a high-resource language, and that cross-lingual transfer is possible even when the languages are written in different scripts. This empirical study is conducted in the context of three linguistically different low-resource African languages (Amharic, Hausa, and Swahili), and evaluation is performed on two different tasks: text classification and named entity recognition. During the course of our experiments, we also performed an audit of the quality of two common low-resource language corpora (Common Crawl and BBC News data).