Effect of Tokenisation Strategies for Low-Resourced Southern African Languages

Research into machine translation for African languages is very limited, and the languages are low-resourced in terms of datasets and model evaluations. This work aims to add to the field of neural machine translation research for four low-resourced Southern African languages. The effect of two byte pair encoding tokenisation algorithms (subword-nmt and SentencePiece), with various parameters, is evaluated. The paper builds upon previous research in the field for comparison, using an optimised transformer architecture and pre-cleaned data to translate English to Northern Sotho, Setswana, Xitsonga and isiZulu. The results show improvements over the previous BLEU scores for Setswana and isiZulu.
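For readers unfamiliar with byte pair encoding, the following is a minimal, illustrative sketch of the general technique in plain Python. It is not the implementation used by subword-nmt or SentencePiece, and it is not taken from the paper. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker and the toy corpus are assumptions made for illustration. The number of merges is one parameter that BPE tools typically expose; which parameters the paper actually varies is not stated in the abstract.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["abab", "abab", "ab"]  # toy data, not from the paper
    rules = learn_bpe(toy_corpus, num_merges=1)
    print(rules)
    print(segment("abab", rules))
```

A larger number of merges yields longer subword units and a larger vocabulary, while a smaller number keeps the segmentation closer to individual characters.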