ANTC — African News Topic Classification Dataset
We created a novel dataset, ANTC (African News Topic
Classification), for four African languages. We obtained data from three news sources: VOA,
BBC, and Isolezwe. From the VOA data, we created datasets for Lingala and Somali. We obtained
the topics from data released by Palen-Michel et al. (2022) and used the provided URLs to
get the news category from the websites. For Pidgin and isiZulu, we scraped news topics directly
from the respective news websites (BBC Pidgin and Isolezwe, respectively) based on their category. We
noticed that some articles were assigned to more than one category; we therefore filtered
out articles with multiple labels. We also ensured that each category has at least 200 samples. The
categories include, but are not limited to, Africa, Entertainment, Health, and Politics. The pre-processed
datasets were divided into training, development, and test sets using stratified sampling with a ratio
of 70:10:20. Appendix A.2 has more details about the dataset size and news topic information.
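
The filtering and splitting steps described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `filter_and_split`, the `(text, labels)` input format, and the choice to drop categories below the 200-sample threshold (rather than, say, collecting more data for them) are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def filter_and_split(samples, min_per_class=200, percents=(70, 10, 20), seed=0):
    """Hypothetical helper: samples is a list of (text, labels) pairs,
    where labels is a list of category names scraped for that article."""
    # Keep only articles with exactly one category label.
    single = [(text, labels[0]) for text, labels in samples if len(labels) == 1]

    # Assumption: categories with fewer than min_per_class articles are dropped.
    counts = Counter(label for _, label in single)
    by_label = defaultdict(list)
    for text, label in single:
        if counts[label] >= min_per_class:
            by_label[label].append((text, label))

    # Stratified sampling: split each category separately at 70:10:20,
    # so every split preserves the overall category proportions.
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = len(items) * percents[0] // 100
        n_dev = len(items) * percents[1] // 100
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test
```

Splitting per category and then concatenating is one simple way to obtain a stratified split; a library routine with a stratification option would serve equally well.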