We created a novel dataset, ANTC (African News Topic Classification), covering four African languages. We obtained data from three news sources: VOA, BBC, and Isolezwe. From the VOA data we created datasets for Lingala and Somali: we took the articles released by Palen-Michel et al. (2022) and used the provided URLs to retrieve each article's news category from the website. For Pidgin and isiZulu, we scraped the news topics directly from the respective news websites (BBC Pidgin and Isolezwe), based on the category under which each article was published. Some articles belonged to more than one category, so we filtered out articles with multiple labels. We also ensured that each category has at least 200 samples. The categories include, but are not limited to, Africa, Entertainment, Health, and Politics. The pre-processed datasets were divided into training, development, and test sets using stratified sampling with a ratio of 70:10:20. Appendix A.2 has more details about the dataset size and news topic information.
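The filtering and splitting steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `filter_and_split`, the `(text, labels)` input format, and the choice to drop categories with fewer than 200 samples are all assumptions made for illustration, and the split is a simple per-category shuffle written with the Python standard library only.

```python
import random
from collections import Counter, defaultdict


def filter_and_split(samples, min_per_category=200,
                     ratios=(0.7, 0.1, 0.2), seed=0):
    """Hypothetical helper. `samples` is a list of (text, labels) pairs,
    where `labels` is a list of category names for that article."""
    # Step 1: keep only articles with exactly one category label.
    single = [(text, labels[0]) for text, labels in samples
              if len(labels) == 1]

    # Step 2 (assumption): drop categories below the minimum sample count.
    counts = Counter(label for _, label in single)
    by_label = defaultdict(list)
    for text, label in single:
        if counts[label] >= min_per_category:
            by_label[label].append((text, label))

    # Step 3: stratified split, i.e. split each category separately so
    # every subset keeps roughly the same category proportions.
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_dev = round(len(items) * ratios[1])
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test
```

Splitting each category separately keeps the 70:10:20 ratio approximately intact within every category, which is what distinguishes stratified sampling from a single random split over the whole dataset.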