Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis

On social media platforms, people tend to communicate informally, writing posts and comments in their local dialects. More than 1,500 dialects and languages exist in Africa. Tunisians in particular write informally using Latin letters and numerals rather than Arabic script. In this paper, we introduce a large Common Crawl-based Tunisian Arabizi dialectal dataset dedicated to sentiment analysis. The dataset consists of 100k comments (about movies, politics, sports, etc.) manually annotated by native Tunisian speakers as Positive, Negative, or Neutral. We evaluate the dataset on a sentiment analysis task using Bidirectional Encoder Representations from Transformers (BERT), a contextual language model, in its multilingual version (mBERT) as an embedding technique, and then combine mBERT with a Convolutional Neural Network (CNN) as the classifier.
