Introducing a Large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis
On social media platforms, people tend to
communicate, post, and comment informally,
in their local dialects. More than 1500 dialects and
languages exist in Africa. Tunisians, in particular,
communicate informally in writing using Latin letters and
numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based
Tunisian Arabizi dialectal dataset dedicated
to sentiment analysis. The dataset consists
of 100k comments (about movies,
politics, sports, etc.) manually annotated by
Tunisian native speakers as Positive, Negative,
and Neutral. We evaluate our dataset on the sentiment analysis task using the multilingual
version of Bidirectional Encoder Representations from Transformers
(mBERT), a contextual language model,
as the embedding technique, and then combining mBERT with
a Convolutional Neural Network (CNN) as the classifier.
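
As a rough illustration of what a CNN classifier over contextual token embeddings involves, the following dependency-free sketch slides convolution filters over a sequence of token vectors, applies max-over-time pooling, and maps the pooled features to the three sentiment labels. It is not the authors' implementation: the random vectors stand in for mBERT outputs, and the toy sizes, weights, and function names (`conv1d_max_pool`, `classify`) are assumptions made only for this example.

```python
import math
import random

LABELS = ["Positive", "Negative", "Neutral"]


def conv1d_max_pool(tokens, filters, bias):
    """Slide each filter over the token sequence, apply ReLU, keep the max."""
    width = len(filters[0])  # number of consecutive tokens one filter covers
    pooled = []
    for filt, b in zip(filters, bias):
        best = 0.0  # ReLU floor: pooled activations never drop below zero
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            act = b + sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, act)
        pooled.append(best)
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, filters, bias, out_w, out_b):
    """Return the predicted label and the class probabilities."""
    features = conv1d_max_pool(tokens, filters, bias)
    logits = [
        b + sum(w * f for w, f in zip(row, features))
        for row, b in zip(out_w, out_b)
    ]
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs


# Toy sizes (assumed); mBERT itself produces 768-dimensional token vectors.
rng = random.Random(0)
dim, width, n_filters, seq_len = 8, 3, 4, 6


def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]


tokens = [rand_vec(dim) for _ in range(seq_len)]  # stand-in for mBERT output
filters = [[rand_vec(dim) for _ in range(width)] for _ in range(n_filters)]
bias = rand_vec(n_filters)
out_w = [rand_vec(n_filters) for _ in LABELS]
out_b = rand_vec(len(LABELS))

label, probs = classify(tokens, filters, bias, out_w, out_b)
print(label, [round(p, 3) for p in probs])
```

With untrained random weights the predicted label is meaningless; in practice the filters and output layer would be trained on the annotated comments, with the token vectors coming from mBERT rather than a random generator.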