Arabic Offensive Language Detection in Social Media






Studies have shown that cyberhate, online harassment, and the use of offensive language in social media are on the rise. Offensive language, even online, creates an exclusionary environment and can foster real-world violence. Despite significant advances in Natural Language Processing (NLP), online offensive language detection remains one of the most challenging text classification tasks because of the ambiguity and informality of user-generated content and the social context of its users. Automatic offensive language detection is further complicated in languages with diverse forms and limited resources, such as Arabic. Arabic is spoken widely in the Middle East, encompasses multiple dialects and cultures, and represents a multitude of nationalities. The wide adoption and heterogeneity of Arabic also affect the social context on which the interpretation and automatic recognition of offensive language depend. This dissertation proposes and develops methods for automatic offensive language detection for Arabic in social media. It explores transfer learning approaches to this task in a resource-constrained language that is rich in dialectal and cultural variation. Our studies show that: (1) advanced language models can capture the ambiguous, informal variations in user-generated text and accurately recognize offensive language; (2) different Arabic dialects can inform each other and contribute significantly to cross-dialect offensive language detection; (3) content from different user-generated content platforms does not add value to cross-platform offensive language detection; (4) existing state-of-the-art language models can be customized to improve their coverage of dialectal Arabic; and (5) a dialectal Arabic language model outperforms the non-dialectal model on some offensive language detection datasets.