Taha Tobaili

PhD candidate in Natural Language Processing
Knowledge Media Institute

Sentiment Analysis for Low-Resourced Languages on Social Media

PhD Thesis

In 2016 I crossed the Mediterranean from the Middle East to spend four years in the UK and one in Germany researching and analysing complex multilingual social media text. Lebanon, my home country, is one of the very few in the region where people constantly switch to English or French as they speak their dialectal Arabic, a dialect that is heavily influenced by French and Turkish. The modern generation started to reflect this natural multilingualism in social media text by Latinising dialectal Arabic, without the need to switch between different scripts.

Because it is extremely low-resourced and overlooked in the Natural Language Processing (NLP) literature, Latinised Arabic, or Arabizi, made a perfect case for my PhD study. Besides being low in NLP resources and code-switched with English and French, it is rich in morphology and distinctively lacks a standard orthography. It is a genuine linguistic mess that defies the fundamental techniques of sentiment analysis, the task of classifying subjective text as positive, negative, or neutral.

With no available NLP tools for Arabizi, I addressed a plethora of challenges the hard way: ingesting, preprocessing, and creating various datasets for sentiment analysis. I applied machine learning for language identification, then used deep learning to populate morphologically and orthographically rich sentiment lexicons for unsupervised sentiment classification of social media data.

Read more about the challenges and the approach here, published in Towards Data Science. Visit the project page for updates and published resources.

Industry Experience

Adarga AI, London

Recently I worked as an NLP Engineer, designing the social media analysis pipeline from research to production. I developed a large-scale data ingestion library that retrieves Twitter data based on specific criteria such as date, location, and/or topics of interest. My research focused on extracting information from the social network structure, such as community detection and influencer identification.

IBM Watson, Böblingen

Previously I did an internship at IBM Watson Analytics for Social Media in Germany. One of Watson’s main strengths is mining and analysing large datasets from the internet in different languages; I was responsible for the NLP of the Arabic language. I developed a deep morphological processor that extracts gender, number, case, tense, and the base forms from richly inflected words.

More About Me

I am an optimistic human being, inquisitive by nature, and constantly challenging the mainstream. I tend to find myself learning something new, especially languages, as it complements my sociable character. Very fond of travel and adventure, I have been to many places and got lost in most of them. My energy is highly driven by sports: I have trained and competed professionally in Swimming, Table Tennis, Chess, and Brazilian Jiu Jitsu, but I am now trying to balance it with education. I'd rather be on a bike or skis somewhere than behind the computer!

Want to collaborate on Low-Resourced, Multilingual, or Arabic NLP? Get in Touch!