Taha Tobaili

NLP Engineer
London, UK

Welcome to my homepage!

Herzlich Willkommen, Bienvenido, Hoşgeldiniz, أهلاً وسهلاً

Solving linguistic complexities to develop Natural Language Understanding (NLU) machines has always sparked my interest, specifically when dealing with lower-resourced languages or multilingual contexts. This stems from my passion for languages and for affective computing, the discipline of teaching machines to understand and react to human emotion.

I did my PhD in Natural Language Processing (NLP) at the Knowledge Media Institute, the Open University, Milton Keynes campus in the UK, and completed it in November 2020. Since then I have been helping my friends build an EdTech (Education Technology) product for an early-stage startup in London.

P.S. Something exciting is coming up... 😏

Sentiment Analysis for Low-Resourced Languages on Social Media

PhD Thesis

In 2016 I crossed the Mediterranean from the Middle East to spend four years in the UK and one in Germany researching and analysing complex multilingual social media text. Lebanon, my home country, is one of the very few in the entire region where people constantly switch to English or French as they speak their dialectal Arabic, a dialect that is itself heavily influenced by French and Turkish. The modern generation has started to reflect this natural multilingualism in social media text by Latinising dialectal Arabic, which removes the need to switch between different scripts.

As it is extremely low-resourced and overlooked in the NLP literature, Latinised Arabic, or Arabizi, made a perfect case for my PhD study. Besides being low in NLP resources and code-switched with English and French, it is rich in morphology and notably lacks a standard orthography. It is a genuine linguistic mess that defies the fundamental techniques of sentiment analysis, the task of classifying subjective text as positive, negative, or neutral.

With no available NLP tools to utilise for Arabizi, I addressed a plethora of challenges the hard way: ingesting, preprocessing, and creating various datasets for sentiment analysis. I applied machine learning for language identification, then used deep learning to populate morphologically and orthographically rich sentiment lexicons for unsupervised sentiment classification of social media data.

Read more about the challenges and the approach here.
Visit the project's page for updates and published resources.

Industry Experience

Adarga AI, London

Most recently, I worked as an NLP Engineer designing the social media analysis pipeline from research to production. I developed a large-scale data ingestion library that retrieves Twitter data based on specific criteria such as date, location, and/or topics of interest. I focused my research on extracting information from the social network structure, through tasks such as community detection and influencer identification.

IBM Watson, Böblingen

Previously, I did an internship at IBM Watson Analytics for Social Media in Germany. One of Watson’s main strengths is mining and analysing large datasets from the internet in different languages; I was responsible for the NLP of the Arabic language. I developed a deep morphological processor that extracts gender, number, case, tense, and the base forms from richly inflected words.

More About Me

I am an optimistic being, inquisitive by nature, and constantly challenging the mainstream. I tend to find myself learning something new, especially languages, as they complement my studies. Fond of travel and adventure, I have been to many places and got lost in most of them! My energy is highly driven by sports: I have trained and competed professionally in Swimming, Table Tennis, Chess, and Brazilian Jiu-Jitsu, but I now keep a fair balance with education.

Want to collaborate on Low-Resourced, Multilingual, or Arabic NLP? Get in Touch!