Taha Tobaili Wagrain, Austria 2018

Taha Tobaili

NLP Engineer | Entrepreneur | Athlete
London, UK

Welcome to my homepage!

Herzlich Willkommen, Bienvenido, أهلاً وسهلاً, Hoşgeldiniz!

I could describe myself as a jack of many trades, master of some. I enjoy a skill palette that spans different disciplines in life.

Solving linguistic complexities to develop Natural Language Understanding (NLU) machines has always sparked my interest, specifically in dealing with under-resourced languages or multilingual contexts. This stems from my interest in languages in general and in affective computing, the study of how machines can process and react to human emotion.

For that, I pursued a PhD in Natural Language Processing (NLP) at the Knowledge Media Institute of the Open University in Milton Keynes, UK, and completed it in November 2020. Since then I have worked in industry, mostly as an NLP engineer, learning, designing, and building language technologies.

In parallel, I have been building and testing ideas in Education Technology (EdTech) as a side hustle.


Amazon Alexa, Cambridge

Amazon is famous for its Leadership Principles; living these principles was my biggest takeaway from this experience. I worked as a language engineer in the Alexa ASR (automatic speech recognition) team, whose task is to recognise customer speech correctly and convert it into text. The work was not limited to one language, but my contributions were mainly for the new Saudi Arabian Arabic locale.

Adarga AI, London

During my PhD I took on a data science internship at Adarga AI in London. Adarga had platforms that collect and analyse news information from online sources. They were interested in collecting and analysing social media data as well, so I developed an ingestion library that retrieves Twitter data based on specific criteria such as date, location, and/or topics of interest. I then focused on extracting relevant information from the social network structure, such as detecting emerging communities and identifying social influencers.

IBM Watson, Böblingen

Also during my PhD, I did an internship at IBM Watson Analytics for Social Media in Germany. One of Watson’s main strengths was mining and analysing large datasets from the internet in different languages; I was responsible for the Arabic NLP. Compared with many other languages, simplifying Arabic text is quite complex. I therefore developed a rich morphological processor from scratch that reduces inflected words to their base forms by normalising grammatical gender, number, case, and tense.

Sentiment Analysis for Low-Resourced Languages on Social Media

PhD Thesis

In 2016 I crossed the Mediterranean from the Middle East to spend four years in the UK and one in Germany researching and analysing complex multilingual social media text. Lebanon, my home country, is one of the very few in the entire region where people constantly switch to English or French as they speak their dialectal Arabic, a dialect that is heavily influenced by French and Turkish anyway. Recently, the modern generation started to reflect this natural multilingualism in social media text by Latinising dialectal Arabic, which removes the need to switch between different scripts. This newly formed language became known as Arabizi.

As it is extremely low-resourced and overlooked in the NLP literature, Latinised Arabic, or Arabizi, made a perfect case for my PhD study. Besides being low in NLP resources and code-switched with English and French, it is rich in morphology and notably lacks a standard orthography. It is a genuine linguistic mess that defies the fundamental techniques of sentiment analysis, the task of classifying subjective text as positive, negative, or neutral.

With no available NLP tools to utilise for Arabizi, I addressed a plethora of challenges the hard way: ingesting, preprocessing, and creating various datasets for sentiment analysis. I applied machine learning for language identification, then used word embeddings to populate morphologically and orthographically rich sentiment lexicons for unsupervised sentiment classification of social media data.
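The lexicon-population idea can be illustrated with a minimal sketch. This is not the thesis code: the vectors are made-up toy values, the Arabizi spellings are illustrative, and the names `expand_lexicon`, `classify`, and the 0.95 threshold are hypothetical choices. It assumes that spelling variants of a word sit close together in embedding space, so a small seed lexicon can be expanded by cosine similarity and then used for unsupervised, lexicon-based classification.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real embeddings
# would be trained on a large Arabizi corpus.
EMBEDDINGS = {
    "7elo":   [0.90, 0.10, 0.00],  # "nice"
    "helo":   [0.88, 0.12, 0.02],  # spelling variant of "7elo"
    "7ilwe":  [0.85, 0.15, 0.05],  # inflected variant of "7elo"
    "bechi3": [0.05, 0.90, 0.10],  # "ugly"
    "bchi3":  [0.07, 0.88, 0.12],  # spelling variant of "bechi3"
    "bet":    [0.10, 0.10, 0.95],  # "house", no sentiment
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(seeds, embeddings, threshold=0.95):
    """Give each unseen word the polarity of its most similar seed word,
    provided the similarity reaches the threshold."""
    lexicon = dict(seeds)
    for word, vec in embeddings.items():
        if word in lexicon:
            continue
        best_polarity, best_sim = None, threshold
        for seed, polarity in seeds.items():
            if seed not in embeddings:
                continue
            sim = cosine(vec, embeddings[seed])
            if sim >= best_sim:
                best_polarity, best_sim = polarity, sim
        if best_polarity is not None:
            lexicon[word] = best_polarity
    return lexicon


def classify(text, lexicon):
    """Sum word polarities; the sign of the total decides the label."""
    score = sum(lexicon.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


SEEDS = {"7elo": 1, "bechi3": -1}  # hypothetical seed lexicon
LEXICON = expand_lexicon(SEEDS, EMBEDDINGS)
```

With these toy values, the variants "helo", "7ilwe", and "bchi3" inherit the polarity of their seed, while "bet" stays out of the lexicon, so `classify("el bet helo", LEXICON)` is labelled positive and `classify("el bet", LEXICON)` neutral.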

Read more about the challenges and the approach here. Visit the project's page for papers, presentations, and published resources.

More About Me

Here is my biased opinion about myself: I am an optimistic being, always looking at the bright side of things. I am also a critical thinker; I do not hold a strong opinion in one place, rather I challenge those who do.

I am extremely driven by brain-stimulating activities such as chess, coding, learning a language, or practising sports. I learned Spanish, German, and a bit of Turkish; however, nothing satisfies my brain as much as highly technical sports. I used to train and compete in chess and table tennis, until I found Brazilian jiu-jitsu, the tactical human form of chess, which swiftly became my number one sport obsession, though I also swim, cycle, and ski regularly for leisure.

Science, languages, and sports are where most of my intellect and energy go. I do not aim to peak in one, but I have comfortably mastered and competed in all of them, and I am always expanding them on different fronts.

Get in touch, say hi :)