I worked on an end-to-end NLP project for Sentiment Analysis of Arabizi text (Arabic written in Latin characters, commonly used in social media).
The project is built using Transformer-based models and focuses on handling the challenges of noisy, informal, and mixed-language text.
Project Pipeline:
We started by collecting and exploring a suitable dataset for sentiment analysis.
Since most available data is in Arabic script, we used LangChain with a Gemini model to transliterate and normalize the Arabic text into Arabizi, making the dataset consistent and usable for training.
A custom tokenizer was trained/optimized to properly understand Arabizi structure, including slang, abbreviations, and mixed character patterns.
We then fine-tuned a Transformer-based model for sentiment classification (Positive / Negative / Neutral).
The model achieved over 87% accuracy, showing strong performance on noisy real-world text.
Handling Mixed / Complex Sentences:
For sentences that contain mixed sentiments or multiple parts, we designed a scoring strategy:
The sentence is split into meaningful segments.
Each segment gets its own sentiment score.
The final sentiment is calculated based on the highest weighted score, giving a more accurate overall prediction.
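The segment-scoring idea above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the split rule (punctuation plus a few contrast words such as "bas"), the weighting by segment length, and the tiny lexicon-based `toy_classifier` standing in for the fine-tuned model are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

LABELS = ("positive", "negative", "neutral")

# Assumed split points: punctuation plus a few contrast words (hypothetical list).
_SPLIT_RE = re.compile(r"[.!?,;]+|\b(?:bas|lakin|but)\b", flags=re.IGNORECASE)


def split_segments(sentence: str) -> List[str]:
    """Split a sentence into non-empty, whitespace-trimmed segments."""
    return [p.strip() for p in _SPLIT_RE.split(sentence) if p and p.strip()]


def toy_classifier(segment: str) -> Dict[str, float]:
    """Stand-in for the fine-tuned model: returns one score per label.

    The word lists are illustrative only, not the project's vocabulary.
    """
    positive = {"7elw", "gamed", "tamam"}
    negative = {"we7esh", "zift"}
    tokens = segment.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    if pos > neg:
        return {"positive": 0.8, "negative": 0.1, "neutral": 0.1}
    if neg > pos:
        return {"positive": 0.1, "negative": 0.8, "neutral": 0.1}
    return {"positive": 0.1, "negative": 0.1, "neutral": 0.8}


def aggregate_sentiment(
    sentence: str,
    classify: Callable[[str], Dict[str, float]] = toy_classifier,
) -> str:
    """Score each segment, weight it, and return the top-scoring label."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in split_segments(sentence):
        weight = len(segment.split())  # assumed weight: number of tokens
        for label, score in classify(segment).items():
            totals[label] += weight * score
    if not totals:  # no usable segments, e.g. empty input
        return "neutral"
    return max(LABELS, key=lambda label: totals[label])


if __name__ == "__main__":
    print(aggregate_sentiment("el film gamed awi, bas el nehaya we7esh"))
```

In a real pipeline, `classify` would wrap the fine-tuned Transformer's per-label probabilities, and the weight could come from model confidence or segment position rather than token count.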
Technologies Used:
Transformers, LangChain, Gemini API, Python, NLP preprocessing, Fine-tuning pipelines
Outcome:
A robust sentiment analysis system capable of understanding Arabizi text with high accuracy, even in challenging real-world scenarios involving slang, noise, and mixed expressions.