Built an Arabic-first AI agent platform where businesses describe what they need in Arabic or English and receive production-ready workflow agents; the platform ships with 2,800+ app integrations, 15 native GCC integrations, a visual workflow editor, and per-task AI model routing driven by dialect and domain performance data across 7 registered models.
Engineered a fully autonomous AI go-to-market (GTM) engine with 14 integrations and persistent cross-session memory, replacing the work of an entire GTM department: automated prospect research, personalized multi-channel outreach, content production across 4 platforms, and programmatic SEO and GEO for blog posts.
Designed and built a 1,507-item, production-grounded Arabic eval suite covering 7 dialects, 14 domains, and 25 models across 5 tracks (Agent Building, Tool Calls, Instruction Following, E2E Workflows, Core Domains), exposing 20-30% performance gaps in frontier models on Gulf Arabic that public leaderboards miss.
Engineered 755 proprietary adversarial edge cases requiring cultural empathy scoring, triple code-switching parsing (Najdi dialect + English finance terms + internet slang), and multi-step GCC API chain orchestration (Salla → Aramex → Moyasar → ZATCA → Taqnyat).
Built a three-layer evaluation pipeline that routes every response through a deterministic JSON diff scorer for strict schema validation, a CAMeLBERT dialect guardrail (trained on 25 Arab cities, 99% confidence) that rejects dialect fakes before LLM scoring, and a Gemini 3.1 Pro judge (#1 model for Arabic) for naturalness and task completion. Each benchmark run generates ~18,000 scored prompt-response pairs with full Langfuse observability and auto-exports high-scoring responses as SFT data and best-vs-worst model pairs as DPO training data sold directly to frontier AI labs.
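
A minimal sketch of how the three-layer flow above could be wired together, offered as an illustration rather than the production code. Every name here (`json_diff_score`, `evaluate`, `export`, `Scored`) is hypothetical. The dialect classifier and the LLM judge are passed in as plain callables standing in for CAMeLBERT and the judge model, so no real library API is assumed. Treating the 99% figure as a confidence cutoff, multiplying the schema and judge scores into one final score, and the 0.8 SFT cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple


def _leaves(node: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, Any]]:
    """Yield (path, value) for every scalar leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from _leaves(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from _leaves(value, path + (index,))
    else:
        yield path, node


def json_diff_score(expected: Any, actual: Any) -> float:
    """Layer 1: share of leaf paths on which both documents agree exactly.

    Missing and unexpected paths both count against the score.
    """
    exp, act = dict(_leaves(expected)), dict(_leaves(actual))
    paths = exp.keys() | act.keys()
    if not paths:
        return 1.0
    agree = sum(1 for p in paths if p in exp and p in act and exp[p] == act[p])
    return agree / len(paths)


@dataclass
class Scored:
    model: str
    prompt: str
    response: str
    schema_score: float
    judge_score: Optional[float]  # None: rejected by the dialect guardrail

    @property
    def final(self) -> float:
        # Assumed combination rule (not from the source): product of layers.
        if self.judge_score is None:
            return 0.0
        return self.schema_score * self.judge_score


def evaluate(model: str, prompt: str, response: str, parsed: Any, expected: Any,
             dialect: str,
             classify: Callable[[str], Tuple[str, float]],
             judge: Callable[[str, str], float],
             min_conf: float = 0.99) -> Scored:
    """Run one response through schema diff, dialect gate, then the judge."""
    schema = json_diff_score(expected, parsed)
    label, conf = classify(response)  # Layer 2: dialect guardrail
    if label != dialect or conf < min_conf:
        return Scored(model, prompt, response, schema, None)  # judge is skipped
    return Scored(model, prompt, response, schema, judge(prompt, response))  # Layer 3


def export(results: List[Scored], sft_min: float = 0.8):
    """Build SFT rows from high scorers and one best-vs-worst DPO pair per prompt."""
    sft = [{"prompt": r.prompt, "completion": r.response}
           for r in results if r.final >= sft_min]
    by_prompt: Dict[str, List[Scored]] = {}
    for r in results:
        by_prompt.setdefault(r.prompt, []).append(r)
    dpo = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda r: r.final)
        worst = min(group, key=lambda r: r.final)
        if best.final > worst.final:  # skip prompts where all models tie
            dpo.append({"prompt": prompt, "chosen": best.response,
                        "rejected": worst.response})
    return sft, dpo
```

Running the cheap deterministic check and the dialect gate first means the comparatively expensive judge call is only spent on responses that are already in the target dialect.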