About the Role

We are seeking an experienced AI Data Scientist with deep expertise in Large Language Models (LLMs) to lead efforts in model evaluation, tracing, optimization, and distillation. You will play a key role in enhancing the performance, reliability, and safety of generative AI systems across our consumer and B2B products, including football chatbots, scouting assistants, and fan-facing analytics platforms.

What You’ll Do

Design robust evaluation frameworks for LLMs (e.g., hallucination detection, factual accuracy, prompt robustness, calibration).
Use model tracing tools (e.g., OpenAI trace logs, LangSmith, Weights & Biases) to analyze model behavior and failure modes.
Lead or assist with LLM distillation projects, compressing large foundation models into smaller, performant versions fine-tuned on football-specific domains.
Create and test structured prompting strategies, build RAG pipelines, and implement safety guardrails.
Curate and synthesize football-specific data and knowledge graphs for fine-tuning, evaluation, and few-shot performance.
Deploy dashboards and alerts to monitor drift, toxicity, bias, or degradation in real-world usage.
Collaborate with engineers, designers, and product teams to bring LLM-powered features to life in production.

Qualifications

3+ years of experience in applied AI/ML roles, with at least 1–2 years focused on LLMs or foundation models.
Deep understanding of transformer-based models, their limitations, and evaluation strategies.
Proficiency with Python and key libraries: Hugging Face Transformers, LangChain, OpenAI API, PyTorch/TensorFlow.
Experience with tracing/debugging LLM outputs using tools like LangSmith, W&B Traces, or custom logs.
Experience with model distillation, quantization, or fine-tuning on domain-specific tasks.
Strong communication skills with the ability to translate complex model behavior into actionable insights.

Preferred Skills

Prior experience building or maintaining RAG pipelines (e.g., vector databases, retrieval logic, hybrid search).
Knowledge of sports analytics or a passion for football.
Familiarity with LLM evaluation benchmarks (e.g., TruthfulQA, MMLU, ARC) or custom evaluation harnesses.
Experience with safety/guardrail frameworks (e.g., OpenAI Moderation, Guardrails.ai, Rebuff).