About the Role
We are seeking an experienced AI Data Scientist with deep expertise in Large Language Models (LLMs) to lead efforts in model evaluation, tracing, optimization, and distillation. You will play a key role in enhancing the performance, reliability, and safety of generative AI systems across our consumer and B2B products, including football chatbots, scouting assistants, and fan-facing analytics platforms.
What You’ll Do
- Design robust evaluation frameworks for LLMs (e.g., hallucination detection, factual accuracy, prompt robustness, calibration).
- Use model tracing tools (e.g., OpenAI trace logs, LangSmith, Weights & Biases) to analyze model behavior and failure modes.
- Lead or assist with LLM distillation projects, compressing large foundation models into smaller, performant versions fine-tuned on football-specific domains.
- Create and test structured prompting strategies, build RAG pipelines, and implement safety/guardrails.
- Curate and synthesize football-specific data and knowledge graphs for fine-tuning, evaluation, and few-shot performance.
- Deploy dashboards and alerts to monitor drift, toxicity, bias, or degradation in real-world usage.
- Collaborate with engineers, designers, and product teams to bring LLM-powered features to life in production.
Qualifications
- 3+ years of experience in applied AI/ML roles, with at least 1–2 years focused on LLMs or foundation models.
- Deep understanding of transformer-based models, their limitations, and evaluation strategies.
- Proficiency with Python and key libraries: Hugging Face Transformers, LangChain, OpenAI API, PyTorch/TensorFlow.
- Experience with tracing/debugging LLM outputs using tools like LangSmith, W&B Traces, or custom logs.
- Experience with model distillation, quantization, or fine-tuning on domain-specific tasks.
- Strong communication skills with the ability to translate complex model behavior into actionable insights.
Preferred Skills
- Prior experience building or maintaining RAG pipelines (e.g., vector databases, retrieval logic, hybrid search).
- Knowledge of sports analytics or a passion for football.
- Familiarity with prompt evaluation benchmarks (e.g., TruthfulQA, MMLU, ARC) or custom evaluation harnesses.
- Experience with safety/guardrail frameworks (e.g., OpenAI Moderation, Guardrails.ai, Rebuff).