Our client is looking for a Reliability Data Scientist to design evaluation scenarios, datasets, and metrics that reveal real risks in production AI. This role sits at the intersection of data science, evaluation design, and AI monitoring—supporting reliability dashboards, weekly reports, and client triage workflows.
Key Responsibilities:
- Design evaluation scenarios and metric frameworks to assess AI quality, suitability, reliability, and context-dependent behavior.
- Build and maintain evaluation assets including datasets, golden traces, error taxonomies, and automated scoring/aggregation pipelines in partnership with engineering.
- Develop and manage weekly reliability dashboards and automated reports, translating monitoring data into clear insights.
- Analyze evaluation results to detect drift, outliers, context-driven failures, and calibration issues—validating evaluator reliability against human judgments.
- Document test logic, metric definitions, and interpretation guidance, and support context-engineering workflows with metrics for predictability, observability, and directability.
The ideal candidate has 3–6 years of experience and brings:
- Strong Python + SQL + data-wrangling skills
- Hands-on experience with evaluation design, sampling, and calibration
- Comfort with dashboards (Grafana, Power BI, or similar)
- Experience building golden datasets and structured evaluation traces
- Exposure to LLM or AI system evaluation (preferred)
- Experience in regulated industries (audit, finance, healthcare) is a plus
- Excellent communication — ability to turn technical data into decision-ready insights
Start Date: December 2025
Duration: 4–6 months
Time Commitment: ~20 hours/week
Location: Remote (U.S.-based)
Expected rate: US$100–$120 per hour
Project ID#: 8021
**This is a contract role and does not offer health benefits.**