Our client is looking for a Reliability Data Scientist to design evaluation scenarios, datasets, and metrics that reveal real risks in production AI. This role sits at the intersection of data science, evaluation design, and AI monitoring—supporting reliability dashboards, weekly reports, and client triage workflows.
Key Responsibilities:
- Design evaluation scenarios and metric frameworks to assess AI quality, suitability, reliability, and context-dependent behavior.
- Build and maintain evaluation assets including datasets, golden traces, error taxonomies, and automated scoring/aggregation pipelines in partnership with engineering.
- Develop and manage weekly reliability dashboards and automated reports, translating monitoring data into clear insights.
- Analyze evaluation results to detect drift, outliers, context-driven failures, and calibration issues—validating evaluator reliability against human judgments.
- Document test logic, metric definitions, and interpretation guidance, and support context-engineering workflows with metrics for predictability, observability, and directability.
The ideal candidate has 3–6 years of experience and brings:
- Strong Python + SQL + data-wrangling skills
- Hands-on experience with evaluation design, sampling, and calibration
- Comfort with dashboards (Grafana, Power BI, or similar)
- Experience building golden datasets and structured evaluation traces
- Exposure to LLM or AI system evaluation (preferred)
- Experience in regulated industries (audit, finance, healthcare) is a plus
- Excellent communication — ability to turn technical data into decision-ready insights
Start Date: December 2025
Duration: 4–6 months
Time Commitment: ~20 hours/week
Location: Remote (U.S.-based)
Expected rate: US$100–$120 per hour
Project ID#: 8021
**This is a contract role and does not offer health benefits.**