Circle

Senior Data Scientist - AI Evaluation

Location: Remote (restrictions apply)
Salary Estimate: N/A
Seniority: Senior
Tech stacks: Testing, Python, SQL, +23 more
Permanent role · Posted 3 days ago

Do you have hands-on experience designing reliable evaluations for LLM/NLP features?

Do you enjoy turning messy product questions into clear study designs, metrics, and production-ready code?

About Our Team

Elsevier’s AI Evaluation team designs, builds, and operates NLP/LLM evaluation solutions used across multiple product lines. We partner with Product, Technology, Domain SMEs, and Governance to ensure our AI features are safe, effective, and continuously improving.

About The Role

As a Senior Data Scientist III, you will design and implement end-to-end evaluation studies and pipelines for AI products. You’ll translate product requirements into statistically sound test designs and metrics, build reproducible Python/SQL pipelines, run analyses and QC, and deliver concise readouts that drive roadmap decisions and risk mitigation. You’ll collaborate closely with SMEs, contribute to our shared evaluation libraries, and produce audit-ready documentation aligned with Responsible AI and governance expectations.

Responsibilities

  • Study design & metrics — Translate product questions into hypotheses, tasks/rubrics, datasets, and success criteria; define metrics (accuracy/correctness, groundedness, reliability, safety/bias/toxicity) with acceptance thresholds.
  • Pipelines & tooling — Build and maintain Python/SQL evaluation pipelines (data prep, prompt/rubric generation, LLM-as-judge with guardrails, scoring, QC, reporting); contribute to shared packages and CI. A minimal scoring sketch follows this list.
  • Statistical rigor — Plan for power, confidence intervals, inter-rater reliability (e.g., Cohen’s κ/ICC), calibration, and significance testing; document assumptions and limitations.
  • SME integration — Partner with SME Ops and domain leads to create clear rater guidance, run calibration, monitor IRR, and incorporate feedback loops.
  • Analytics & reporting — Create analyses that highlight regressions, safety risks, and improvement opportunities; deliver crisp write-ups and executive-level summaries.
  • Governance & compliance — Produce audit-ready artifacts (evaluation plans, datasheets/model cards, risk logs); follow privacy/security guardrails and Responsible AI practices.
  • Quality & reliability — Implement test hygiene (dataset/versioning, golden sets, seed control), observability, and failure analysis; help run post-release regression monitoring.
  • Collaboration — Work closely with Product and Engineering to scope, estimate, and land evaluation work; participate in code reviews and design sessions alongside fellow Data Scientists.
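
As a concrete flavor of the "LLM-as-judge with guardrails" bullet above, here is a minimal sketch in Python. Everything in it (the rubric wording, the 1-5 scale, the call_judge_model placeholder, and the vote count) is an illustrative assumption, not the team's actual pipeline:

    # Illustrative only: a guardrailed LLM-as-judge scoring step.
    # `call_judge_model` is a placeholder for any chat-completion client.
    import json
    import re
    from dataclasses import dataclass

    RUBRIC = (
        "Grade the answer for groundedness against the source on a 1-5 scale. "
        'Return JSON only: {"score": <int 1-5>, "rationale": "<one sentence>"}'
    )

    @dataclass
    class Judgment:
        score: int | None
        valid: bool

    def parse_judgment(raw: str) -> Judgment:
        """Guardrail: accept only well-formed JSON with an in-range score."""
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if not match:
            return Judgment(None, False)
        try:
            score = int(json.loads(match.group(0))["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            return Judgment(None, False)
        return Judgment(score, 1 <= score <= 5)

    def score_item(question, answer, source, call_judge_model, n_votes=3):
        """Take the median of n_votes judge calls to damp single-call noise."""
        votes = []
        for _ in range(n_votes):
            judgment = parse_judgment(
                call_judge_model(RUBRIC, question, answer, source)
            )
            if judgment.valid:
                votes.append(judgment.score)
        if not votes:
            return None  # route to human review rather than guessing
        return sorted(votes)[len(votes) // 2]

The point of the sketch is the parse-validate-vote guardrail pattern; a real pipeline would wrap it in the dataset versioning, QC, and reporting machinery the bullet describes.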

Requirements

  • Education/Experience: Master’s + 3 years, or Bachelor’s + 5 years, in CS, Data Science, Statistics, Computational Linguistics, or related field; strong track record shipping evaluation or ML analytics work.
  • Technical: Strong Python and SQL; experience with LLM/NLP evaluation, data/versioning, testing/CI, and cloud-based workflows; familiarity with prompt/rubric design and LLM-as-judge patterns.
  • Statistics: Comfortable with power analysis, CIs, hypothesis testing, inter-rater reliability, and error/slice analysis; a small worked κ example follows this list.
  • Practices: Git, code reviews, reproducibility, documentation; ability to turn ambiguous product needs into executable study plans.
  • Communication: Clear written/oral communication; ability to produce crisp dashboards and decision-ready summaries for non-technical stakeholders.
  • Mindset: Ownership, curiosity, bias-for-action, and collaborative ways of working.
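
To make the inter-rater reliability point concrete: Cohen's κ measures agreement between two raters beyond what chance alone would produce. The implementation and the pass/fail labels below are hypothetical illustration, not project code:

    # Hypothetical illustration: Cohen's kappa for two raters on the same items.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """kappa = (p_o - p_e) / (1 - p_e): agreement beyond chance."""
        assert rater_a and len(rater_a) == len(rater_b)
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        # expected agreement if raters labeled independently at their marginal rates
        p_e = sum(counts_a[l] * counts_b[l] / n**2 for l in set(rater_a) | set(rater_b))
        return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

    # Two raters grading ten answers as pass/fail (invented data):
    a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
    print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.47: moderate agreement

Note that raw agreement in this invented sample is 0.80 while κ is only 0.47; that gap is why the role asks for chance-corrected reliability rather than raw percent agreement.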

Nice to have

  • Experience evaluating retrieval-augmented or agentic systems, or measuring safety/bias/toxicity.
  • Familiarity with lightweight orchestration (e.g., Airflow/Prefect) and containerization basics.
  • Exposure to healthcare or education content or working with clinician/academic SMEs.

About Circle

Company size: 5,001-10,000
Headquarters: Amsterdam, North Holland
