Turing.com

AI Benchmark Software Engineer

Location

Remote restrictions apply

Salary Estimate

N/A

Seniority

N/A

Tech stacks

AI
Software Development
Docker

Contract role
2 days ago

About Turing:

Turing is one of the world’s fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. Turing helps customers in two ways: working with the world’s leading AI labs to advance frontier model capabilities in thinking, reasoning, coding, agentic behavior, multimodality, multilinguality, STEM, and frontier knowledge; and leveraging that work to build real-world AI systems that solve mission-critical priorities for companies.

Role Overview:

We are looking for experienced software engineers (Code / SWE) to design and build high-quality multi-agent benchmark tasks based on real-world software engineering workflows.

In this role, you will create tasks grounded in real open-source code changes such as bug fixes, migrations, and refactors. These tasks are used to evaluate how effectively AI agents can understand large codebases, apply precise modifications, and produce correct, testable outputs.

You will work within a structured evaluation framework (Harbor), define clear task instructions, design verification logic, and decompose complex engineering problems across multiple specialized agents.

What the day-to-day looks like:

  • Build multi-agent benchmark tasks based on real-world open-source code changes (bug fixes, migrations, refactors)
  • Work with the Harbor evaluation framework to run and validate tasks inside Docker environments
  • Write clear, precise task instructions specifying file paths, function signatures, expected behavior, and constraints
  • Design and implement Python-based verification scripts to validate correctness of agent-generated code changes
  • Create decomposition strategies that split complex code changes across multiple independent sub-agents
  • Run, debug, and refine tasks within containerized environments to ensure reproducibility and determinism
  • Evaluate task performance signals and improve task quality, clarity, and difficulty
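
As an illustration of the Python-based verification scripts described above, here is a minimal sketch of one deterministic check. It is not taken from Harbor or any Turing tooling, and it assumes nothing about Harbor's API. The names `verify_function_signature`, `utils.py`, and `slugify` are hypothetical and chosen only for the example. The check confirms that a file exists in the agent's working copy, that it parses, and that a named function has the argument list the task instructions specified.

```python
import ast
import tempfile
from pathlib import Path


def verify_function_signature(repo_dir, rel_path, func_name, expected_args):
    """Return (passed, message) for one deterministic check.

    Hypothetical helper: parses the file statically instead of importing it,
    so agent-written code is never executed by this check.
    """
    target = Path(repo_dir) / rel_path
    if not target.is_file():
        return False, f"missing file: {rel_path}"
    try:
        tree = ast.parse(target.read_text(encoding="utf-8"))
    except SyntaxError as exc:
        return False, f"syntax error in {rel_path}: {exc.msg}"
    # ast.walk also visits nested functions; the first name match is used.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            actual = [arg.arg for arg in node.args.args]
            if actual == list(expected_args):
                return True, "ok"
            return False, f"{func_name} has args {actual}, expected {list(expected_args)}"
    return False, f"function not found: {func_name}"


if __name__ == "__main__":
    # Simulate an agent's output in a throwaway directory (hypothetical file).
    with tempfile.TemporaryDirectory() as repo:
        source = "def slugify(text, sep):\n    return sep.join(text.split())\n"
        (Path(repo) / "utils.py").write_text(source, encoding="utf-8")
        print(verify_function_signature(repo, "utils.py", "slugify", ["text", "sep"]))
```

A real task would likely pair a static check like this with behavioral tests (for example, pytest cases run inside the task's Docker container), since a matching signature alone does not show that the change is correct.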

Requirements:

  • 5+ years of experience in Python and JavaScript development
  • Experience with AI coding benchmarks (e.g., SWE-bench, Terminal-Bench)
  • Strong experience reading and navigating large open-source codebases (e.g., Django, Flask, FastAPI, Node.js, or similar)
  • Familiarity with Git workflows, including pull requests, diffs, cherry-picking, and working with specific commits
  • Comfortable working with Docker (writing Dockerfiles, building images, debugging container issues)
  • Experience writing test scripts (pytest, unittest, or custom assertion-based testing)
  • Ability to write clear, precise, and unambiguous technical specifications

Perks of Freelancing With Turing:

  • Work on cutting-edge AI projects with leading foundation model companies
  • Collaborate on high-impact work at the frontier of LLM evaluation and reasoning
  • Remote, flexible opportunities with global teams

Offer Details:

  • Commitments Required: 8 hours per day with a 4-hour overlap with PST.
  • Employment Type: Contractor position (Note: this role does not include medical/paid leave).
  • Duration of Contract: 4 weeks; expected start date is next week.

About Turing.com

👥201-500
📍Palo Alto, California, United States

How does Turing.com work?

Turing.com allows U.S. and Silicon Valley companies to hire senior, pre-vetted remote developers who have strong technical and communication skills.
