Arc Exclusive

Senior Software Engineer, Infrastructure/Networking (GPU) - North America/EU

Location

Remote restrictions apply
See all remote locations

Salary

US$100K - 180K

Min. experience

5+ years

Required skills

Python, CUDA, Kubernetes, AI, AWS

Full-time role
Posted 9 hours ago


Company Description

Odyn Network is at the forefront of artificial intelligence innovation, building transformative AI solutions that demand cutting-edge, high-performance infrastructure. Our mission is to accelerate AI development through scalable, efficient, and reliable systems. Join us to shape the future of AI infrastructure and power groundbreaking machine learning workloads.

Role Description

As a Senior Engineer, you will be the technical owner of how GPU resources are scheduled, shared, and scaled across Generative AI workloads. Your expertise will directly drive faster experiments, higher model throughput, and significant cost savings per training run. If you’re passionate about transforming heterogeneous GPU fleets into a unified, high-efficiency “supercomputer,” this role is your opportunity to make a massive impact.

Included Responsibilities

  • Orchestrate GPU Clusters: Design, implement, load balance, and manage multi-tenant GPU clusters (on-premises, cloud, or hybrid) using Kubernetes, Slurm, or similar platforms, ensuring high utilization, fairness, and reliability.
  • Optimize Resource Placement and Sharing: Develop topology-aware schedulers and plugins (RoCE, NUMA, NVLink, PCIe, InfiniBand) leveraging MIG/MPS, preemption, quotas, and bin-packing strategies to achieve effective GPU utilization.
  • Automate Capacity and Autoscaling: Build workload-aware autoscaling systems for training and inference workloads (using tools like Ray, Run:AI, Volcano, or KubeFlow), integrating spot/preemptible strategies with checkpointing and graceful eviction.
  • Enhance Observability and SLOs: Implement deep telemetry for GPUs, network fabric, and jobs using Prometheus, Grafana, or OpenTelemetry. Define and monitor SLOs (e.g., queue time, runtime variance, failure rate) and create actionable dashboards and alerts.
  • Maximize Throughput and Cost Efficiency: Profile and optimize NCCL/CUDA/ROCm, GPUDirect RDMA, RoCEv2, and InfiniBand fabric parameters to minimize idle time and fragmentation. Model and report cost per GPU-hour and per training step to drive efficiency.
  • Optimize Storage and Data Paths: Collaborate on high-throughput I/O systems (Lustre, BeeGFS, Ceph, S3, NVMeoF, Alluxio caching) and dataset prefetching/checkpoint pipelines to ensure GPUs remain fully utilized.
  • Build Platform Glue: Develop Kubernetes operators, controllers, and admission webhooks; enforce multi-tenancy through RBAC, network policies, and quotas; and integrate with CI/CD pipelines (GitHub Actions, Argo CD) and secrets management (Vault).
  • Partner with ML Teams: Translate AI model requirements (e.g., DLRM, LLM pretraining/finetuning, diffusion, retrieval) into optimized cluster policies, instance configurations, and job templates that deliver seamless performance.
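The bin-packing strategies mentioned above can be illustrated with a minimal first-fit-decreasing placement of GPU requests onto nodes. This is a sketch only, with made-up job sizes; a production scheduler would also weigh topology, fairness, quotas, and preemption:

```python
def place_jobs(jobs, node_capacity, num_nodes):
    """First-fit-decreasing sketch of GPU bin-packing.

    jobs: list of per-job GPU counts. Returns {job_index: node_index or None}.
    """
    free = [node_capacity] * num_nodes
    placement = {}
    # Place largest jobs first so they are less likely to be fragmented out.
    for job_id, gpus in sorted(enumerate(jobs), key=lambda kv: -kv[1]):
        for node, avail in enumerate(free):
            if avail >= gpus:
                free[node] -= gpus
                placement[job_id] = node
                break
        else:
            placement[job_id] = None  # would queue or trigger preemption in practice
    return placement

if __name__ == "__main__":
    # Six jobs onto two hypothetical 8-GPU nodes.
    print(place_jobs([4, 2, 8, 1, 3, 2], node_capacity=8, num_nodes=2))
```

Even this toy version shows why largest-first ordering matters: placing the 8-GPU job first keeps one node whole instead of fragmenting both.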

Qualification Requirements

  • 2+ years of experience building and operating distributed infrastructure, with a focus on compute/accelerator fleets and managing production GPU clusters.
  • Deep experience with Kubernetes (device plugins, operators, CRDs) and/or Slurm (partitions, QoS, fair-share).
  • Experience with Ray, Run:AI, or Volcano.
  • Systems knowledge, including Linux internals (cgroups v2, eBPF basics), networking (ECN, pacing), and container runtimes (Docker, CRI-O, containerd).
  • Hands-on experience with CUDA/NCCL or ROCm/RCCL, MIG/MPS, topology-aware scheduling, and high-speed interconnects (InfiniBand HDR/NDR, RoCEv2, 100GbE and above).
  • Proficiency in Python (or Go/C++) for automation and tooling, plus infrastructure-as-code (Terraform, Helm, Ansible).
  • Experience with AI workload storage systems (Lustre, BeeGFS, Ceph, S3) and checkpointing strategies.
  • Data-driven mindset, with a track record of building utilization/queue-time dashboards, running A/B tests on schedulers, and delivering measurable performance gains.
  • Communication: Ability to collaborate with cross-functional teams, translating complex ML workload needs into robust infrastructure solutions.
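The utilization and cost reporting described above (cost per GPU-hour, utilization dashboards) can be illustrated with a back-of-the-envelope helper. All figures below are hypothetical, and real FinOps models would also account for power, networking, and amortized capital costs:

```python
def cost_per_gpu_hour(monthly_cost, gpus, hours_in_month, utilization):
    """Effective cost per *utilized* GPU-hour (hypothetical inputs).

    Dividing by utilization shows how idle GPUs inflate the real unit cost.
    """
    return monthly_cost / (gpus * hours_in_month * utilization)

if __name__ == "__main__":
    # Hypothetical fleet: $200K/month, 64 GPUs, 720 hours in the month.
    low = cost_per_gpu_hour(200_000, gpus=64, hours_in_month=720, utilization=0.6)
    high = cost_per_gpu_hour(200_000, gpus=64, hours_in_month=720, utilization=0.9)
    print(f"at 60% utilization: ${low:.2f}/GPU-hr; at 90%: ${high:.2f}/GPU-hr")
```

The point of the metric: raising utilization from 60% to 90% cuts the effective cost per utilized GPU-hour by a third without buying any hardware.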

Preferred Qualifications

  • Experience managing H100/A100, L40S, or MI300 GPU fleets and planning NVLink/NVSwitch configurations.
  • Expertise in inference serving at scale (e.g., Triton, KServe), tokenizer offloading, or KV-cache sharding.
  • Familiarity with cost modeling and FinOps for hybrid GPU fleets, including purchase vs. lease vs. cloud strategies.
  • Knowledge of multi-tenant cluster security (Pod Security, SELinux/AppArmor, image signing, network policies).
  • Understanding of queueing theory, bin-packing heuristics, or simulation tools (e.g., SimPy) for policy design.
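To give a flavor of the queueing-theory angle above, here is a small deterministic discrete-event sketch (plain Python rather than SimPy, with hypothetical job sizes) that estimates mean queue time for FIFO jobs sharing a fixed GPU pool:

```python
import heapq

def mean_queue_time(arrivals, durations, gpus_needed, pool):
    """FIFO sketch: each job waits until enough GPUs are free.

    arrivals must be sorted ascending. Returns mean wait (queue time) per job.
    """
    free = pool
    running = []  # min-heap of (finish_time, gpus_held)
    total_wait = 0.0
    for arrive, dur, gpus in zip(arrivals, durations, gpus_needed):
        now = arrive
        while True:
            # Release any jobs that have finished by `now`.
            while running and running[0][0] <= now:
                _, done_gpus = heapq.heappop(running)
                free += done_gpus
            if free >= gpus:
                break
            # Not enough GPUs: advance time to the next completion.
            now = running[0][0]
        total_wait += now - arrive
        free -= gpus
        heapq.heappush(running, (now + dur, gpus))
    return total_wait / len(arrivals)
```

For example, with an 8-GPU pool, two 4-GPU jobs arriving at t=0 (each running 10 units) saturate the pool, so a third 4-GPU job arriving at t=1 waits 9 units for the pool to drain. Sweeping sketches like this against real queue-time telemetry is one way to evaluate candidate scheduling policies before an A/B test.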

What We Offer

  • Competitive salary and compensation packages.
  • Flexible work arrangements, including remote options.
  • Opportunities to work with cutting-edge AI technologies and collaborate with world-class AI researchers and engineers.

Ready to build the infrastructure that powers the next generation of cloud-native generative AI? Please submit your resume and a cover letter highlighting your experience with GPU clusters and resource optimization. We are an equal opportunity employer. We value diversity and are committed to fostering an inclusive workplace for all.
