Arc Exclusive

Senior Software Engineer, Infrastructure/Networking (GPU) - North America/EU

Location

Remote restrictions apply
See all remote locations

Salary

US$100K - 180K

Min. experience

5+ years

Required skills

Python, CUDA, Kubernetes, AI, AWS

Full-time role
Posted 9 hours ago


Company Description

Odyn Network is at the forefront of artificial intelligence innovation, building transformative AI solutions that demand cutting-edge, high-performance infrastructure. Our mission is to accelerate AI development through scalable, efficient, and reliable systems. Join us to shape the future of AI infrastructure and power groundbreaking machine learning workloads.

Role Description

As a Senior Engineer, you will be the technical owner of how GPU resources are scheduled, shared, and scaled across Generative AI workloads. Your expertise will directly drive faster experiments, higher model throughput, and significant cost savings per training run. If you’re passionate about transforming heterogeneous GPU fleets into a unified, high-efficiency “supercomputer,” this role is your opportunity to make a massive impact.

Included Responsibilities

  • Orchestrate GPU Clusters: Design, implement, load balance, and manage multi-tenant GPU clusters (on-premises, cloud, or hybrid) using Kubernetes, Slurm, or similar platforms, ensuring high utilization, fairness, and reliability.
  • Optimize Resource Placement and Sharing: Develop topology-aware schedulers and plugins (RoCE, NUMA, NVLink, PCIe, InfiniBand) leveraging MIG/MPS, preemption, quotas, and bin-packing strategies to achieve effective GPU utilization.
  • Automate Capacity and Autoscaling: Build workload-aware autoscaling systems for training and inference workloads (using tools like Ray, Run:AI, Volcano, or KubeFlow), integrating spot/preemptible strategies with checkpointing and graceful eviction.
  • Enhance Observability and SLOs: Implement deep telemetry for GPUs, network fabric, and jobs using Prometheus, Grafana, or OpenTelemetry. Define and monitor SLOs (e.g., queue time, runtime variance, failure rate) and create actionable dashboards and alerts.
  • Maximize Throughput and Cost Efficiency: Profile and optimize NCCL/CUDA/ROCm, GPUDirect RDMA, RoCEv2, and InfiniBand fabric parameters to minimize idle time and fragmentation. Model and report cost per GPU-hour and per training step to drive efficiency.
  • Optimize Storage and Data Paths: Collaborate on high-throughput I/O systems (Lustre, BeeGFS, Ceph, S3, NVMeoF, Alluxio caching) and dataset prefetching/checkpoint pipelines to ensure GPUs remain fully utilized.
  • Build Platform Glue: Develop Kubernetes operators, controllers, and admission webhooks; enforce multi-tenancy through RBAC, network policies, and quotas; and integrate with CI/CD pipelines (GitHub Actions, Argo CD) and secrets management (Vault).
  • Partner with ML Teams: Translate AI model requirements (e.g., DLRM, LLM pretraining/finetuning, diffusion, retrieval) into optimized cluster policies, instance configurations, and job templates that deliver seamless performance.
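The bin-packing strategies mentioned above can be illustrated with a minimal first-fit-decreasing placement of GPU requests onto nodes. This is a sketch only, with made-up job sizes; a production scheduler would also weigh topology, fairness, quotas, and preemption:

```python
def place_jobs(jobs, node_capacity, num_nodes):
    """First-fit-decreasing sketch of GPU bin-packing.

    jobs: list of per-job GPU counts. Returns {job_index: node_index or None}.
    """
    free = [node_capacity] * num_nodes
    placement = {}
    # Place largest jobs first so they are less likely to be fragmented out.
    for job_id, gpus in sorted(enumerate(jobs), key=lambda kv: -kv[1]):
        for node, avail in enumerate(free):
            if avail >= gpus:
                free[node] -= gpus
                placement[job_id] = node
                break
        else:
            placement[job_id] = None  # would queue or trigger preemption in practice
    return placement

if __name__ == "__main__":
    # Six jobs onto two hypothetical 8-GPU nodes.
    print(place_jobs([4, 2, 8, 1, 3, 2], node_capacity=8, num_nodes=2))
```

Even this toy version shows why largest-first ordering matters: placing the 8-GPU job first keeps one node whole instead of fragmenting both.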

Qualification Requirements

  • 2+ years of experience building and operating distributed infrastructure, with a focus on compute/accelerator fleets and managing production GPU clusters.
  • Deep experience with Kubernetes (device plugins, operators, CRDs) and/or Slurm (partitions, QoS, fair-share).
  • Experience with Ray, Run:AI, or Volcano.
  • Systems knowledge, including Linux internals (cgroups v2, eBPF basics), networking (ECN, pacing), and container runtimes (Docker, CRI-O, containerd).
  • Hands-on experience with CUDA/NCCL or ROCm/RCCL, MIG/MPS, topology-aware scheduling, and high-speed interconnects (InfiniBand HDR/NDR, RoCEv2, 100GbE and above).
  • Proficiency in Python (or Go/C++) for automation and tooling, plus infrastructure-as-code (Terraform, Helm, Ansible).
  • Experience with AI workload storage systems (Lustre, BeeGFS, Ceph, S3) and checkpointing strategies.
  • Data-driven mindset, with a track record of building utilization/queue-time dashboards, running A/B tests on schedulers, and delivering measurable performance gains.
  • Communication: Ability to collaborate with cross-functional teams, translating complex ML workload needs into robust infrastructure solutions.
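The utilization and cost reporting described above (cost per GPU-hour, utilization dashboards) can be illustrated with a back-of-the-envelope helper. All figures below are hypothetical, and real FinOps models would also account for power, networking, and amortized capital costs:

```python
def cost_per_gpu_hour(monthly_cost, gpus, hours_in_month, utilization):
    """Effective cost per *utilized* GPU-hour (hypothetical inputs).

    Dividing by utilization shows how idle GPUs inflate the real unit cost.
    """
    return monthly_cost / (gpus * hours_in_month * utilization)

if __name__ == "__main__":
    # Hypothetical fleet: $200K/month, 64 GPUs, 720 hours in the month.
    low = cost_per_gpu_hour(200_000, gpus=64, hours_in_month=720, utilization=0.6)
    high = cost_per_gpu_hour(200_000, gpus=64, hours_in_month=720, utilization=0.9)
    print(f"at 60% utilization: ${low:.2f}/GPU-hr; at 90%: ${high:.2f}/GPU-hr")
```

The point of the metric: raising utilization from 60% to 90% cuts the effective cost per utilized GPU-hour by a third without buying any hardware.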

Preferred Qualifications

  • Experience managing H100/A100, L40S, or MI300 GPU fleets and planning NVLink/NVSwitch configurations.
  • Expertise in inference serving at scale (e.g., Triton, KServe), tokenizer offloading, or KV-cache sharding.
  • Familiarity with cost modeling and FinOps for hybrid GPU fleets, including purchase vs. lease vs. cloud strategies.
  • Knowledge of multi-tenant cluster security (Pod Security, SELinux/AppArmor, image signing, network policies).
  • Understanding of queueing theory, bin-packing heuristics, or simulation tools (e.g., SimPy) for policy design.
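To give a flavor of the queueing-theory angle above, here is a small deterministic discrete-event sketch (plain Python rather than SimPy, with hypothetical job sizes) that estimates mean queue time for FIFO jobs sharing a fixed GPU pool:

```python
import heapq

def mean_queue_time(arrivals, durations, gpus_needed, pool):
    """FIFO sketch: each job waits until enough GPUs are free.

    arrivals must be sorted ascending. Returns mean wait (queue time) per job.
    """
    free = pool
    running = []  # min-heap of (finish_time, gpus_held)
    total_wait = 0.0
    for arrive, dur, gpus in zip(arrivals, durations, gpus_needed):
        now = arrive
        while True:
            # Release any jobs that have finished by `now`.
            while running and running[0][0] <= now:
                _, done_gpus = heapq.heappop(running)
                free += done_gpus
            if free >= gpus:
                break
            # Not enough GPUs: advance time to the next completion.
            now = running[0][0]
        total_wait += now - arrive
        free -= gpus
        heapq.heappush(running, (now + dur, gpus))
    return total_wait / len(arrivals)
```

For example, with an 8-GPU pool, two 4-GPU jobs arriving at t=0 (each running 10 units) saturate the pool, so a third 4-GPU job arriving at t=1 waits 9 units for the pool to drain. Sweeping sketches like this against real queue-time telemetry is one way to evaluate candidate scheduling policies before an A/B test.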

What We Offer

  • Competitive salary and compensation packages.
  • Flexible work arrangements, including remote options.
  • Opportunities to work with cutting-edge AI technologies and collaborate with world-class AI researchers and engineers.

Ready to build the infrastructure that powers the next generation of cloud-native generative AI? Please submit your resume and a cover letter highlighting your experience with GPU clusters and resource optimization. We are an equal opportunity employer. We value diversity and are committed to fostering an inclusive workplace for all.
