Senior Software Engineer (Generative AI Cloud Infrastructure) - Perm - US/UK/Europe

Location

Remote restrictions apply

Salary

US$120K - 200K

Min. experience

5+ years

Required skills

Golang, Microservices, Kubernetes, System design, Large-scale distributed systems, Terraform, Ansible, CI/CD

Full-time role

Posted 2 months ago

Full-Time · Remote or Hybrid · Founding Team Opportunity

About Us

We are building a Gen AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle. Our focus is to deliver blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.

Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets

We are looking for a Senior Software Engineer to design, build, and scale the core systems behind our AI cloud.

What You’ll Work On

High-Performance AI Cloud Infrastructure
● Design and maintain fault-tolerant, high-availability backend services running across global data centers.
● Build operators and automation systems for:
○ GPU management
○ InfiniBand partitioning
○ VM provisioning
○ High-throughput storage provisioning

LLM & GPU Virtualization Platform
● Build the IaaS software layer for new GPU clusters with thousands of next-gen accelerators (H100, GB200, GB300).
● Work on scalable GPU virtualization (PCIe passthrough, MIG, SR-IOV, VFIO).

Massive-Scale Storage & Data Systems
● Contribute to a global multi-exabyte, high-performance object store optimized for pretraining datasets.
● Build distributed data loaders, caching layers, metadata services, and throughput-optimized pipelines.

Observability, Reliability & Automation
● Develop advanced observability stacks (Prometheus, Grafana, OpenTelemetry).
● Design automated node lifecycle management for large-scale distributed training and inference.
● Build robust testing frameworks for resiliency, failover, and fault tolerance.

Core Platform Engineering
● Contribute to the core internal + open-source platform components.
● Write tooling, SDKs, and documentation for developer-facing services.
● Research decentralized AI workloads and build reference architectures.

Requirements

Fundamentals
● 5+ years of production software engineering experience.
● Strong proficiency in one or more backend languages (Golang highly preferred; Rust/Python also valued).
● 5+ years building high-performance, well-tested, production-grade distributed services.

Cloud & Systems Experience
● Experience with distributed microservices across AWS/GCP/Azure.
● Deep understanding of systems fundamentals:
○ Concurrency
○ Memory management
○ High-performance I/O
○ Distributed consensus
○ Large-scale system design

Kubernetes / Infrastructure Expertise (Big Plus)
● Kubernetes internals: custom operators, CRDs, schedulers, or networking/storage plugins.
● Experience with Cluster API, KubeVirt, or similar orchestration tooling.

Virtualization / Compute (Big Plus)
● Experience with hypervisors (QEMU/KVM, cloud-hypervisor).
● PCIe passthrough, SR-IOV, GPU virtualization, MIG, NVLink topologies.
● Experience with DPUs/SmartNICs.

Networking (Big Plus)
● InfiniBand / RDMA
● VLAN/VXLAN/VPC
● OVS/OVN
● High-performance DC networking

High-Performance Compute (Plus)
● CUDA, NCCL, GPU drivers, parallel training stacks
● Experience with GPU scheduling, workloads, and distributed ML

Infrastructure Automation & Tooling (Expected)
● Terraform, Ansible, CI/CD
● GitHub Actions, ArgoCD
● Prometheus, Grafana, ELK, OpenTelemetry

Preferred Experience

● Built or operated IaaS/PaaS systems
● Experience with large-scale storage systems (Ceph, Lustre, or custom object stores)
● Knowledge of vLLM, TensorRT-LLM, TGI, or other LLM-serving frameworks
● Experience building infra for ML, training, inference, or fine-tuning

Responsibilities

● Perform architecture & research for distributed and decentralized AI workloads.
● Build and maintain foundational infrastructure powering training, inference, and fine-tuning.
● Contribute to core, open-source platform components.
● Own end-to-end services from design → implementation → operations.
● Create testing frameworks for robustness, failover, and performance.
● Collaborate across hardware, product, and ML teams to design next-gen infra.

Who You Are

● A deeply technical engineer who thrives in complex systems work.
● Strong communicator who writes clear design docs.
● Curious, low-ego, and great at collaborating with cross-functional teams.
● Motivated by building world-class AI infrastructure from the ground up.
● Thrives in zero-to-one, fast-moving startup environments.