[Open to candidates based in the UK / US and Western Europe]
Full-Time · Remote OR Hybrid · Founding Team Opportunity
About Us
We are building a Gen AI Acceleration Cloud: an end-to-end platform for the full generative AI lifecycle. Our focus is delivering blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.
Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets
We are looking for a Senior Software Engineer to design, build, and scale the core systems behind our AI cloud.
What You’ll Work On
High-Performance AI Cloud Infrastructure
● Design and maintain fault-tolerant, high-availability backend services running across global data centers.
● Build operators and automation systems for:
○ GPU management
○ Infiniband partitioning
○ VM provisioning
○ High-throughput storage provisioning
LLM & GPU Virtualization Platform
● Build the IaaS software layer for new GPU clusters with thousands of next-gen accelerators (H100, GB200, GB300).
● Work on scalable GPU virtualization (PCIe passthrough, MIG, SR-IOV, VFIO).
Massive-Scale Storage & Data Systems
● Contribute to a global multi-exabyte, high-performance object store optimized for pretraining datasets.
● Build distributed data loaders, caching layers, metadata services, and throughput-optimized pipelines.
Observability, Reliability & Automation
● Develop advanced observability stacks (Prometheus, Grafana, OpenTelemetry).
● Design automated node lifecycle management for large-scale distributed training and inference.
● Build robust testing frameworks for resiliency, failover, and fault tolerance.
Core Platform Engineering
● Contribute to the core internal + open-source platform components.
● Write tooling, SDKs, and documentation for developer-facing services.
● Research decentralized AI workloads and build reference architectures.
Requirements
Fundamentals
● 5+ years of production software engineering experience.
● Strong proficiency in one or more backend languages (Golang highly preferred; Rust/Python also valued).
● 5+ years building high-performance, well-tested, production-grade distributed services.
Cloud & Systems Experience
● Experience with distributed microservices across AWS/GCP/Azure.
● Deep understanding of systems fundamentals:
○ Concurrency
○ Memory management
○ High-performance I/O
○ Distributed consensus
○ Large-scale system design
Kubernetes / Infrastructure Expertise (Big Plus)
● Kubernetes internals: custom operators, CRDs, schedulers, or networking/storage plugins.
● Experience with Cluster API, KubeVirt, or similar orchestration tooling.
Virtualization / Compute (Big Plus)
● Experience with hypervisors (QEMU/KVM, cloud-hypervisor).
● PCIe passthrough, SR-IOV, GPU virtualization, MIG, NVLink topologies.
● Experience with DPUs/SmartNICs.
Networking (Big Plus)
● Infiniband / RDMA
● VLAN/VXLAN/VPC
● OVS/OVN
● High-performance DC networking
High-Performance Compute (Plus)
● CUDA, NCCL, GPU drivers, parallel training stacks
● Experience with GPU scheduling, workloads, and distributed ML
Infrastructure Automation & Tooling (Expected)
● Terraform, Ansible, CI/CD
● GitHub Actions, ArgoCD
● Prometheus, Grafana, ELK, OpenTelemetry
Preferred Experience
● Built or operated IaaS/PaaS systems
● Experience with large-scale storage systems (Ceph, Lustre, or custom object stores)
● Knowledge of vLLM, TensorRT-LLM, TGI, or other LLM-serving frameworks
● Experience building infra for ML, training, inference, or fine-tuning
Responsibilities
● Perform architecture & research for distributed and decentralized AI workloads.
● Build and maintain foundational infrastructure powering training, inference, and fine-tuning.
● Contribute to core, open-source platform components.
● Own end-to-end services from design → implementation → operations.
● Create testing frameworks for robustness, failover, and performance.
● Collaborate across hardware, product, and ML teams to design next-gen infra.
Who You Are
● A deeply technical engineer who thrives in complex systems work.
● Strong communicator who writes clear design docs.
● Curious, low-ego, and great at collaborating with cross-functional teams.
● Motivated by building world-class AI infrastructure from the ground up.
● Thrives in zero-to-one, fast-moving startup environments.
Compensation
● Competitive salary
● Meaningful early equity
● Benefits
● Salary determined by experience and location.