Arc Exclusive

Senior Software Engineer (Generative AI Cloud Infrastructure) - Perm - US/UK/Europe

Location

Remote restrictions apply

Salary

US$120K - 200K

Min. experience

5+ years

Required skills

Golang, Microservices, Kubernetes, System design, Large scale distributed system, Terraform, Ansible, CI/CD

Full-time role
Posted 6 hours ago
Actively recruiting / 4 applicants

We’re here to help you

Sole is in direct contact with the company and can answer any questions you may have.

Sole, Recruiter

Full-Time · Remote or Hybrid · Founding Team Opportunity

About Us

We are building a Gen AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle. Our focus is to deliver blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.

Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets
We are looking for a Senior Software Engineer to design, build, and scale the core systems behind our AI cloud.

What You’ll Work On

High-Performance AI Cloud Infrastructure
● Design and maintain fault-tolerant, high-availability backend services running across global data centers.
● Build operators and automation systems for:
○ GPU management
○ InfiniBand partitioning
○ VM provisioning
○ High-throughput storage provisioning

LLM & GPU Virtualization Platform
● Build the IaaS software layer for new GPU clusters with thousands of next-gen accelerators (H100, GB200, GB300).
● Work on scalable GPU virtualization (PCIe passthrough, MIG, SR-IOV, VFIO).

Massive-Scale Storage & Data Systems
● Contribute to a global multi-exabyte, high-performance object store optimized for pretraining datasets.
● Build distributed data loaders, caching layers, metadata services, and throughput-optimized pipelines.

Observability, Reliability & Automation
● Develop advanced observability stacks (Prometheus, Grafana, OpenTelemetry).
● Design automated node lifecycle management for large-scale distributed training and inference.
● Build robust testing frameworks for resiliency, failover, and fault tolerance.

Core Platform Engineering
● Contribute to the core internal + open-source platform components.
● Write tooling, SDKs, and documentation for developer-facing services.
● Research decentralized AI workloads and build reference architectures.

Requirements

Fundamentals
● 5+ years of production software engineering experience.
● Strong proficiency in one or more backend languages (Golang highly preferred; Rust/Python also valued).
● 5+ years building high-performance, well-tested, production-grade distributed services.

Cloud & Systems Experience
● Experience with distributed microservices across AWS/GCP/Azure.
● Deep understanding of systems fundamentals:
○ Concurrency
○ Memory management
○ High-performance I/O
○ Distributed consensus
○ Large-scale system design

Kubernetes / Infrastructure Expertise (Big Plus)
● Kubernetes internals: custom operators, CRDs, schedulers, or networking/storage plugins.
● Experience with Cluster API, KubeVirt, or similar orchestration tooling.

Virtualization / Compute (Big Plus)
● Experience with hypervisors (QEMU/KVM, cloud-hypervisor).
● PCIe passthrough, SR-IOV, GPU virtualization, MIG, NVLink topologies.
● Experience with DPUs/SmartNICs.

Networking (Big Plus)
● InfiniBand / RDMA
● VLAN/VXLAN/VPC
● OVS/OVN
● High-performance DC networking

High-Performance Compute (Plus)
● CUDA, NCCL, GPU drivers, parallel training stacks
● Experience with GPU scheduling, workloads, and distributed ML

Infrastructure Automation & Tooling (Expected)
● Terraform, Ansible, CI/CD
● GitHub Actions, ArgoCD
● Prometheus, Grafana, ELK, OpenTelemetry

Preferred Experience

● Built or operated IaaS/PaaS systems
● Experience with large-scale storage systems (Ceph, Lustre, or custom object stores)
● Knowledge of vLLM, TensorRT-LLM, TGI, or other LLM-serving frameworks
● Experience building infra for ML, training, inference, or fine-tuning

Responsibilities

● Perform architecture & research for distributed and decentralized AI workloads.
● Build and maintain foundational infrastructure powering training, inference, and fine-tuning.
● Contribute to core, open-source platform components.
● Own end-to-end services from design → implementation → operations.
● Create testing frameworks for robustness, failover, and performance.
● Collaborate across hardware, product, and ML teams to design next-gen infra.

Who You Are

● A deeply technical engineer who thrives in complex systems work.
● Strong communicator who writes clear design docs.
● Curious, low-ego, and great at collaborating with cross-functional teams.
● Motivated by building world-class AI infrastructure from the ground up.
● Thrives in zero-to-one, fast-moving startup environments.

Compensation

● Competitive salary
● Meaningful early equity
● Benefits
● Salary determined by experience and location.

© Copyright 2026 Arc