Mid-Level Software Engineer – AI Cloud & LLM Infrastructure
Full-Time · Remote or Hybrid · Founding Team Opportunity
About Us
We are building a Gen AI Acceleration Cloud: an end-to-end platform for the full generative AI lifecycle. Our focus is delivering blazing-fast LLM inference, scalable fine-tuning, and modern AI cloud infrastructure built on GPUs, SmartNICs/DPUs, and ultra-fast networking fabrics.
Our platform powers mission-critical workloads with:
● On-demand & managed Kubernetes clusters
● Slurm-based training clusters
● High-performance inference services
● Distributed fine-tuning and eval pipelines
● Global data centers & heterogeneous GPU fleets
We are looking for a mid-level Software Engineer to design, build, and scale the core systems behind our AI cloud.
What You’ll Work On
AI Cloud Infrastructure
- Develop and maintain reliable backend services running across cloud data centers.
- Assist in building automation for GPU management, VM provisioning, and high-throughput storage systems.
- Contribute to distributed systems and pipelines that support AI workloads.
LLM & GPU Virtualization Platform
- Help build the software layer for GPU clusters with modern accelerators (H100, GB200, GB300).
- Work on GPU virtualization and management (PCIe passthrough, MIG, SR-IOV) under guidance.
- Support scaling and optimization of storage and data systems for AI training datasets.
Observability, Reliability & Automation
- Contribute to monitoring and observability stacks (Prometheus, Grafana, OpenTelemetry).
- Help implement automated node lifecycle management for distributed training and inference.
- Assist in building testing frameworks for resiliency and fault tolerance.
Core Platform Engineering
- Contribute to internal and open-source platform components.
- Build developer tooling, SDKs, and documentation for platform services.
- Support research and implementation for decentralized AI workloads under senior guidance.
Requirements
- 2–5 years of production software engineering experience.
- Proficiency in at least one backend language (Golang preferred; Python or Rust also valued).
- Experience contributing to distributed systems or high-performance services.
Cloud & Systems Knowledge
- Familiarity with cloud platforms (AWS, GCP, or Azure) and distributed microservices.
- Understanding of concurrency, memory management, and high-performance I/O.
- Exposure to system design and reliability concepts.
Infrastructure / DevOps Skills (Plus)
- Experience with Kubernetes, Docker, or similar container orchestration.
- Familiarity with Terraform, Ansible, CI/CD pipelines, and monitoring tools.
Virtualization & Compute (Optional / Nice to Have)
- Exposure to GPU virtualization, CUDA, or distributed ML training stacks.
- Basic understanding of hypervisors or PCIe passthrough.
Networking (Optional / Nice to Have)
- Familiarity with VLAN/VXLAN, RDMA/InfiniBand, or high-performance networking concepts.
Responsibilities
- Build and maintain backend and infrastructure components for AI workloads.
- Collaborate with senior engineers on GPU clusters, storage systems, and virtualization platforms.
- Assist in end-to-end service delivery from design to operation.
- Contribute to testing frameworks and automation for reliability.
- Work closely with cross-functional teams including ML engineers, product, and hardware teams.
Who You Are
- A technically curious engineer who enjoys complex systems work.
- Able to communicate ideas clearly and document work for others.
- Motivated by building infrastructure that supports cutting-edge AI.
- Collaborative, adaptable, and comfortable in a fast-moving startup environment.