Senior DevOps Engineer – Generative AI & Cloud Infrastructure
Full-Time · Remote or Hybrid · High-Impact Role
About Us
We are building a next-generation AI cloud platform combining fast LLM inference with high-performance cloud infrastructure, GPU clusters, and developer-first APIs. Our systems power mission-critical generative AI workloads across distributed data centers and cutting-edge ML hardware.
We’re looking for a Senior DevOps Engineer to own and evolve the infrastructure backbone of our platform. You’ll work closely with infra, ML, and product engineering teams to design, automate, and operate reliable, scalable, and observable systems for AI workloads. If you love working at the intersection of DevOps, distributed systems, GPUs, and generative AI, this role is for you.
What You’ll Do
Design & Operate AI Cloud Infrastructure
- Build and maintain scalable, secure, and highly available infrastructure for LLM inference, fine-tuning, and data processing.
- Manage multi-region Kubernetes clusters running GPU-heavy workloads.
- Implement and refine autoscaling strategies for heterogeneous GPU fleets.
Infrastructure as Code & Automation
- Own infrastructure-as-code deployments with tools like Terraform, Helm, and Ansible.
- Automate provisioning of compute, networking, and storage for AI clusters.
- Build pipelines to spin up and tear down clusters for experiments, benchmarks, and customer environments.
CI/CD & Release Engineering
- Design and maintain CI/CD pipelines for backend, ML, and infra components.
- Implement safe rollout strategies (blue/green, canary, feature flags).
- Collaborate with engineers to improve build times, test reliability, and deployment velocity.
Observability, Reliability & SRE
- Build and operate observability stacks (Prometheus, Grafana, Loki/ELK, OpenTelemetry).
- Define and monitor SLOs/SLAs for latency and availability, and track error budgets across services.
- Implement playbooks, runbooks, and incident response processes for production systems.
Security, Compliance & Best Practices
- Implement best practices for secrets management, access control, and network security.
- Help design secure multi-tenant environments for enterprise customers.
- Partner with leadership to build a culture of reliability, ownership, and operational excellence.
What We’re Looking For
Must-Have
- 4–8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
- Strong experience operating production systems on AWS / GCP / Azure.
- Deep experience with Kubernetes in production (cluster management, Helm, operators, networking, storage).
- Proficiency with infrastructure-as-code (Terraform or similar).
- Strong skills in at least one scripting/programming language (Python, Go, Bash, etc.).
- Solid understanding of networking, load balancers, DNS, TLS, and security fundamentals.
- Proven track record of building reliable, observable, and automated systems.
Nice-to-Have
- Experience with GPU-based workloads (H100s, A100s, GB200s, etc.) and ML infrastructure.
- Familiarity with LLM inference stacks, ML training pipelines, or data platforms.
- Experience with service meshes and API gateways; GitHub Actions, ArgoCD, or similar CI/CD tools; and Prometheus, Grafana, Loki, Tempo, and OpenTelemetry.
- Exposure to high-throughput storage systems, object stores, or distributed filesystems.
- Prior experience in an AI infra, cloud platform, or high-scale SaaS startup.
Who You Are
- You think in systems and love reducing complexity with automation.
- You’re calm under pressure and comfortable owning production systems.
- You enjoy partnering closely with engineers and aren’t afraid to dive into code.
- You care about reliability, performance, and craftsmanship.
- You thrive in fast-moving, zero-to-one startup environments.
Why Join Us
- Work on the core infrastructure powering cutting-edge generative AI.
- Collaborate with world-class infra, ML, and product engineers.
- High ownership over architecture, tooling, and operational practices.
- Competitive compensation, equity, and strong growth potential.
- Flexible remote/hybrid environment.
How to Apply
Please reply to the application questions.