CLI Test and Integration Engineer (Chaos testing, integration tests)
Full-Time · Remote or Hybrid · High-Impact Role
Odyn is at the forefront of AI innovation, building transformative AI solutions through cutting-edge,
high-performance infrastructure. We're seeking a CLI Test and Integration Engineer to design and
execute chaos engineering experiments, build integration test suites, and ensure our GPU
infrastructure withstands real-world failure scenarios.
Chaos Engineering & Testing
● Build hypothesis-driven chaos experiments using Gremlin, Chaos Monkey, LitmusChaos, or
AWS FIS to inject controlled failures across GPU infrastructure, schedulers, API gateways,
and storage layers.
● Design automated integration tests for distributed AI infrastructure components and
end-to-end workflows.
● Build CLI testing frameworks for developer and operator tools, validating behavior across
environments and edge cases.
CI/CD & System Validation
● Embed chaos, integration, and CLI tests into CI/CD pipelines (GitHub Actions, GitLab CI,
ArgoCD, Jenkins) with intelligent orchestration and automated rollback.
● Test platform behavior under network partitions, node failures, high-load scenarios, and
degraded performance.
● Validate failover mechanisms, data replication, and observability systems during failures.
Collaboration & Culture
● Partner with SRE, infrastructure, and backend teams to improve system resilience and
testability.
● Conduct architecture reviews to identify weaknesses and create incident response
documentation.
Requirements
● 5–7+ years in test automation, chaos engineering, SRE, or distributed systems testing.
● Hands-on chaos engineering experience (Gremlin, Chaos Monkey, LitmusChaos, AWS FIS).
● Strong integration testing experience with distributed systems and cloud-native architectures.
● Proficiency in Python and/or Go; deep experience with pytest, Robot Framework, Playwright,
or similar.
● Kubernetes expertise and cloud platform experience (AWS/GCP/Azure).
● CI/CD pipeline integration and strong Linux/Unix skills.
Nice to Have
● GPU workload, HPC, or AI/ML infrastructure testing experience.
● High-performance networking (InfiniBand, RoCE, NVLink) or GPU schedulers (Kubernetes,
Slurm, Ray).
● Observability stacks (Prometheus, Grafana, OpenTelemetry) or infrastructure-as-code
(Terraform, Ansible).
● Prior experience at Netflix, Google, AWS, or AI infrastructure startups.
Why Join Us
● Shape the reliability of a cutting-edge AI infrastructure platform from the ground up.
● Work at the frontier of chaos engineering applied to GPU infrastructure and distributed AI
systems.
● Collaborate with world-class SRE and infrastructure teams.
● Competitive compensation + remote flexibility.
We strongly encourage applications from candidates with chaos engineering or distributed systems
testing experience on GPU clusters, Kubernetes, or AI/ML platforms.