An innovative AI company, backed by leading investors, is seeking a Senior Software Engineer to help shape the foundation for decentralized AI development at scale. The company provides a cutting-edge platform that enables researchers and engineers to train state-of-the-art models collaboratively, combining distributed training infrastructure with an intuitive developer experience.
The Role
This hybrid role spans both the developer platform and the infrastructure layer, with work split across two key areas:
1. AI Workload Management Platform – Developing user-friendly tools for managing AI workloads.
2. Distributed Training Infrastructure – Building high-performance infrastructure to support large-scale model training.
Key Responsibilities
Platform Development
• Develop intuitive web interfaces for AI workload management and monitoring.
• Build REST APIs and backend services using Python.
• Implement real-time monitoring and debugging tools.
• Create user-facing features for resource allocation and job scheduling.
Infrastructure Development
• Design and implement distributed training infrastructure in Rust.
• Develop high-performance networking and coordination components.
• Automate infrastructure provisioning with tools like Ansible.
• Manage cloud resources and container orchestration (Kubernetes).
• Implement scheduling systems for heterogeneous hardware (CPU, GPU, TPU).
Technical Skills & Experience
Required Skills
Platform Development:
• Strong backend development experience in Python (FastAPI, asynchronous programming).
• Proficiency in modern frontend technologies (TypeScript, React/Next.js, Tailwind).
• Experience building developer tools and dashboards.
• Strong understanding of RESTful API design.
Infrastructure Development:
• Systems programming expertise with Rust.
• Hands-on experience with infrastructure automation (Ansible, Terraform).
• Proficiency in container orchestration (Kubernetes).
• Familiarity with cloud platforms (GCP preferred).
• Experience with observability tools (Prometheus, Grafana).
Nice to Have:
• Experience with GPU computing and ML infrastructure.
• Understanding of AI/ML model training architectures.
• Background in high-performance networking.
• Contributions to open-source infrastructure projects.
• Experience with real-time systems (WebSockets, streaming).