Juliana Torrisi, Recruiter, is in direct contact with the company and can answer any questions you may have.

About the role:
We're seeking a DevOps and MLOps engineer to manage deployments, pipelines, and cloud infrastructure across AWS and GCP. This is an advanced role for someone deeply familiar with infra-as-code and deploying LLM pipelines at scale.
Core Requirements:
- Proficiency in infrastructure diagramming and architecture documentation
- 8+ years in DevOps / CloudOps, including containerized deployments
- Expert in Terraform, GitHub Actions, AWS CDK, and GCP deployment workflows
- Thorough knowledge of AWS CodePipeline
- Infrastructure ownership across AWS (IAM, ECS/Fargate, S3, Bedrock) and GCP (Cloud Run, Cloud Batch, Firestore), including hands-on management of batch jobs
- Solid experience with Docker, CI/CD pipelines, and automated testing
- Prior experience deploying generative AI models (e.g., Bedrock, Hugging Face, Gemini, Claude) using services like LiteLLM
- Expertise in security best practices: IAM policies, secrets management, intrusion detection
- Deep knowledge of cost optimization for AI workloads in production
- Experience with AWS CloudWatch, AWS Cost Explorer, or similar monitoring tools
- Experience implementing cost-tracking and model-spend monitoring (e.g., via LiteLLM)
- Familiarity with prompt management systems such as Amazon Bedrock Prompt Management
- Ability to set and manage model-specific quotas (e.g., Gemini 2.5 token and request limits)
- Ability to integrate new multimodal model endpoints (e.g., Imagen 3, Google Multimodal, grok-3-beta)
Additional Required Experience:
- Fluent in switching and deploying across model providers (OpenAI, Google, Anthropic, DeepSeek)
- Experience deploying secure HTTP(S) transports and server-sent events (SSE)
- Ability to debug image-generation, prompt-optimization, and vector-database fallbacks (e.g., switching from Weaviate to ChromaDB)
Must be fluent in:
- Multimodal pipelines and data compliance (TLS, AES-256, SOC2-aligned tools)
- Rapid model replacement and deprecation workflows (e.g., aliasing GPT-4o to GPT-4.1)
- Monitoring and logging for file processing issues across distributed systems
- Collaborating with ML and backend teams to deploy pipelines across multiple cloud environments