Job Description:
Are you driven to build bulletproof infrastructure that keeps high-performing engineering teams moving at full speed? Do you get excited about tackling complex challenges, automating away repetitive work, and making large-scale systems run flawlessly? We’re seeking a hands-on, forward-thinking Site Reliability Engineer (SRE) to join our infrastructure team and collaborate closely with our backend software engineers to design, operate, and scale systems that are fast, reliable, and built to last. In this role, you’ll be equal parts software engineer, systems architect, incident responder, and problem-solver — ensuring our services run seamlessly 24/7 while pushing the limits of automation, observability, and operational excellence.
Should have at least 4 of the following:
- Linux Expertise: Hands-on experience administering, configuring, and troubleshooting Linux systems in production environments, including performance tuning, process management, and networking.
- Programming & Automation: Proficiency in Python (preferred) and at least one additional language (e.g., Bash, Go, Ruby) to automate deployments, monitoring, and incident response.
- Infrastructure as Code (IaC): Experience with tools such as Terraform, Ansible, or CloudFormation to provision and manage infrastructure at scale.
- Cloud Infrastructure: Proven track record deploying and managing production workloads in AWS, GCP, or Azure, including compute, storage, and networking components.
- Monitoring & Observability: Skilled in using Prometheus, Grafana, Datadog, Splunk, or similar tools for metrics, logging, and tracing. Experience working with software engineering teams to instrument applications for observability.
- APIs & Service Integration: Experience consuming, building, and troubleshooting REST/gRPC APIs, including handling rate limits, retries, and error recovery.
- Incident Response & Troubleshooting: Ability to diagnose and resolve complex production issues under time pressure, perform root cause analysis, and lead postmortems.
- Capacity Planning & Performance Engineering: Experience forecasting resource needs, load testing, and ensuring systems can handle growth and high traffic.
- Change Management: Familiarity with safe deployment practices, CI/CD pipelines, and risk assessment for infrastructure changes.
- SLAs, SLOs, and Error Budgets: Understanding of how to define, monitor, and maintain service reliability targets and make trade-offs between reliability and feature velocity.
Good to have at least 2 of the following:
- Collaboration: Proven ability to work closely with backend software engineers, product managers, and other stakeholders to design and operate reliable systems.
- Automation-First Mentality: Strong belief in reducing manual toil and building self-healing systems.
- Calm Under Pressure: Ability to think clearly and act decisively in high-stakes situations.
- Continuous Improvement: Always looking for ways to optimize systems, processes, and workflows.
- Analytical Thinking: Ability to use data and metrics to guide decision-making and evaluate success.
- Adaptability: Comfortable learning new tools, frameworks, and approaches to meet evolving business and technical needs.