Principal Site Reliability Engineer
We’re seeking an experienced Principal Site Reliability Engineer to lead infrastructure strategy, reliability, and compliance efforts within FedRAMP Moderate environments.
This is a unique opportunity to join an innovative, tech-first organization at a pivotal stage of growth. You’ll work closely with senior leadership to design, implement, and operate highly available systems that meet stringent security and compliance requirements while maintaining the velocity and innovation of a modern tech startup.
The Role
As the Principal SRE, you will be the technical authority driving operational excellence across cloud infrastructure. You’ll design scalable systems, establish best practices in reliability engineering, and ensure that FedRAMP compliance is deeply integrated into every layer of the platform.
You’ll collaborate across platform, security, and product teams to deliver infrastructure that’s both resilient and compliant, while mentoring engineers and shaping long-term platform strategy.
Key Responsibilities
- Lead design, deployment, and optimization of production infrastructure across multi-cloud environments (AWS preferred).
- Oversee and advance FedRAMP Moderate compliance initiatives, working closely with internal security teams and external 3PAOs.
- Define and enforce best practices for Infrastructure as Code (Terraform or Pulumi), GitOps, and modern CI/CD pipelines.
- Enhance system reliability and performance for containerized workloads (ECS/EKS), focusing on scalability, observability, and security.
- Develop automation and tooling using languages such as C++, Python, Go, or .NET, building systems rather than just integrating them.
- Implement comprehensive observability frameworks (metrics, logs, traces) to support performance optimization and cost governance (FinOps).
- Apply strong security engineering principles across encryption, IAM, and key management under NIST, CIS, STIG, and FIPS 140-2/3 standards.
What We’re Looking For
- 10+ years of experience building and operating production-grade infrastructure.
- 5+ years in technical leadership roles within SRE, DevOps, or platform engineering.
- Proven FedRAMP Moderate experience and direct collaboration with 3PAOs.
- Expertise in Terraform or Pulumi, and strong understanding of GitOps and CI/CD.
- Proficiency with AWS, and familiarity with multi-cloud strategies.
- Deep understanding of container orchestration (ECS/EKS) and container security.
- Strong coding abilities in at least one of C++, Python, Go, or .NET, plus Bash scripting.
- Excellent written communication skills and a history of producing clear technical design documents.
- Demonstrated mentorship of senior engineers and leadership in complex technical initiatives.
Nice-to-Have Experience
- Exposure to data pipelines, real-time video, or machine learning workloads.
- Familiarity with service mesh technologies (e.g., Istio) and incident command best practices.
- Previous experience in startup environments, where adaptability and execution speed are key.
Why This Role?
This is an opportunity to play a pivotal role in shaping the infrastructure and reliability strategy of an emerging technology leader. You’ll be joining a company that combines the fast-paced innovation of a startup with the rigor of FedRAMP compliance. This is an ideal environment for an experienced Principal SRE ready to lead at scale.