Site Reliability Engineer
Drive Innovation and Transformation with ClearScale's Cloud Expertise:
ClearScale, a leading AWS Premier Consulting Partner, empowers businesses to unlock the full potential of the cloud through a wide range of services, including cloud consulting, architecture design, migration, automation, application development, and managed services. We help Fortune 500 enterprises, mid-sized businesses, and startups across diverse industries like Healthcare, Finance, and Technology succeed with ambitious and transformative cloud projects. Our expertise lies in architecting, developing, and launching innovative and sophisticated solutions using the latest cutting-edge cloud technologies. Due to our continued growth and the increasing demand for our modernization and cloud-native development capabilities, we are seeking a talented and experienced AWS Hosted/Modernization Software Engineer to join our dynamic team. If you are passionate about building and modernizing applications on the AWS platform, tackling complex engineering challenges, and working with a team of top-tier cloud experts, this is your opportunity to make a significant impact.
What You'll Do:
- Execute on Observability Strategy
- Define and document standards for logging, tracing and SLO definitions for engineering teams to follow
- Propose effective ways to manage dashboards, traces, monitors, metrics and logs in Datadog
- Integrate Datadog with incident management tools and Slack
- Establish comprehensive monitoring using Datadog
- Centralize logging and develop mechanisms for efficient debugging
- Implement systems for distributed tracing visualization
- Adopt OpenTelemetry standards across microservices
- Roll out observability to development and production environments in close collaboration with engineering and operations teams
- Define training practices that help engineering teams adopt observability standards and operational practices for healthy, sustainable incident management processes
- Implement POCs and demonstrate them to engineering teams
- Introduce engineering practices for healthy alerting mechanisms, dashboard definitions and blind-spot elimination, with a focus on reducing alert fatigue
- Establish near-real-time reporting to minimize MTTA and MTTR and improve developer experience
What You'll Bring:
- Extensive experience with AWS infrastructure at scale
- Experience working in SRE, DevOps or Developer Experience teams in engineering organizations is a must
- Deep knowledge of observability tooling (Datadog, Grafana, Splunk, OTEL) and hands-on experience developing, extending and operating it across different environments, including high-load production systems
- Expert knowledge of Terraform
- Ability to propose solutions that scale across engineering teams and balance speed of response with cognitive load
- Experience leading incident response using operational tools including logging, tracing, SLO patterns and synthetics
- Experience establishing technical roadmaps from operational strategies for SRE, DevOps or Developer Experience teams in mid- to large-sized organizations, and the ability to drive their adoption across engineering teams
- Experience applying analytical practices to define SLAs in close coordination with engineering teams and stakeholders
- Deep understanding and experience advocating for and rolling out SRE best practices and standards for engineering teams
- Mindset of "minimal tooling for maximum impact"
- Experience with on-call rotations, including creating and executing scalable on-call practices in engineering teams
- Experience with integrating observability tooling with Teams and Slack
- Leadership skills to drive alignment between different departments and get buy-in from different stakeholders
- Exemplary oral and written communication skills for technical and non-technical stakeholders
- AWS certifications are a plus
Our Commitment to Your Growth and Well-being:
- Competitive salary.
- Exceptional opportunities for career growth and leadership development within a leading AWS Premier Consulting Partner.
- A collaborative, high-energy, and fully remote work culture that fosters connection and innovation.
- Continuous learning and development opportunities, including access to training and certifications.
- The flexibility and convenience of a 100% distributed workforce – work from the location that suits you best!