Site Reliability Engineer

5+ years of experience in a Site Reliability Engineering or related role
2+ years of experience focusing on improving observability and performance of applications
Mindful of the tradeoffs among infrastructure choices and how they affect uptime
Focused on delighting customers by establishing clear expectations
Experience evangelizing technical concepts is a must

Collaborate to achieve highly available and scalable Azure cloud infrastructure supporting 24/7 warehouse automation applications
Experienced in leading teams to define healthy observability practices with tools such as New Relic, Datadog, Sentry, Prometheus, and Grafana
Work with application engineers to establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for core application features
Firm understanding of root cause analysis and comfortable coaching teams to improve their existing practices
Comfortable using Terraform to ensure consistent and repeatable deployments
Experienced with cloud-native tooling, including CNCF projects such as Helm, Flux, Argo, Kubernetes, and Prometheus, as well as Grafana
Create tools and automation to streamline development workflows and enable safe, efficient application deployments
Collaborate with product squads to assess risks and develop mitigation strategies for system reliability
Implement security best practices and ensure compliance with industry standards across cloud infrastructure
Serve as a technical evangelist for reliability engineering principles and best practices across the organization
Mentor software engineers on building reliable, observable applications while continuously improving operational efficiency
