Site Reliability Engineer III

Formal training or certification in Site Reliability concepts and 3+ years of applied experience.
Proficiency in at least one programming language such as Python, Java/Spring Boot, or .NET.
Strong knowledge of cloud platforms (e.g., AWS, Azure, Google Cloud) and virtualization technologies.
Experience with monitoring and logging tools such as Prometheus, Grafana, Splunk, Dynatrace, and Datadog.
Proficiency in continuous integration and continuous delivery (CI/CD) and infrastructure-as-code tools such as Jenkins, GitLab, or Terraform.
Experience with container and container orchestration technologies such as ECS, Kubernetes, and Docker.
Experience implementing and managing error budgets and familiarity with site reliability culture and principles.
Proficiency in software applications and technical processes within disciplines such as cloud and artificial intelligence.
Familiarity with troubleshooting common networking technologies and issues.
Ability to contribute to large and collaborative teams by presenting information logically and effectively.
Ability to proactively recognize roadblocks, identify new technologies, and implement innovative solutions to solve business problems.

Collaborate with cross-functional teams to design, implement, and maintain scalable and reliable systems.
Develop and maintain monitoring, alerting, and incident response systems to ensure high availability, performance, and quality.
Implement and manage service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) to measure and improve system reliability and customer satisfaction.
Utilize error budgets to manage delivery and prioritize reliability improvements for the applications and platforms we own and support.
Automate repetitive tasks and processes to reduce toil and minimize manual intervention.
Participate in on-call rotations to provide 24/7 support for critical systems and services.
Conduct root cause analysis and blameless post-mortem reviews to prevent future incidents and improve system reliability.