Staff Site Reliability Engineer

Bachelor’s degree (or equivalent) in computer science or related discipline
Minimum of 10 years’ experience developing, managing, or supporting distributed systems in a Cloud/IaaS environment, Azure preferred
Expertise in Cloud Technologies, Cloud Delivery, and CI/CD tools such as Azure DevOps, GitHub, and Jenkins
Proficiency with several scripting, automation, and object-oriented programming languages such as PowerShell, Python, Ruby, Groovy, Bash, and Java
Experience working with and managing virtual monitoring and visualization tools such as Splunk, AppDynamics, and Elastic
Solid understanding of large-scale applications, Cloud Observability, monitoring and fault management, and Network Architectures
Proven track record of researching, understanding, and effectively applying Scalability and High Availability principles
Experience coordinating between support and development teams to ensure effective delivery of monitoring services to the end-user
Experience implementing best practices and industry standards for operational monitoring aligned to ITIL
Strong communication, interpersonal, analytical, and problem-solving skills
Ability to work efficiently in a fast-paced technical environment with increasing support demands and complexity
Ability to manage multiple priorities and assigned tasks to meet deadlines and objectives
Ability to collaborate and openly share ideas with a team of like-minded professionals

Ensure a well-running production environment through a focus on reliability and a holistic view of system health
Respond to technical business requirements around availability, performance, and planned maintenance activities to ensure a well-operating solution and SLA compliance
Directly own or participate in development of automation or other engineering deliverables to support reliability objectives
Bring a strong engineering mindset and experience to deliver operational improvements, incident prevention, automation frameworks, self-service infrastructure, logging and metrics, and scorecards
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
Participate in Agile team activities such as backlog grooming, planning, daily stand-ups, and retrospectives
Keep up to date with technology, and continue to research the latest trends in the industry
Partner with Development and Infrastructure teams to improve services through rigorous, well-defined testing and release procedures, and influence the roadmap with a focus on reliability
Drive Root Cause program innovations, ensuring the completeness and quality of each analysis and achieving corrective and preventative action outcomes, in a blameless fashion

5+ years’ experience working in a large, matrix-driven corporate environment desired
Agile experience through Scrum or other similar Sprint-based delivery approaches preferred