Lead Site Reliability Engineer

5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of Site Reliability Engineering experience or related experience

Work alongside developers and business stakeholders to automate acceptance criteria
Maintain high reliability and availability for software applications
Automate mundane tasks to reduce human error
Define SLIs (service level indicators) and SLOs (service level objectives) in collaboration with product owners
Lead incident response efforts and post-mortem analysis to prevent future occurrences
Write incident root cause analyses that identify the underlying cause of each issue and the steps needed to prevent recurrence
Document procedures, best practices and troubleshooting FAQs
Debug the system and fix production-related issues
Escalate and follow up on permanent fixes for development-related issues
Handle complex operational tasks and recommend process and technology changes
Provide global support, including troubleshooting production-related issues and performing checkouts

Strong understanding of REST APIs
Strong working knowledge of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
Strong experience with API management tools such as Apigee
Working knowledge of databases such as MongoDB and Oracle
Strong foundation in reliability engineering principles and distributed systems behavior
Experience defining and implementing SLOs/SLIs and using them to drive system improvements
Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
Understanding of modern observability practices, with experience implementing and maintaining monitoring solutions in the cloud such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK
Strong incident response skills with experience leading incident retrospectives and driving improvements
Excellent problem-solving abilities and experience debugging distributed systems
Track record of successfully automating operations and reducing toil
Strong communication skills, with the ability to explain complex technical concepts to diverse audiences
Ability to work both independently and collaboratively in an energetic and diverse team environment