Lead Site Reliability Engineer
Company | Wells Fargo |
---|---|
Location | Charlotte, NC, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 5+ years of Site Reliability Engineering experience or related experience
Responsibilities
- Work alongside developers and business stakeholders to automate acceptance criteria
- Maintain high reliability and availability for software applications
- Automate mundane tasks to reduce human error
- Define service level indicators (SLIs) and service level objectives (SLOs) in collaboration with product owners
- Lead incident response efforts and post-mortem analysis to prevent future occurrences
- Write incident root cause analyses that identify the underlying cause of each issue and how to prevent it from recurring
- Document procedures, best practices and troubleshooting FAQs
- Debug the system and fix production-related issues
- Escalate and follow up on permanent fixes for development-related issues
- Handle complex operational tasks and recommend process and technology changes
- Provide global support, including troubleshooting production-related issues and performing checkouts
Preferred Qualifications
- Strong understanding of REST APIs
- Strong working knowledge of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
- Strong experience with API management tools such as Apigee
- Working knowledge of databases such as MongoDB and Oracle
- Strong foundation in reliability engineering principles and distributed systems behavior
- Experience defining and implementing SLOs/SLIs and using them to drive system improvements
- Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
- Understanding of modern observability practices and experience implementing and maintaining monitoring solutions such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK in the cloud
- Strong incident response skills with experience leading incident retrospectives and driving improvements
- Excellent problem-solving abilities and experience debugging distributed systems
- Track record of successfully automating operations and reducing toil
- Strong communication skills with the ability to explain complex technical concepts to diverse audiences
- Ability to work both independently and collaboratively in an energetic and diverse team environment