Lead Site Reliability Engineer

Company: Wells Fargo
Location: Charlotte, NC, USA
Salary: Not provided
Type: Full-Time
Experience Level: Senior

Requirements

  • 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of Site Reliability Engineering experience or related experience

Responsibilities

  • Work alongside developers and business stakeholders to automate acceptance criteria
  • Maintain high reliability and availability for software applications
  • Automate mundane tasks to reduce human error
  • Define service level indicators (SLIs) and service level objectives (SLOs) in collaboration with product owners
  • Lead incident response efforts and post-mortem analysis to prevent future occurrences
  • Write incident root cause analyses that identify the underlying cause of each issue and the steps needed to prevent recurrence
  • Document procedures, best practices and troubleshooting FAQs
  • Debug systems and fix production-related issues
  • Escalate and follow up on permanent fixes for development-related issues
  • Handle complex operational tasks and recommend process and technology changes
  • Provide global support, including troubleshooting production-related issues and performing checkouts

Preferred Qualifications

  • Strong understanding of REST APIs
  • Strong working knowledge of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
  • Strong experience with API management tools such as Apigee
  • Working knowledge of databases such as MongoDB and Oracle
  • Strong foundation in reliability engineering principles and distributed systems behavior
  • Experience defining and implementing SLOs/SLIs and using them to drive system improvements
  • Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
  • Understanding of modern observability practices and experience implementing and maintaining cloud-based monitoring solutions such as Prometheus/Grafana, Splunk, New Relic, CloudWatch, and ELK
  • Strong incident response skills with experience leading incident retrospectives and driving improvements
  • Excellent problem-solving abilities and experience debugging distributed systems
  • Track record of successfully automating operations and reducing toil
  • Strong communication skills with ability to explain complex technical concepts to diverse audiences
  • Ability to work both independently and collaboratively in an energetic and diverse team environment