Staff Site Reliability Engineer

Company: Blue Yonder
Location: Dallas, TX, USA
Salary: $122,219 – $189,615
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Expert or higher

Requirements

  • Bachelor’s degree (or equivalent) in computer science or related discipline
  • At least 10 years’ experience developing, managing, or supporting distributed systems in a Cloud/IaaS environment (Azure preferred)
  • Expertise in Cloud Technologies and Cloud Delivery, CI/CD tools like Azure DevOps, GitHub, Jenkins, etc.
  • Proficiency with several scripting, automation, and object-oriented programming languages such as PowerShell, Python, Ruby, Groovy, Bash, and Java
  • Experience working with and managing virtual monitoring and visualization tools such as Splunk, AppDynamics, and Elastic
  • Solid understanding of large-scale applications, Cloud Observability, monitoring and fault management, and understanding of Network Architectures
  • Proven track record of researching, understanding, and effectively applying Scalability and High Availability principles
  • Experience coordinating between support and development teams to ensure effective delivery of monitoring services to the end-user
  • Experience implementing best practices and industry standards for operational monitoring aligned to ITIL
  • Strong communication, interpersonal, analytical, and problem-solving skills
  • Ability to work efficiently in a fast-paced technical environment with increasing support demands and complexity
  • Ability to manage multiple priorities and assigned tasks to meet deadlines and objectives
  • Ability to collaborate and openly share ideas with a team of like-minded professionals

Responsibilities

  • Ensure a well-running production environment through a focus on reliability and a holistic view of system health
  • Respond to technical business requirements around availability, performance, and planned maintenance activities to ensure a well-operating solution and SLA compliance
  • Directly own or participate in development of automation or other engineering deliverables to support reliability objectives
  • Bring a strong engineering mindset and experience to achieve operational improvements, prevention of incidents, automation frameworks, self-service infrastructure, logging and metrics, and scorecards
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
  • Participate in Agile team activities such as backlog grooming, planning, daily stand-ups, and retrospectives
  • Keep up to date with technology, and continue to research the latest trends in the industry
  • Partner with Development and Infra teams to improve services through rigor and defined testing/release procedures, and influence the Roadmap with a focus on reliability
  • Drive Root Cause program innovations, ensuring completeness and quality, and achieving corrective and preventative action outcomes for each incident, in a blameless fashion

Preferred Qualifications

  • 5+ years’ experience working in a large, matrix-driven corporate environment desired
  • Agile experience through Scrum or other similar Sprint-based delivery approaches preferred