Staff Site Reliability Engineer
| Company | Blue Yonder |
|---|---|
| Location | Dallas, TX, USA |
| Salary | $122,219 – $189,615 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Expert or higher |

Requirements
- Bachelor’s degree (or equivalent) in computer science or related discipline
- Minimum of 10 years’ experience developing, managing, or supporting distributed systems in a cloud/IaaS environment; Azure preferred
- Expertise in cloud technologies and cloud delivery, including CI/CD tools such as Azure DevOps, GitHub, and Jenkins
- Proficiency in several scripting, automation, and object-oriented programming languages, such as PowerShell, Python, Ruby, Groovy, Bash, and Java
- Experience working with and managing virtual monitoring and visualization tools such as Splunk, AppDynamics, and Elastic
- Solid understanding of large-scale applications, cloud observability, monitoring and fault management, and network architectures
- Proven track record of researching, understanding, and effectively applying Scalability and High Availability principles
- Experience coordinating between support and development teams to ensure effective delivery of monitoring services to the end-user
- Experience implementing best practices and industry standards for operational monitoring aligned with ITIL
- Strong communication, interpersonal, analytical, and problem-solving skills
- Ability to work efficiently in a fast-paced technical environment with increasing support demands and complexity
- Ability to manage multiple priorities and assigned tasks to meet deadlines and objectives
- Ability to collaborate and openly share ideas with a team of like-minded professionals
Responsibilities
- Ensure a well-running production environment through a focus on reliability and a holistic view of system health
- Respond to technical business requirements around availability, performance, and planned maintenance activities to ensure a well-operating solution and SLA compliance
- Directly own or participate in the development of automation or other engineering deliverables that support reliability objectives
- Bring a strong engineering mindset and experience to drive operational improvements, incident prevention, automation frameworks, self-service infrastructure, logging and metrics, and scorecards
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
- Participate in Agile team activities such as backlog grooming, planning, daily stand-ups, and retrospectives
- Keep up to date with technology and continue researching the latest industry trends
- Partner with development and infrastructure teams to improve services through rigorous, well-defined testing and release procedures, and influence the roadmap with a focus on reliability
- Drive innovation in the root cause program, ensuring each analysis is complete and of high quality and yields corrective and preventive actions, all in a blameless fashion
Preferred Qualifications
- 5+ years’ experience working in a large, matrix-driven corporate environment desired
- Agile experience through Scrum or other similar Sprint-based delivery approaches preferred