Senior Site Reliability Engineer

Company: Intercontinental Exchange
Location: Provo, UT, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • 7+ years of Systems/Applications automation and incident response in 24×7 Production Services environments.
  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience.
  • Fluency with one or more current-generation scripting languages used by DevOps professionals (PowerShell, Python, Ruby, PHP, Perl) or Java/.NET development.
  • Excellent troubleshooter, utilizing a systematic problem-solving approach.
  • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems.
  • Experience with infrastructure as code and configuration as code, using tools such as Terraform, CloudFormation, Spacelift, Chef, SaltStack, Puppet, and DSC.
  • Knowledge of Windows Server and/or Linux systems internals (system libraries, file systems, kernel) and client-server network protocols.
  • Experience with elastic scaling, fault tolerance and other cloud architecture patterns.
  • Proven strength in SaaS and experience in massive-scale web operations.
  • Experience operating on AWS or another public cloud (both PaaS and IaaS offerings).
  • Experience with containerization (Docker) and microservices.
  • Experience with Jenkins and build/deploy automation.

Responsibilities

  • Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
  • Work closely with development teams to ensure services are resilient and highly available.
  • Implement proactive monitoring, alerting, trend analysis and self-healing systems.
  • Code and automate applications on the cloud platform.
  • Define and measure KPIs and SLOs.
  • Implement automated deployments, automated tests, and operational tools.
  • Collaborate with Product and Support teams to plan and deploy product releases.
  • Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
  • Partner with other SREs and lead by example, as a contributor more than a delegator.
  • Manage incidents under high-stress conditions and timelines.
  • Follow the incident management lifecycle: ensure issues are well documented and follow-up tasks are completed so that incidents do not repeat.
