Senior Site Reliability Engineer
Company | Intercontinental Exchange |
---|---|
Location | Provo, UT, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior, Expert or higher |
Requirements
- 7+ years of Systems/Applications automation and incident response in 24×7 Production Services environments.
- BS in Computer Science, Computer Engineering, Math, or equivalent professional experience.
- Fluency with one or more current-generation scripting languages used by DevOps professionals (PowerShell, Python, Ruby, PHP, Perl) or Java/.NET development.
- Excellent troubleshooter, utilizing a systematic problem-solving approach.
- Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems.
- Experience with infrastructure as code and configuration as code, using tools such as Terraform, CloudFormation, Spacelift, Chef, SaltStack, Puppet, and DSC.
- Knowledge of Windows Server and/or Linux systems internals (system libraries, file systems, kernel) and client-server network protocols.
- Experience with elastic scaling, fault tolerance and other cloud architecture patterns.
- Proven strength in SaaS, with experience in massive-scale web operations.
- Experience operating on AWS or another public cloud (both PaaS and IaaS offerings).
- Experience with containerization, Docker, and microservices.
- Experience with Jenkins and build/deploy automation.
Responsibilities
- Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
- Work closely with development teams to ensure services are resilient and highly available.
- Implement proactive monitoring, alerting, trend analysis and self-healing systems.
- Code and automate applications on the cloud platform.
- Define and measure KPIs and SLOs.
- Implement automated deployments, automated tests, and operational tools.
- Collaborate with Product and Support teams to plan and deploy product releases.
- Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
- Partner with other SREs and lead by example, as a contributor more than a delegator.
- Manage incidents during high-stress issues and under tight timelines.
- Follow the incident management lifecycle. Ensure issues are well documented and follow-up tasks are completed so that incidents do not repeat.
Preferred Qualifications
No preferred qualifications provided.