Senior Site Reliability Engineer

7+ years of Systems/Applications automation and incident response in 24×7 Production Services environments.
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience.
Fluency with one or more current-generation scripting languages used by DevOps professionals (PowerShell, Python, Ruby, PHP, Perl) or Java/.NET development.
Excellent troubleshooter, utilizing a systematic problem-solving approach.
Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems.
Experience with infrastructure as code and configuration as code, using tools such as Terraform, CloudFormation, Spacelift, Chef, SaltStack, Puppet, and DSC.
Knowledge of Windows Server and/or Linux systems internals (system libraries, file systems, kernel) and client-server network protocols.
Experience with elastic scaling, fault tolerance and other cloud architecture patterns.
Proven strength in SaaS and experience with massive-scale web operations.
Experience operating on AWS or another public cloud (both PaaS and IaaS offerings).
Experience with containerization (Docker) and microservices.
Experience with Jenkins and build/deploy automation.

Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
Work closely with development teams to ensure services are resilient and highly available.
Implement proactive monitoring, alerting, trend analysis and self-healing systems.
Code and automate applications on the cloud platform.
Define and measure KPIs and SLOs.
Implement automated deployments, automated tests, and operational tools.
Collaborate with Product and Support teams to plan and deploy product releases.
Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
Partner with other SREs and lead by example, as a contributor more than a delegator.
Manage incidents during high-stress situations and under tight timelines.
Follow the incident management lifecycle. Ensure issues are well documented and follow-up tasks are completed so that incidents do not recur.

No preferred qualifications provided.