Staff Site Reliability Engineer

Company: Illumio
Location: Sunnyvale, CA, USA
Salary: $192,000 – $230,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field; or equivalent work experience
  • 6+ years of relevant SRE, DevOps, Platform, or Infrastructure Engineering experience
  • 4+ years in a production support role in a fast-paced industry or organization
  • Experience deploying, tuning, and maintaining Linux-based, highly available, fault-tolerant web platforms in public cloud providers such as AWS, Azure, and GCP
  • Experience with common monitoring, log aggregation, and metrics-gathering platforms (Icinga, Sensu, Splunk, Telegraf/InfluxDB, etc.)
  • Experience with configuration management and orchestration tools such as Chef, Ansible, and AWS services and APIs, or equivalent
  • Experience scripting/coding with Python, Java, Ruby and/or Go
  • Experience with MySQL, PostgreSQL, Redis, or similar
  • Solid knowledge of the Linux operating system (Ubuntu, RHEL, OEL7) is required
  • Experience with managed Kubernetes services such as EKS and/or AKS
  • Knowledge of or experience with incident management and on-call tooling such as PagerDuty
  • Knowledge of database technologies, release management, REST, and SRE practices
  • Knowledge of load balancers and traffic managers
  • Experience working with Kubernetes, Docker, or other virtualization & containerization technologies
  • Networking basics and troubleshooting skills
  • Good understanding of production deployment and distributed environments is required
  • Strong problem-solving and operational process skills, with attention to detail
  • Application support and debugging experience in a dynamic, fast-paced production environment
  • Experience with SDLC principles, architecture, and operations
  • Experience working with senior leadership both inside and outside of engineering
  • Ability to manage multiple tasks and competing priorities to deliver projects on schedule

Responsibilities

  • Drive reliability improvements back into applications
  • Build code to resolve reliability and resiliency issues
  • Mentor and educate team members to aid in strengthening technical expertise
  • Collaborate closely with cloud architects to drive cloud solutions
  • Curate appropriate SLIs and SLOs to accurately measure and assess error budgets
  • Embed with the development teams to assist with cloud methodologies when developing products to ensure that the deliverable is as reliable as possible
  • Work with development teams to build and strengthen application security and compliance
  • Manage high-impact situations that involve technically challenging issues across diverse audiences, and drive them to root cause, mitigation, and a solution
  • Focus on observability

Preferred Qualifications

  • Cloud certifications, such as Azure Administrator or Azure Developer, or equivalent AWS/GCP certifications, are a plus