Site Reliability Engineer

5+ years of professional experience in a fast-paced SaaS or similar business environment
3+ years of hands-on experience supporting production systems as a Site Reliability Engineer (SRE) or a DevOps Engineer
3+ years of hands-on experience with cloud services and technologies (GCP, AWS, Azure, etc.)
Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
Proficiency with Infrastructure as Code (IaC) tools and methodologies (e.g., Terraform, Pulumi, Puppet)
Proven ability to troubleshoot and resolve complex technical issues in distributed systems
Ability to communicate effectively within the team and across the organization, sharing insights and updates and collaborating to achieve project goals

Implement and maintain monitoring systems to proactively identify and address potential issues before they impact users.
Automate repetitive tasks and processes, such as deployments, infrastructure management, and incident response, to improve efficiency and reduce manual effort.
Respond to incidents, diagnose problems, and implement solutions to restore service quickly.
Improve the performance and scalability of systems and applications, ensuring they can handle peak loads and user traffic.
Help plan future capacity needs, ensuring that systems can accommodate growth and evolving requirements, while remaining cost-efficient.
Work closely with development teams to understand their needs, guide them, and ensure that systems are designed and deployed reliably.
Build tools to codify and automate infrastructure operations.
Define and track SLIs and SLOs to measure the performance and reliability of services.
Assess and mitigate risks associated with deployments and infrastructure changes.
Assist with the release and deployment processes, ensuring that changes are rolled out smoothly and reliably.

Advanced working knowledge of GCP services such as GKE, GCS, and IAM
Professional experience supporting containerized Java/JVM/Python services
Experience with relational databases, particularly PostgreSQL
3+ years of professional experience designing, implementing, and/or administering CI/CD solutions (e.g., GitHub Actions, Buildkite, Jenkins)
Strong SRE mindset with a focus on cloud networking and security best practices
Strong software development skills, particularly with scripting languages (e.g., Python, Bash)
Experience with system administration in general and Linux in particular
Familiarity with the SOC 2 and ISO 27001 security frameworks
Preference for someone in the Pacific / Mountain time zone