Site Reliability Engineer - SRE

Company	Abridge
Location	San Francisco, CA, USA; New York, NY, USA
Salary	Not Provided
Type	Full-Time
Experience Level	Senior, Expert or higher

Requirements

6+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling.
At least 2 years of experience as a back-end engineer focused on system performance and scalability.
Experience building on Kubernetes and scaling compute services on Kubernetes; experience with related cloud-native technologies including ArgoCD, Argo Rollouts, Istio, etc.
Experience with (or strong interest in) creating and maintaining CI/CD pipelines for both infrastructure-as-code deployments and application code deployments (Terragrunt, Atlas, ArgoCD, Octopus Deploy, Travis CI, etc.).
Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP Projects, VPC Networks, Google Kubernetes Engine, and IAM Roles, Groups and policies. Candidates without GCP experience but who have experience with Kubernetes are encouraged to apply.
Experience building software with backend languages (e.g., Python, Go, Node.js, or Rust).
Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations.
Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure.

Responsibilities

Design and implement build pipelines, branching strategies, and release management tooling to serve an engineering team that is doubling in size and rapidly increasing the volume of code it ships and must test.
Build out our observability platform as the systems continue to grow, enabling logs, metrics, and distributed tracing at scale, as well as building out dashboards with golden metrics and error budgets that allow us to keep a pulse on the system’s growth.
Partner with teams to leverage observability tooling and observability-driven development to identify and fix performance and availability bottlenecks.
Design and implement cloud security tooling and processes that strike an effective balance between engineering velocity and software and data security.
Help advocate for, design, implement, and adopt fast and scalable application testing pipelines, including end-to-end UI tests and hyperscale load tests.
Help build an entirely new cloud stack that is fully provisioned through IaC and highly secure; leverage this to build ephemeral environments and multi-tenant/multi-region production deployments.
Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization.
Bridge the gap between local development and production environments in a way that is seamless for engineers and maximizes engineering velocity and security while minimizing quality issues arising from environment drift and configuration tangles.
Evangelize, document, and train the engineering team on the solutions being built, and uplevel them on cloud-native design strategies and tools.
Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research, as we pioneer new AI-first, cloud-native-first, security-first implementations at scale.