Sr. SRE Manager - Site Reliability

Company	Lucid Motors
Location	Newark, CA, USA
Salary	Not Provided
Type	Full-Time
Degrees
Experience Level	Senior, Expert or higher

Requirements

10+ years of experience in Site Reliability Engineering or a related field, with 3+ years in a leadership role.
Deep expertise in managing AWS infrastructure, including EC2, S3, RDS, Lambda, CloudWatch, and others.
Strong experience with Kubernetes (EKS or OKE) for orchestrating containerized applications at scale.
Proven expertise with CI/CD tools such as GitLab, Argo CD, or similar.
Experience with Grafana, Prometheus, or other monitoring tools to provide real-time observability and performance insights.
Solid understanding of Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
Strong programming or scripting skills (e.g., Python, Bash, or Go).
Excellent problem-solving skills with a focus on root cause analysis and continuous improvement.
Ability to collaborate and communicate effectively across technical and non-technical teams.

Responsibilities

Lead and mentor a team of SREs, providing guidance, technical support, and career development.
Build and scale the SRE team, hiring top talent and fostering a collaborative, results-driven environment.
Own the strategy for ensuring service reliability, availability, and scalability in production environments.
Work with leadership and cross-functional teams to set performance goals and track progress towards reliability metrics (SLOs, SLIs, SLAs).
Champion a DevOps culture by promoting collaboration between engineering, operations, and product teams.
Design and implement scalable, resilient, and secure cloud infrastructure on AWS, with a focus on high availability and fault tolerance.
Lead automation initiatives for infrastructure provisioning, application deployment, and monitoring using tools such as Terraform, Ansible, and CloudFormation.
Manage and optimize Kubernetes clusters (EKS or OKE) for container orchestration and ensure the smooth operation of microservices.
Implement and continuously improve CI/CD pipelines using tools like GitLab, Argo CD, and others, enabling seamless, automated deployments.
Develop and enforce best practices for monitoring, alerting, and incident management using tools such as Grafana, Prometheus, and CloudWatch.
Drive the definition and implementation of reliability best practices, focusing on minimizing downtime.
Lead incident response, managing post-mortem analysis and implementing improvements to avoid recurrence.
Ensure that effective disaster recovery plans are in place, are regularly tested, and meet business continuity objectives.
Develop and track operational metrics that reflect the health and reliability of production systems.
Implement and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) for cloud systems.

Preferred Qualifications

Experience with Oracle Cloud Infrastructure (OCI) and an understanding of multi-cloud strategies and hybrid cloud environments.
Knowledge of OpenVPN or other VPN solutions for secure network access and infrastructure management.
Hands-on experience with EMQX, especially for real-time data streaming or messaging in IoT or event-driven applications.
Knowledge of Argo CD for continuous delivery and GitOps-based operations.
Experience in Data Engineering practices, including data pipelines, ETL processes, and integration with big data or analytics platforms.