Sr. SRE Manager – Site Reliability

Company: Lucid Motors
Location: Newark, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Not specified
Experience Level: Senior, Expert or higher

Requirements

  • 10+ years of experience in Site Reliability Engineering or a related field, with 3+ years in a leadership role.
  • Deep expertise in managing AWS infrastructure, including EC2, S3, RDS, Lambda, CloudWatch, and others.
  • Strong experience with Kubernetes (EKS or OKE) for orchestrating containerized applications at scale.
  • Proven expertise with CI/CD tools, especially GitLab, Argo CD, or similar.
  • Experience with Grafana, Prometheus, or other monitoring tools to provide real-time observability and performance insights.
  • Solid understanding of Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
  • Strong programming or scripting skills (e.g., Python, Bash, or Go).
  • Excellent problem-solving skills with a focus on root cause analysis and continuous improvement.
  • Ability to collaborate and communicate effectively across technical and non-technical teams.

Responsibilities

  • Lead and mentor a team of SREs, providing guidance, technical support, and career development.
  • Build and scale the SRE team, hiring top talent and fostering a collaborative, results-driven environment.
  • Own the strategy for ensuring service reliability, availability, and scalability in production environments.
  • Work with leadership and cross-functional teams to set performance goals and track progress towards reliability metrics (SLOs, SLIs, SLAs).
  • Champion a DevOps culture by promoting collaboration between engineering, operations, and product teams.
  • Design and implement scalable, resilient, and secure cloud infrastructure on AWS, with a focus on high availability and fault tolerance.
  • Lead automation initiatives for infrastructure provisioning, application deployment, and monitoring using tools such as Terraform, Ansible, and CloudFormation.
  • Manage and optimize Kubernetes clusters (EKS or OKE) for container orchestration and ensure smooth operation of microservices.
  • Implement and continuously improve CI/CD pipelines using tools like GitLab, Argo CD, and others, enabling seamless, automated deployments.
  • Develop and enforce best practices for monitoring, alerting, and incident management using tools such as Grafana, Prometheus, and CloudWatch.
  • Drive the definition and implementation of reliability best practices, focusing on minimizing downtime.
  • Lead incident response, managing post-mortem analysis and implementing improvements to avoid recurrence.
  • Ensure effective disaster recovery plans are in place, regularly tested, and meet business continuity objectives.
  • Develop and track operational metrics that reflect the health and reliability of production systems.
  • Implement and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) for cloud systems.

Preferred Qualifications

  • Experience with Oracle Cloud Infrastructure (OCI) and an understanding of multi-cloud strategies and hybrid cloud environments.
  • Knowledge of OpenVPN or other VPN solutions for secure network access and infrastructure management.
  • Hands-on experience with EMQX, especially for real-time data streaming or messaging in IoT or event-driven applications.
  • Knowledge of Argo CD for continuous delivery and GitOps-based operations.
  • Experience in Data Engineering practices, including data pipelines, ETL processes, and integration with big data or analytics platforms.