Sr. SRE Manager – Site Reliability
Company | Lucid Motors |
---|---|
Location | Newark, CA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior, Expert or higher |
Requirements
- 10+ years of experience in Site Reliability Engineering or a related field, with 3+ years in a leadership role.
- Deep expertise in managing AWS infrastructure, including EC2, S3, RDS, Lambda, CloudWatch, and others.
- Strong experience with Kubernetes (EKS or OKE) for orchestrating containerized applications at scale.
- Proven expertise with CI/CD tools, especially GitLab, Argo CD, or similar.
- Experience with Grafana, Prometheus, or other monitoring tools to provide real-time observability and performance insights.
- Solid understanding of Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Strong programming or scripting skills (e.g., Python, Bash, Go).
- Excellent problem-solving skills with a focus on root cause analysis and continuous improvement.
- Ability to collaborate and communicate effectively across technical and non-technical teams.
Responsibilities
- Lead and mentor a team of SREs, providing guidance, technical support, and career development.
- Build and scale the SRE team, hiring top talent and fostering a collaborative, results-driven environment.
- Own the strategy for ensuring service reliability, availability, and scalability in production environments.
- Work with leadership and cross-functional teams to set performance goals and track progress towards reliability metrics (SLOs, SLIs, SLAs).
- Champion a DevOps culture by promoting collaboration between engineering, operations, and product teams.
- Design and implement scalable, resilient, and secure cloud infrastructure on AWS, with a focus on high availability and fault tolerance.
- Lead automation initiatives for infrastructure provisioning, application deployment, and monitoring using tools such as Terraform, Ansible, and CloudFormation.
- Manage and optimize Kubernetes clusters (EKS or OKE) for container orchestration and ensure the smooth operation of microservices.
- Implement and continuously improve CI/CD pipelines using tools like GitLab, Argo CD, and others, enabling seamless, automated deployments.
- Develop and enforce best practices for monitoring, alerting, and incident management using tools such as Grafana, Prometheus, and CloudWatch.
- Drive the definition and implementation of reliability best practices, focusing on minimizing downtime.
- Lead incident response, managing post-mortem analysis and implementing improvements to avoid recurrence.
- Ensure effective disaster recovery plans are in place, regularly tested, and aligned with business continuity objectives.
- Develop and track operational metrics that reflect the health and reliability of production systems.
- Implement and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) for cloud systems.
Preferred Qualifications
- Experience with Oracle Cloud Infrastructure (OCI) and an understanding of multi-cloud strategies and hybrid cloud environments.
- Knowledge of OpenVPN or other VPN solutions for secure network access and infrastructure management.
- Hands-on experience with EMQX, especially for real-time data streaming or messaging in IoT or event-driven applications.
- Knowledge of Argo CD for continuous delivery and GitOps-based operations.
- Experience with data engineering practices, including data pipelines, ETL processes, and integration with big data or analytics platforms.