Senior Director – Operations and Reliability Engineering
Company | Boston Consulting Group |
---|---|
Location | Boston, MA, USA; Stratford, London, UK |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior, Expert or higher |
Requirements
- 15+ years of experience in IT operations, SRE, DevOps, or platform engineering.
- 5+ years in a senior leadership role, managing large-scale IT environments.
- Deep technical expertise in cloud computing (AWS, Azure, GCP), on-prem infrastructure, and hybrid environments.
- Proven track record in end-to-end automation, Infrastructure as Code (IaC), and large-scale observability.
- Experience in AI-driven IT operations, predictive analytics, and automated remediation.
- Strong understanding of zero-trust security, regulatory compliance, and risk management.
- Excellent leadership, communication, and stakeholder management skills.
Responsibilities
- Define and execute a modern Reliability Engineering strategy, integrating SRE, DevOps, and automation-first operational models.
- Drive end-to-end automation to eliminate toil, improve efficiency, and enhance operational resilience.
- Lead the transition from traditional IT operations to a proactive, AI-driven, self-healing infrastructure.
- Establish a global observability, telemetry, and predictive analytics framework for real-time insights.
- Align operational strategies with business goals, ensuring IT supports digital transformation initiatives across BCG Core, BCG X, and CT.
- Oversee global IT infrastructure, cloud platforms, and hybrid hosting environments across all BCG business units.
- Manage network reliability, compute platforms, and cloud-native services across AWS, Azure, and GCP.
- Scale Infrastructure as Code (IaC), automated provisioning, and cloud workload optimization.
- Drive edge computing, containerized workloads, and high-performance computing strategies.
- Implement AI-driven monitoring, self-healing automation, and full-stack observability.
- Mandate and ensure the adoption of IT Service Management (ITSM) processes across all teams, standardizing service delivery and making it more efficient.
- Establish SRE-based operational metrics, including SLOs, SLIs, and error budgets.
- Oversee incident response, problem resolution, and root cause analysis with AI-driven remediation.
- Ensure high availability, performance, and security compliance for all enterprise services.
- Develop a follow-the-sun operational support model, ensuring 24×7 resilience and uptime across all of BCG.
- Optimize incident, change, and capacity management, ensuring alignment with ITIL best practices and automated workflows.
- Lead Service Asset and Configuration Management (SACM), ensuring accurate, real-time management of software and IT assets within the CMDB.
- Drive continuous enhancements to the CMDB, improving visibility, compliance, and lifecycle management of IT assets.
- Embed security and compliance into operational workflows with automated security controls.
- Ensure adherence to ISO 27001, NIST, SOC 2, GDPR, and cloud security best practices.
- Collaborate with cybersecurity teams to integrate zero-trust security models.
- Drive resiliency planning, disaster recovery, and business continuity initiatives.
- Optimize IT operational budgets with a cost-effective, cloud-native strategy.
- Negotiate vendor contracts, ensuring alignment with business needs and service reliability.
- Drive cost efficiency in cloud spending, SaaS platforms, and infrastructure investments.
- Build and mentor a high-performing Reliability Engineering team, fostering a culture of automation and innovation.
- Lead a team of SREs, DevOps engineers, and platform reliability experts across global squads.
- Promote a collaborative, data-driven, and proactive mindset, ensuring agility and operational resilience.
- Establish workforce development programs for AI-driven operations, automation, and modern reliability practices.
Preferred Qualifications
- Certifications: ITIL, AWS/Azure/GCP Solutions Architect, SRE Foundation, CISSP, or equivalent.
- Experience with Kubernetes, Terraform, Ansible, and AI-powered operations tools.
- Strong problem-solving abilities, with a data-driven approach to operational excellence.