Site Reliability Engineering Specialist
Company | Telesat |
---|---|
Location | Ottawa, ON, Canada |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Expert or higher |
Requirements
- Bachelor’s Degree in Computer Science or a related field
- A minimum of nine years of experience in IT operations with a focus on reliability, uptime, availability, and performance
- At least five years of hands-on, demonstrable experience with Microsoft Azure, including deployment, management, and monitoring
- Expertise in automation and configuration management, with demonstrable experience using tools such as Terraform and Ansible to automate infrastructure and application deployment
- Strong understanding of monitoring and observability, with proven experience using tools such as Prometheus, Grafana, Nagios, or Splunk, and the ability to implement and maintain observability solutions
Responsibilities
- Work closely with Telesat’s cloud engineers to deploy and maintain our Kubernetes-based infrastructure
- Help maintain high availability, uptime, and resiliency of our infrastructure
- Perform day-to-day operational tasks such as upgrades and patching of the Kubernetes platform
- Automate operational tasks
- Monitor the health of the platform and applications using Telesat’s observability platform
- Improve observability, and define and measure service level objectives (SLOs)
- Collaborate with development teams to resolve application issues
- Participate in the on-call rotation, respond to automated alerts, and execute playbooks
- Identify gaps in processes, and build or improve tools to support incident management
- Facilitate incident response and conduct root cause analysis
Preferred Qualifications
- CNCF Certified Kubernetes Administrator (CKA) certification is considered an asset for this role