Intermediate Site Reliability Engineer
Company | PointClickCare |
---|---|
Location | Mississauga, ON, Canada |
Salary | $109,800 – $118,100 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior |
Requirements
- Bachelor’s Degree in Computer Science, Software Engineering, or related discipline.
- Minimum 5 years’ experience as a Site Reliability Engineer (SRE).
- Minimum 5 years’ relevant software development, architecture, engineering, or DevOps experience.
- Strong experience building and supporting cloud-based solutions; knowledge of and experience with Azure cloud infrastructure and services preferred.
- Experience with virtualization and container solutions such as Docker and Kubernetes.
- Familiarity with Databricks, Event Hub, Redis, Azure Service Bus, Azure Functions, and Tomcat.
- Experience with Windows-based systems and Linux administration.
- Experience with configuration management and deployment automation tools (e.g., Chef, Terraform, Puppet, Ansible, Jenkins, Spinnaker, ArgoCD, GitHub Actions).
- Proficiency in programming languages such as Java, JavaScript and Python.
- Working knowledge of database technologies (e.g., SQL Server, MySQL, PostgreSQL).
- Experience with monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack, AppDynamics, DataDog).
- Strong debugging and optimization skills, with the ability to automate routine tasks.
- Systematic problem-solving approach with strong communication skills and a proactive mindset.
- Knowledge of standard production practices, including change management and incident management (ITIL).
- Experience building CI/CD pipelines and implementing blue/green and zero-downtime deployment strategies.
- Troubleshooting experience with diverse hosting technologies, web servers, Java applications, operating systems, network components, and web browsers.
Responsibilities
- Lead and implement SRE best practices to foster a strong SRE culture.
- Coach, mentor, and develop junior team members to grow into SREs.
- Lead incident response calls to troubleshoot complex system and application-level issues.
- Lead root cause analyses (RCAs) to capture lessons learned and implement solutions that prevent incidents from recurring.
- Design automated solutions to reduce manual tasks, enhance system reliability and reduce incident response time.
- Implement and improve monitoring, alerting, and logging using tools such as ELK/Kibana, Prometheus, Grafana, AppDynamics, and Datadog.
- Implement, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for application services.
- Collaborate with business and product owners to establish key performance indicators (KPIs).
- Participate in technical training events, game day scenarios, and chaos engineering.
- Provide support for a wide range of applications, with a focus on increasing automation, repeatability, consistency, and self-healing.
- Proactively improve application and infrastructure resiliency under various error and performance conditions.
- Collaborate with security engineers to develop plans and automation for proactive response to new risks and vulnerabilities.
- Provide architectural guidance to software development teams to enhance resiliency, efficiency, performance, and cost-effectiveness.
- Implement and improve CI/CD pipelines to facilitate seamless and reliable software releases.
- Participate in an on-call rotation to respond to incidents and ensure 24/7 system availability.
Preferred Qualifications
- Proficiency in Linux, including experience compiling kernels, tracing syscalls, and understanding TCP.
- Knowledge of open-source software and contributions to the open-source community.
- Familiarity with Rhapsody and various healthcare messaging standards, such as HL7 and FHIR.