Intermediate Site Reliability Engineer

Company	PointClickCare
Location	Mississauga, ON, Canada
Salary	$109,800 – $118,100
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Senior

Requirements

Bachelor’s Degree in Computer Science, Software Engineering, or related discipline.
Experience working as a Site Reliability Engineer (SRE) (minimum 5 years).
Relevant software development, architecture, engineering, or DevOps experience (minimum 5 years).
Strong experience building and supporting cloud-based solutions; knowledge of and experience with Azure cloud infrastructure and services preferred.
Experience with virtualization and container solutions such as Docker and Kubernetes.
Familiarity with Databricks, Event Hub, Redis, Azure Service Bus, Azure Functions, and Tomcat.
Experience with Windows-based systems and Linux administration.
Experience with configuration management and deployment automation tools (e.g., Chef, Terraform, Puppet, Ansible, Jenkins, Spinnaker, ArgoCD, GitHub Actions).
Proficiency in programming languages such as Java, JavaScript and Python.
Working knowledge of database technologies (e.g., SQL Server, MySQL, PostgreSQL).
Experience with monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack, AppDynamics, Datadog).
Strong debugging and optimization skills, with the ability to automate routine tasks.
Systematic problem-solving approach with strong communication skills and a proactive mindset.
Knowledge of standard production practices, including change management and incident management (ITIL).
Experience building CI/CD pipelines and implementing blue/green and zero-downtime deployment strategies.
Troubleshooting experience with diverse hosting technologies, web servers, Java applications, operating systems, network components, and web browsers.

Responsibilities

Lead and implement SRE best practices to foster a strong SRE culture.
Coach, mentor, and develop junior team members to grow into SREs.
Lead incident response calls to troubleshoot complex system and application-level issues.
Lead root cause analyses (RCAs) to capture lessons learned and implement innovative solutions that prevent incidents from recurring.
Design automated solutions to reduce manual tasks, enhance system reliability and reduce incident response time.
Implement and improve monitoring, alerting, and logging using tools such as ELK/Kibana, Prometheus, Grafana, AppDynamics, and Datadog.
Implement, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for application services.
Collaborate with business and product owners to establish key performance indicators (KPIs).
Participate in technical training events, game day scenarios, and chaos engineering.
Provide support for a wide range of applications with a focus on increasing automation, repeatability, and consistency as well as self-healing.
Proactively improve application and infrastructure resiliency under various error and performance conditions.
Collaborate with security engineers to develop plans and automation for proactive response to new risks and vulnerabilities.
Provide architectural guidance to software development teams to enhance resiliency, efficiency, performance, and cost-effectiveness.
Implement and improve CI/CD pipelines to facilitate seamless and reliable software releases.
Participate in an on-call rotation to respond to incidents and ensure 24/7 system availability.

Preferred Qualifications

Proficiency in Linux, including experience compiling kernels, tracing syscalls, and understanding TCP.
Knowledge of open-source software and contributions to the open-source community.
Familiarity with Rhapsody and various healthcare messaging standards, such as HL7 and FHIR.