Senior Site Reliability Engineer

Hands-on experience with Datadog, OpenTelemetry, Sentry, Sumo Logic, or similar monitoring and observability platforms, with a focus on actionable metrics and alerts.
Proficiency in a modern programming language, with a proven ability to write clean, maintainable, and efficient code. Experience with Ruby, Rails, and Elixir is preferred.
Experience with AWS services, including EC2 (Ubuntu Linux), S3, and RDS.
In-depth knowledge of relational and distributed databases (e.g., CockroachDB, PostgreSQL, Riak), with experience in performance optimization and query tuning. Experience with Kafka is a plus.
Experience applying design patterns to enhance reliability, scalability, and performance in application development.
Excellent problem-solving skills with experience diagnosing complex system issues in production environments.
Proven ability to work cross-functionally with product teams and with application, infrastructure, and security engineering teams.
Strong written and verbal communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.

Ensure the reliability, availability, and performance of the production systems behind Spreedly’s globally distributed payments platform, which processes $4B monthly, through monitoring, automation, and continuous improvement.
Collaborate with development teams to improve the reliability and performance of Ruby on Rails and Elixir applications.
Implement and maintain robust observability solutions using Datadog and OpenTelemetry, enabling proactive identification, alerting, and resolution of issues.
Lead incident response efforts and participate in a shared on-call rotation to maintain 24/7 system reliability, including root cause analysis, resolution, and measures to prevent recurrence.
Develop and maintain automation tools to reduce manual intervention, streamline operations, and enhance developer productivity.
Monitor, analyze, and optimize the performance of relational databases, identifying and resolving bottlenecks to maintain data integrity and efficiency.
Lead by example, applying modern SRE best practices and fostering a culture of reliability and performance within the engineering organization.
Provide technical guidance and mentorship to team members, fostering a culture of learning and collaboration.