Senior Site Reliability Engineer

Company: Spreedly
Location: United States
Salary: Not Provided
Type: Full-Time
Experience Level: Senior

Requirements

  • Hands-on experience with Datadog, OpenTelemetry, Sentry, and Sumo Logic or similar monitoring and observability platforms, with a focus on actionable metrics and alerts.
  • Proficiency in a modern programming language, with a proven ability to write clean, maintainable, and efficient code. Experience with Ruby, Rails, and Elixir is preferred.
  • Experience with AWS services, including EC2 (Ubuntu Linux), S3, and RDS.
  • In-depth knowledge of relational databases (e.g., CockroachDB, PostgreSQL) and distributed data stores such as Riak, with experience in performance optimization and query tuning. Experience with Kafka is a plus.
  • Experience applying design patterns to enhance reliability, scalability, and performance in application development.
  • Excellent problem-solving skills with experience diagnosing complex system issues in production environments.
  • Proven ability to work cross-functionally with product and application, infrastructure, and security engineering teams.
  • Strong written and verbal communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.

Responsibilities

  • Ensure the reliability, availability, and performance of Spreedly’s globally distributed payments platform, which processes $4B monthly, by improving production systems through monitoring, automation, and continuous improvement.
  • Collaborate with development teams to improve the reliability and performance of Ruby on Rails and Elixir applications.
  • Implement and maintain robust observability solutions using Datadog and OpenTelemetry, enabling proactive identification, alerting, and resolution of issues.
  • Lead incident response efforts by participating in a shared on-call rotation to maintain 24/7 system reliability, including root cause analysis, resolution, and implementing measures to prevent recurrence.
  • Develop and maintain automation tools to reduce manual intervention, streamline operations, and enhance developer productivity.
  • Monitor, analyze, and optimize the performance of relational databases, identifying and resolving bottlenecks to maintain data integrity and efficiency.
  • Lead by example, infusing modern SRE best practices and fostering a culture of reliability and performance within the engineering organization.
  • Provide technical guidance and mentorship to team members, fostering a culture of learning and collaboration.

Preferred Qualifications

  • Experience with Ruby, Rails, and Elixir.
  • Experience with Kafka is a plus.