Sr Site Reliability Engineer

PeopleConnect • Bellevue, Washington, United States

Position Summary:

As a Senior Site Reliability Engineer at Classmates.com, you will be responsible for designing, implementing, and maintaining the infrastructure and systems necessary to support our applications and services. You will work closely with cross-functional teams to drive operational excellence, automate processes, and continuously improve system reliability. You will be a specialist on complex technical and business matters. Your expertise in cloud technologies, automation, and performance optimization will be key to the success of our engineering and operations efforts. In this role you will be collaborating heavily with the team, oftentimes multitasking, and consistently driving projects to completion.

The position requires a good mix of steadfast persistence, innovative thinking, the ability to interpret performance data, and good people skills. If you can do all that while having fun, even better!

Local-area candidates are encouraged to apply. Please note that we are not able to offer visa sponsorship, visa transfer, or corp-to-corp arrangements.

This is a hybrid position that requires 2-3 days in the office in Bellevue, WA.

Key Responsibilities:

Cloud Strategy and Architecture

Provide thought leadership, mentorship, and technical vision related to site reliability, DevOps, and a ‘cloud-first’ culture.
Analyze and implement cloud services to meet business goals, focusing on cost optimization, efficiency, and scalability.
Drive orchestration efforts for cloud services, design self-service aspects, and stay updated with emerging cloud technologies.

Infrastructure Automation and Design

Collaborate on designing, building, and maintaining scalable infrastructure across cloud and on-prem environments.
Automate provisioning and configuration using tools like Terraform, Terragrunt, and Puppet.
Develop automation scripts, maintain CI/CD pipelines, and plan for scalability and capacity, conducting load testing as needed.

Reliability and Performance Engineering

Ensure system reliability, availability, and performance through monitoring, alerting, and incident response.
Implement and manage SLOs/SLIs to meet reliability standards.
Identify and address performance bottlenecks across the infrastructure and application stack.
Build and maintain observability solutions (e.g., monitoring, logging, and tracing) and improve system health dashboards.

Security and Compliance

Implement security measures for cloud-native applications and ensure compliance with industry standards (SOC 2, PCI, etc.).
Collaborate with security teams to audit and monitor systems, continuously updating security configurations and dashboards.

Incident Management and Root Cause Analysis

Participate in on-call rotations to provide 24/7 support for the production environment.
Lead incident response activities and perform root cause analysis to prevent recurring incidents.
Conduct and document post-incident retrospectives (postmortems) to drive continuous improvement.
Create and maintain runbooks and operational documentation.
Proactively test system resilience through Chaos Engineering experiments and failure injection.

Disaster Recovery and Business Continuity

Design and test disaster recovery (DR) and business continuity strategies, ensuring backup and failover mechanisms are effective.

Cost Management and Financial Optimization

Monitor cloud usage and implement financial optimization practices (FinOps) to control infrastructure costs.
Collaborate with stakeholders to drive financial efficiency.

Collaboration, Knowledge Sharing, and Communication

Collaborate across teams to ensure alignment and effective project implementation.
Communicate during incidents and changes, providing transparency to stakeholders.
Mentor and share knowledge with team members to foster a collaborative and continuous learning environment.
Maintain comprehensive documentation of system configurations, processes, and best practices.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.
Proficiency in AWS and containerization technologies like Kubernetes and Docker.
Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.
Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).
Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.
Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.
Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.
Strong communication and collaboration skills.
Bonus skills: experience with Rundeck, Java, Spring Framework, Terragrunt, Puppet, Vector, Loki, VictoriaMetrics, and additional cloud platforms (e.g., GCP, Azure), as well as relevant certifications such as AWS Solutions Architect or Certified Kubernetes Administrator (CKA).

Classmates
Classmates is the premier online, social, and mobile destination for reconnecting with the people from your high school years. Classmates offers the largest digitized collection of high school yearbooks online, with over 450,000 available to view, tag, sign, and share, and has the most comprehensive directory of high schools and class lists from the 1940s to today.

Salary Range:

Min: $152,700
Mid: $170,800
Max: $190,600

The pay range reflects the salary amount the Company reasonably expects to pay for the position. It is not a guarantee of actual compensation or a specific payment amount to any candidate. The actual compensation will depend on numerous factors including, without limitation, a particular candidate’s experience and qualifications.

The Company's Applicant and Worker Privacy Notice can be found here.

PeopleConnect is an equal opportunity employer.

Note for Principal Agencies - Principal agents should not forward resumes to PeopleConnect, as we will not be responsible for any fees arising from the use of resumes submitted from agencies without a prior written and signed agreement and authorized job order for this position in place.
