Responsibilities:
Senior Site Reliability Engineer with a proven track record of delivering software infrastructure while working closely with software engineering teams.
You will develop, maintain, and scale our infrastructure for deploying and monitoring our software using the latest tools and methodologies, including agile, CI/CD, and infrastructure as code.
Contribute to the advancement of both software development and cloud infrastructure efforts
Partner with developers to apply best practices that ensure fully working test and production environments, using logging/monitoring tools, alerting/notification tools, tools for disaster recovery, high availability, and business continuity, and any other tools that help reduce time-to-detect and time-to-mitigate
Design, build, and maintain CI/CD, testing, and operations infrastructure for our systems
Create documentation, run-books, and operational standards with a focus on automation
Support production systems and respond to operational incidents
Required Skills:
You have 5+ years of relevant experience in Site Reliability Engineering (SRE) / DevOps roles
You have 3+ years of experience with a public cloud provider (AWS, Azure, GCP)
You have previous experience with Kubernetes (K8s) and its ecosystem of tooling
You have a solid command of Linux systems and the networking stack
You have experience debugging distributed systems with observability tooling (e.g. Datadog, APM, monitoring and alerting)
You have experience automating everything you can with infrastructure automation tools (e.g. Terraform)
You have automation skills in shell scripting (Bash), Python, and/or other languages
You have experience implementing and mentoring others on best practices, including automated testing, security, continuous integration, and continuous deployment (CI/CD)
You have strong knowledge of TCP/IP networking